Visual Recognition in Primates and Machines
Tomaso Poggio (with Thomas Serre)
McGovern Institute for Brain Research, Center for Biological and Computational Learning, Department of Brain & Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
NIPS'07 tutorial (preliminary)



Motivation for studying vision: trying to understand how the brain works

• Old dream of all philosophers and, more recently, of AI:
  – understand how the brain works
  – make intelligent machines

This tutorial: using a class of models to summarize/interpret experimental results

• Models are cartoons of reality, e.g. Bohr's model of the hydrogen atom
• All models are "wrong"
• Some models can be useful summaries of data, and some can be a good starting point for more complete theories

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

The problem: recognition in natural images (e.g. "is there an animal in the image?")

Presenter notes:
Any task can be made arbitrarily easy or difficult…

How does visual cortex solve this problem? How can computers solve this problem?

Desimone & Ungerleider 1989

dorsal stream: "where"
ventral stream: "what"

Presenter notes:
In first approximation: two streams, different tasks. Dorsal: where (spatial); ventral: what (identity). We are interested in the ventral stream: how can the brain recognize 3D objects even under different viewpoints? → Look at what's happening in the ventral stream. How do we know all this? → What do neurons do, and how can we analyze it?

A "feedforward" version of the problem: rapid categorization

Movie courtesy of Jim DiCarlo
Biederman 1972; Potter 1975; Thorpe et al. 1996

[SHOW RSVP MOVIE]

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007
Modified from (Gross 1998)

A model of the ventral stream which is also an algorithm
[software available online]

Presenter notes:
For the past several years we have developed a computational theory of object recognition in the ventral stream of the visual cortex. The model closely follows the organization of the visual cortex (color encodes correspondence between stages/layers of the model and cortical areas). Units in the model mimic as closely as possible the tuning properties of cells in corresponding cortical areas. That is, the parameters governing the tuning properties of the units (e.g. frequency bandwidth, RF sizes, complexity of the preferred stimulus, etc.) were constrained by neural data.

…solves the problem (if mask forces feedforward processing)…

Human observers (n = 24): 80%
Model: 82%

Serre, Oliva & Poggio 2007

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter notes:
We tried simpler computer vision systems and they could not do the job.
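The d' measure cited above comes from signal detection theory; as a small illustrative sketch (the function name is mine, not from the tutorial), it can be computed from hit and false-alarm rates:

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# A higher d' means better discrimination between targets (animal present)
# and distractors, independently of response bias.
```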

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

Object recognition for computer vision: personal historical perspective

[Timeline figure, 1990-2000s: Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues, 1996-now; Osuna & Girosi 1997; LeCun et al. 1998; Rowley, Baluja & Kanade 1998; Schneiderman & Kanade 1998; Amit and Geman 1999; Mohan, Papageorgiou & Poggio 1999; Schneiderman & Kanade 2000; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004; …]

Tasks: face detection, face identification, car detection, pedestrian detection, digit recognition, multi-class / multi-objects

Best CVPR'07 paper 10 yrs ago

… Many more excellent algorithms in the past few years …

Examples: Learning Object Detection: Finding Frontal Faces

• Training database:
  – 1,000+ real, 3,000+ virtual face patterns
  – 500,000+ non-face patterns

Sung & Poggio 1995

~10-year-old CBCL computer vision work: the pedestrian detection system in a Mercedes test car is now becoming a product (MobilEye)

Object recognition in cortex: historical perspective

[Timeline figure, 1960-1990: Hubel & Wiesel 1959; Hubel & Wiesel 1962; Hubel & Wiesel 1965; Gross et al. 1969; Zeki 1973; Hubel & Wiesel 1977; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995. Areas: V1 cat, V1 monkey, extrastriate cortex, IT-STS]

… Much progress in the past 10 yrs

Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex

Poggio & Edelman 1990
Examples of visual stimuli

An idea for a module for view-invariant identification

Architecture that accounts for invariances to 3D effects (>1 view needed to learn)

Regularization network (GRBF) with Gaussian kernels

[Figure: a view-invariant, object-specific output unit pooling view-tuned units along the view-angle axis]

Prediction: neurons become view-tuned through learning

Poggio & Edelman 1990
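As a hedged illustration of the GRBF module (function and parameter names are mine, not from the paper): each hidden unit stores one training view and responds with Gaussian tuning; a weighted sum over hidden units approximates a view-invariant, object-specific response.

```python
import numpy as np

def grbf(x, stored_views, weights, sigma=1.0):
    """View-invariant unit built from Gaussian view-tuned units.

    Each hidden unit responds maximally when input x matches its stored
    view; the weighted sum approximates a view-invariant response.
    """
    x = np.asarray(x, dtype=float)
    views = np.asarray(stored_views, dtype=float)
    # Gaussian tuning of each "view-tuned" hidden unit
    hidden = np.exp(-np.sum((views - x) ** 2, axis=1) / (2 * sigma ** 2))
    return float(np.dot(weights, hidden)), hidden
```

Presenting the module this way makes the prediction on the slide concrete: the hidden units, after learning, behave like the view-tuned IT cells described next.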

Learning to Recognize 3D Objects in IT Cortex

Logothetis, Pauls & Poggio 1995
Examples of visual stimuli

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units: … physiology

Recording sites in anterior IT
[Figure: recording sites; sulci labeled LUN, LAT, IOS, STS, AMTS; Ho = 0]

Logothetis, Pauls & Poggio 1995

…neurons tuned to faces are intermingled nearby…

Neurons tuned to object views, as predicted by the model

Logothetis, Pauls & Poggio 1995

A "View-Tuned" IT Cell

[Figure: responses (spikes/s, peak ~60; 800 ms window) of one IT cell to target views from -168° to +168° in 12° steps, and to 120 distractors]

Logothetis, Pauls & Poggio 1995

Presenter notes:
View-tuned cell found in the experiment; preferred-stimulus-specific view tuning.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale-Invariant Responses of an IT Neuron

[Figure: response histograms (spikes/s, 0-76; time 0-3000 ms) for the preferred object shown at sizes 1.0° (x0.4), 1.75° (x0.7), 2.5° (x1.0), 3.25° (x1.3), 4.0° (x1.6), 4.75° (x1.9), 5.5° (x2.2), 6.25° (x2.5)]

View-tuned cells: scale invariance (one training view only) motivates the present model

Logothetis, Pauls & Poggio 1995

Presenter notes:
Pretty amazing.

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

Neural Circuits
Source: modified from Jody Culham's web slides

Neuron basics

[Figure: neuron with spikes]
INPUT = pulses or graded potentials
COMPUTATION = analog
OUTPUT = chemical

Some numbers

• Human brain:
  – 10^11-10^12 neurons (a fly: ~1 million)
  – 10^14-10^15 synapses
• Neuron:
  – fundamental spatial dimensions: fine dendrites ~0.1 μm in diameter; lipid bilayer membrane ~5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – fundamental time scale: ~1 msec

The cerebral cortex

                                  Human                       Macaque
Thickness                         3-4 mm                      1-2 mm
Total surface area (both sides)   ~1600 cm² (~50 cm diam.)    ~160 cm² (~15 cm diam.)
Neurons / mm²                     ~10⁵                        ~10⁵
Total cortical neurons            ~2 x 10¹⁰                   ~2 x 10⁹
Visual cortex                     300-500 cm²                 80+ cm²
Visual neurons                    ~4 x 10⁹                    ~10⁹

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision

Presenter notes:
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation everywhere. → Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? → Common hardware: CORTEX.

The Visual System
[Van Essen & Anderson 1990]

Presenter notes:
In our studies of learning & information processing in the brain we focus on the visual system, the most extensively researched modality. Areas, connections.

V1: hierarchy of simple and complex cells

LGN-type cells → simple cells → complex cells
(Hubel & Wiesel 1959)

Presenter notes:
LGN-type cells: dots of light, a bright dot on a dark background and vice versa. From dots of light to "oriented bars": slightly larger RFs, either a white bar on a black background or vice versa, sensitive to the location of the bar within the RF. Then bar detectors insensitive to the polarity of contrast (white on black or black on white), with larger RFs, insensitive to the location of the bar within the RF. How does that work?

V1: orientation selectivity
[Hubel & Wiesel movie]

V1: retinotopy
(Thorpe and Fabre-Thorpe 2001)

Beyond V1: a gradual increase in RF size
Reproduced from [Kobatake & Tanaka 1994] and [Rolls 2004]

Beyond V1: a gradual increase in the complexity of the preferred stimulus
Reproduced from (Kobatake & Tanaka 1994)

AIT: face cells
Reproduced from (Desimone et al. 1984)

AIT: immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005
identification / categorization

See also: Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

The ventral stream
Source: Lennie, Maunsell, Movshon
(Thorpe and Fabre-Thorpe 2001)

We consider the feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                        Operation
Simple      selectivity (template matching)    Gaussian-like tuning (and-like)
Complex     invariance                         soft-max (or-like)

Simple units: Gaussian-like tuning operation (and-like)
Complex units: max-like operation (or-like)
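As an illustration of these two operations (a toy sketch, not the published implementation; all names are mine):

```python
import numpy as np

# Toy sketch of the two key operations.
# Simple ("S") unit: Gaussian tuning to a stored template (and-like).
# Complex ("C") unit: soft-max pooling over its afferents (or-like).

def s_unit(x, template, sigma=1.0):
    """Template matching: response peaks at 1 when x equals the template."""
    d2 = np.sum((np.asarray(x, float) - np.asarray(template, float)) ** 2)
    return float(np.exp(-d2 / (2 * sigma ** 2)))

def c_unit(afferents, q=8.0):
    """Soft-max pooling: approaches the hard max as q grows."""
    a = np.asarray(afferents, dtype=float)
    return float(np.sum(a ** (q + 1)) / np.sum(a ** q))
```

The and-like/or-like labels on the slide map directly onto these: the S unit fires only when all template components roughly match, while the C unit fires when any afferent is strongly active.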

Gaussian tuning

• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)

Biophysical implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition (Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)
• Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007)
• Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002)
• Adelson and Bergen (see also Hassenstein and Reichardt 1956)

Presenter notes:
ADD CABLE AND CITE RELEVANT PEOPLE
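One illustrative way to write such a canonical circuit (following the general divisive-normalization form discussed by Kouh & Poggio 2007; exponent values and names here are my assumptions, for illustration only): with one choice of exponents the same expression behaves like a soft-max, with another like a normalized, tuning-like dot product.

```python
import numpy as np

# Illustrative canonical circuit: y = (sum_j w_j x_j^p) / (k + (sum_j x_j^q)^r)
# Different exponents (p, q, r) make one circuit act either as soft-max
# pooling or as a normalized dot product (tuning).

def canonical(x, w, p, q, r, k=1e-12):
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.dot(w, x ** p) / (k + np.sum(x ** q) ** r))

# Soft-max-like configuration: p = 3, q = 2, r = 1, uniform weights.
# Tuning-like configuration:   p = 1, q = 2, r = 1/2, w = normalized template.
```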

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

[Figure labels: stronger facilitation, stronger suppression]

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007, doi:10.1038/nn1975.
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units but increased tolerance to the position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter notes:
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  – from V2 to TEO, bypassing V4 (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – from V4 to TE, bypassing TEO (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter notes:
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

• V1
  – simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; DeValois et al. 1982)
  – MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  – tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – two-spot interaction (Freiwald et al. 2005)
  – tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  – tuning and invariance properties (Logothetis et al. 1995)
  – differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – rapid categorization (Serre, Oliva, Poggio 2007)
  – face processing, fMRI + psychophysics (Riesenhuber et al. 2004; Jiang et al. 2006)

Presenter notes:
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: here, V1 data used to build a population of V1 simple and complex units which mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red color indicates data which are consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and data which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (no parameter fitting)
Pasupathy & Connor 1999
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes:
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter notes:
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed); probe (varying)
[Figure axes: response(probe) - response(reference) vs. response(pair) - response(reference)]

Prediction: the response to the pair falls between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes:
The biased competition model predicts that with attention away, the line should pass through the origin and the coefficient should be > 0.

Agreement w| IT readout data

Hung, Kreiman, Poggio, DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes:
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in this study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter notes:
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test the feedforward model)

Animal present or not?

Image: 20 ms → image-mask interval (ISI): 30 ms → mask (1/f noise): 80 ms

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter notes:
We used a 30 ms ISI with a 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%
Model: 82%

Serre, Oliva & Poggio 2007

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter notes:
We tried simpler computer vision systems and they could not do the job.

Further comparisons

• Image-by-image correlation:
  – heads: ρ = 0.71
  – close-body: ρ = 0.84
  – medium-body: ρ = 0.71
  – far-body: ρ = 0.60
• The model predicts the level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project
Source: Bileschi & Wolf

Presenter notes:
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and roads.

The StreetScenes Database

Object       Labeled examples
car          5,799
pedestrian   1,449
bicycle      209
building     5,067
tree         4,932
road         3,400
sky          2,562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter notes:
TODO: add 420 true labels to this slide.

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
w| T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

[SHOW MOVIE]

Presenter notes:
~19 hours of video, ~10^6 frames; developmental-like learning stage.
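A minimal sketch of the kind of trace rule this line of work builds on (Foldiak 1991): a slowly decaying trace of past activity drives the Hebbian update, so inputs that occur close together in time (e.g. a feature sweeping across positions) get wired to the same unit. The code is my toy illustration, not the Masquelier & Thorpe implementation; names and constants are assumptions.

```python
import numpy as np

def trace_rule(frames, eta=0.1, delta=0.2):
    """Learn weights for one unit from a (num_frames, num_inputs) array."""
    n = frames.shape[1]
    w = np.full(n, 1.0 / n)   # start with uniform weights
    y_trace = 0.0
    for x in frames:          # one frame at a time
        y = float(w @ x)      # unit response to the current frame
        y_trace = (1 - delta) * y_trace + delta * y   # decaying trace
        w += eta * y_trace * (x - w)  # Hebbian update toward the trace
    return w
```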

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem

Training videos / testing videos; each video ~4 s, 50-100 frames
Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2
Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset       Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance level: 10-20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter notes:
Our method has 8% better performance on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Figure: model vs. human performance at 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and the no-mask condition]
(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, the model is not good enough

Presenter notes:
False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… attention and cortical feedbacks

• Model implementation of Wolfe's guided search (1994)
• Parallel (feature-based, top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention
[Figure: detection accuracy vs. number of items]

Example results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – which biophysical mechanisms for routing/gating?
  – nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream; neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.
The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

T. Serre

Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe


Motivation for studying vision: trying to understand how the brain works

• Old dream of all philosophers and, more recently, of AI:
  – understand how the brain works
  – make intelligent machines

This tutorial: using a class of models to summarize/interpret experimental results

• Models are cartoons of reality, e.g. Bohr's model of the hydrogen atom
• All models are "wrong"
• Some models can be useful summaries of data, and some can be a good starting point for more complete theories

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

The problem: recognition in natural images (e.g. "is there an animal in the image?")

Presenter Notes: Any task can be made arbitrarily easy or difficult…

How does visual cortex solve this problem? How can computers solve this problem?

Desimone & Ungerleider 1989

dorsal stream: "where"
ventral stream: "what"

Presenter Notes: In first approximation, two streams, different tasks — dorsal: where (spatial); ventral: what (identity). We are interested in the ventral stream: how can the brain recognize 3D objects even under different viewpoints? → Look at what is happening in the ventral stream. How do we know all this? What do neurons do, and how can we analyze it?

A "feedforward" version of the problem: rapid categorization

Movie courtesy of Jim DiCarlo

(Biederman 1972; Potter 1975; Thorpe et al. 1996)

[SHOW RSVP MOVIE]

(Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007)

Modified from (Gross 1998)

A model of the ventral stream, which is also an algorithm

[software available online]

Presenter Notes: For the past several years we have developed a computational theory of object recognition in the ventral stream of the visual cortex. The model closely follows the organization of the visual cortex (color encodes correspondence between stages/layers of the model and cortical areas). Units in the model mimic as closely as possible the tuning properties of cells in corresponding cortical areas; that is, the parameters governing the tuning properties of the units (e.g. frequency bandwidth, RF sizes, complexity of the preferred stimulus, etc.) were constrained by neural data.

…solves the problem (if mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.

(Serre, Oliva & Poggio 2007)

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter Notes: We tried simpler computer vision systems and they could not do the job.
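The d′ mentioned above comes from signal detection theory: it standardizes hit and false-alarm rates before comparing observers. A minimal sketch (the hit/false-alarm numbers below are illustrative only — the slide reports overall accuracies, not hit/FA pairs):

```python
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Illustrative values, e.g. 80% hits with 20% false alarms:
print(round(d_prime(0.80, 0.20), 2))  # → 1.68
```

Because both rates are z-transformed, d′ is unaffected by a shift in response criterion, which is why it is a better performance measure than raw accuracy.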

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Object recognition for computer vision: personal historical perspective

[Timeline figure, 1990–2004; tasks: face detection, face identification, car detection, pedestrian detection, digit recognition, multi-class/multi-objects]

Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues 1996–now; Osuna & Girosi 1997 (best CVPR'07 paper, 10 yrs ago); LeCun et al. 1998; Rowley, Baluja & Kanade 1998; Schneiderman & Kanade 1998, 2000; Amit and Geman 1999; Mohan, Papageorgiou & Poggio 1999; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004; …

… Many more excellent algorithms in the past few years…

Examples: Learning Object Detection — Finding Frontal Faces

• Training database:
  – 1,000+ real, 3,000+ virtual face patterns
  – 500,000+ non-face patterns

(Sung & Poggio 1995)

~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)

Object recognition in cortex: historical perspective

[Timeline figure, 1960–1995; areas: V1 cat, V1 monkey, extrastriate cortex, IT–STS]

Hubel & Wiesel 1959, 1962, 1965, 1977; Gross et al. 1969; Zeki 1973; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995.

… Much progress in the past 10 yrs.

Some personal history. First step in developing a model: learning to recognize 3D objects in IT cortex.

(Poggio & Edelman 1990)

Examples of visual stimuli.

An idea for a module for view-invariant identification

• Architecture that accounts for invariances to 3D effects (>1 view needed to learn)
• Regularization network (GRBF) with Gaussian kernels
• View angle → view-invariant, object-specific unit
• Prediction: neurons become view-tuned through learning

(Poggio & Edelman 1990)
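The module above can be sketched in a few lines: each hidden unit is Gaussian-tuned to one stored training view, and the object-specific output unit pools them, so its response varies little with viewpoint. This is only a one-dimensional toy (view angle instead of images), with uniform pooling weights where the GRBF network would learn them by regularization:

```python
import math

def view_tuned(view, center, sigma=30.0):
    """View-tuned unit: Gaussian tuning around a stored view (degrees, wrap-around)."""
    d = abs(view - center) % 360
    d = min(d, 360 - d)
    return math.exp(-d * d / (2 * sigma ** 2))

def object_unit(view, stored_views):
    """View-invariant, object-specific unit: pools the view-tuned units.
    (In the GRBF network the pooling weights are learned; uniform here.)"""
    return sum(view_tuned(view, c) for c in stored_views)

stored = [0, 60, 120, 180, 240, 300]  # training views of one object
responses = [object_unit(v, stored) for v in range(0, 360, 10)]
```

With training views every 60° and ~30° tuning width, the pooled response stays nearly constant over all viewpoints — the interpolation effect the module relies on.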

Learning to Recognize 3D Objects in IT Cortex

(Logothetis, Pauls & Poggio 1995)

Examples of visual stimuli.

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units … physiology.

Recording Sites in Anterior IT

[Figure: recording sites; anatomical labels LUN, LAT, IOS, STS, AMTS; Ho = 0]

(Logothetis, Pauls & Poggio 1995)

… neurons tuned to faces are intermingled nearby …

Neurons tuned to object views, as predicted by model

(Logothetis, Pauls & Poggio 1995)

A "View-Tuned" IT Cell

[Figure: responses (spikes/sec over 800 msec trials) to target views from −168° to +132° in 12° steps, and to distractors]

(Logothetis, Pauls & Poggio 1995)

Presenter Notes: View-tuned cell found in the experiment; shows preferred-stimulus-specific viewpoint tuning.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)

(Logothetis, Pauls & Poggio 1995)

Scale-Invariant Responses of an IT Neuron

[Figure: responses (spikes/sec, 0–3000 msec) at stimulus sizes 1.0°–6.25°, i.e. ×0.4 to ×2.5 of the training size]

View-tuned cells: scale invariance (one training view only) motivates present model

(Logothetis, Pauls & Poggio 1995)

Presenter Notes: pretty amazing.

From "HMAX" to the model now…

(Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007)

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

• INPUT = pulses (spikes) or graded potentials
• COMPUTATION = analog
• OUTPUT = chemical (at synapses)

Some numbers

• Human brain:
  – 10¹¹–10¹² neurons (1 million in flies!)
  – 10¹⁴–10¹⁵ synapses
• Neuron:
  – Fundamental space dimensions: fine dendrites ~0.1 μm diameter; lipid-bilayer membrane ~5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – Fundamental time length: ~1 msec

The cerebral cortex

                          Human                                   Macaque
Thickness:                3–4 mm                                  1–2 mm
Total surface area:       ~1600 cm² (both sides, ~50 cm diam)     ~160 cm² (~15 cm diam)
Neurons/mm²:              ~10⁵                                    ~10⁵
Total cortical neurons:   ~2 × 10¹⁰                               ~2 × 10⁹
Visual cortex:            300–500 cm²                             80+ cm²
Visual neurons:           ~4 × 10⁹                                ~10⁹
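The totals in the table follow from area × neuron density, as a quick order-of-magnitude check shows (rounded figures, so ~1.6 × 10¹⁰ vs. the table's ~2 × 10¹⁰):

```python
# Order-of-magnitude check: total cortical neurons ≈ surface area × density.
human_area_mm2 = 1600 * 100    # ~1600 cm^2 converted to mm^2
macaque_area_mm2 = 160 * 100   # ~160 cm^2 converted to mm^2
density = 1e5                  # ~10^5 neurons per mm^2 in both species

human_total = human_area_mm2 * density     # ~1.6e10 (table: ~2e10)
macaque_total = macaque_area_mm2 * density # ~1.6e9  (table: ~2e9)
print(f"{human_total:.1e}", f"{macaque_total:.1e}")
```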

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision.

Presenter Notes: In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation are everywhere. → Q: what are the computational mechanisms? Why can we hope that there are common computational mechanisms? → common hardware: CORTEX.

The Visual System

[Van Essen & Anderson 1990]

Presenter Notes: In our studies of learning & information processing in the brain we focus on the visual system — the most extensively researched modality. Areas and their connections.

V1: hierarchy of simple and complex cells

LGN-type cells → simple cells → complex cells

(Hubel & Wiesel 1959)

Presenter Notes: LGN-type cells: dot of light — a bright dot on a dark background, and vice versa. From dots of light to "oriented bars": simple cells have slightly larger RFs, prefer either a white bar on a black background or vice versa, and are sensitive to the location of the bar within the RF. Complex cells: bar detectors insensitive to the polarity of contrast (white on black or black on white), with larger RFs, insensitive to the location of the bar within the RF. How does that work?

V1: Orientation selectivity

Hubel & Wiesel movie

V1: Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from (Kobatake & Tanaka 1994) and (Rolls 2004)

Beyond V1: a gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: a gradual increase in the complexity of the preferred stimulus

AIT: Face cells

Reproduced from (Desimone et al. 1984)

AIT: Immediate recognition

(Hung, Kreiman, Poggio & DiCarlo 2005)

identification / categorization

See also: Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider a feedforward architecture only.

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                        Pooling operation
Simple      Selectivity / template matching    Gaussian-like tuning (and-like)
Complex     Invariance                         Soft-max (or-like)

Simple units: Gaussian-like tuning operation (and-like)
Complex units: max-like operation (or-like)
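The two operations in the table can be sketched in a few lines — a Gaussian-tuned simple unit that peaks when its inputs match a stored template, and a complex unit that takes the max over its afferents (the real model applies these over image patches; plain vectors suffice to show the computations):

```python
import math

def simple_unit(inputs, template, sigma=1.0):
    """Gaussian (and-like) tuning: maximal response when inputs match the template."""
    d2 = sum((x - w) ** 2 for x, w in zip(inputs, template))
    return math.exp(-d2 / (2 * sigma ** 2))

def complex_unit(afferents):
    """Max-like (or-like) pooling: response follows the strongest afferent,
    giving tolerance to the position/scale of the preferred stimulus."""
    return max(afferents)

template = [1.0, 0.0, 1.0]
print(simple_unit(template, template))   # perfect match → 1.0
print(complex_unit([0.2, 0.9, 0.1]))    # strongest afferent wins → 0.9
```

Selectivity (simple) and invariance (complex) are built by alternating these two operations through the hierarchy.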

Gaussian tuning

• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophysical implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit, using shunting inhibition

Presenter Notes: add cable model and cite relevant people.

• Of the same form as a model of MT (Rust et al., Nature Neuroscience 2007)
• Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002)
• Adelson and Bergen (see also Hassenstein and Reichardt 1956)
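A numerical sketch of how one divisive-normalization circuit can approximate both operations. The parametric form below follows the ratio-of-polynomials idea in Kouh & Poggio (2007) — y = (Σⱼ wⱼ xⱼᵖ) / (k + (Σⱼ xⱼ^q)ʳ) — but the specific exponent settings are illustrative, not the paper's fitted values:

```python
def canonical_circuit(x, w, p, q, r, k=1e-9):
    """Divisive normalization: y = (sum_j w_j x_j^p) / (k + (sum_j x_j^q)^r)."""
    num = sum(wj * xj ** p for wj, xj in zip(w, x))
    den = k + sum(xj ** q for xj in x) ** r
    return num / den

x = [0.1, 0.9, 0.2]
ones = [1.0] * len(x)

# One exponent setting gives a softmax-like (max-approximating) response:
approx_max = canonical_circuit(x, ones, p=3, q=2, r=1.0)        # ≈ max(x) = 0.9

# Another gives normalized-dot-product tuning, peaked when x aligns with w:
w = [0.1, 0.9, 0.2]
tuned_match = canonical_circuit([0.1, 0.9, 0.2], w, p=1, q=2, r=0.5)
tuned_mismatch = canonical_circuit([0.9, 0.1, 0.2], w, p=1, q=2, r=0.5)
```

With p = q + 1 and r = 1 the ratio weights the largest input most heavily (softmax); with p = 1, q = 2, r = 1/2 the denominator is the input norm, turning the numerator into direction tuning — one circuit, two regimes, as the slide claims for shunting inhibition.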

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5–10 subunits chosen at random from all possible afferents (~100–1000)

[stronger facilitation / stronger suppression]

Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter Notes: Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project; so the only difference between an S2 and a C2 unit is the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO, bypassing V4 (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – From V4 to TE, bypassing TEO (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter Notes: In an even more extreme version… Later I am going to show that this is important.

Comparison w/ neural data

• V1:
  – Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  – MAX operation in subset of complex cells (Lampl et al. 2004)
• V4:
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – Two-spot interaction (Freiwald et al. 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT:
  – Tuning and invariance properties (Logothetis et al. 1995)
  – Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human:
  – Rapid categorization (Serre, Oliva, Poggio 2007)
  – Face processing, fMRI + psychophysics (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Three colors mark three types of data: yellow for data used to constrain the model (V1 data used to build a population of V1 simple and complex units that mimics the tuning properties of cortical cells); green for data already predicted by the original model and still consistent with the new model; red for data consistent with the new model — in some cases existing data provided by the authors of the corresponding studies, in others data collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to guide you through all of these data, but I will show a few key examples.

Comparison w/ V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 — no parameter fitting.

(Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: Finally, I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J. Neurophysiol. 98: 1733–1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter Notes: More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference stimulus (fixed), probe (varying). Horizontal axis: response(probe) − response(reference); vertical axis: response(pair) − response(reference).

Prediction: the response to the pair falls between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: The biased-competition model predicts that, with attention away, the line should pass through the origin with a positive slope coefficient.

Agreement w/ IT readout data

(Hung, Kreiman, Poggio, DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: Here is an example — the readout data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (where they recorded in this study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier — like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
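The readout idea — a linear classifier on top of an invariant feature dictionary — can be illustrated with a toy stand-in. The read-out experiments used regularized linear classifiers on real population responses; the perceptron and the two-dimensional "C2-like" vectors below are only a minimal sketch of the same principle:

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Minimal linear classifier standing in for the IT -> PFC readout stage."""
    n = len(samples[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):          # labels y in {-1, +1}
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * s <= 0:                          # misclassified: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy feature vectors for two object classes (hypothetical values):
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
```

The point of the remark above is that if the features feeding IT are selective and invariant enough, even such a simple linear stage suffices to categorize.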

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter Notes: Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms; image–mask interval (ISI): 30 ms; mask (1/f noise): 80 ms

(Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Mace et al. 2005)

Presenter Notes: We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, a fairly long SOA. The model is feedforward (you can ask me during question time); the mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.

(Serre, Oliva & Poggio 2007)

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter Notes: We tried simpler computer vision systems and they could not do the job.

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

(Serre, Oliva & Poggio, PNAS 2007)

The street scene project

Source: Bileschi & Wolf

Presenter Notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse the image into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter Notes: TODO: add 420 true labels to this slide.

Examples

Examples

Examples

Examples

Benchmark comparisons:

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

(Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007)

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

• Simple cells learn correlation in space (at the same time)
• Complex cells learn correlation in time

[SHOW MOVIE]

Presenter Notes: ~19 hours of video, ~10^6 frames; developmental-like learning stage.
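The "correlation in time" idea can be sketched with a toy trace rule in the spirit of Foldiak (1991): a unit keeps a decaying trace of its recent activity, and inputs active while the trace is high get strengthened, so features that follow each other in time (e.g. a bar sweeping across positions) end up wired to the same complex unit. The learning rate, decay, and sequence below are illustrative, not the parameters of the actual Masquelier & Thorpe work:

```python
def trace_learning(sequences, n_features, eta=0.5, decay=0.8):
    """Toy trace rule: weight to feature i grows with a temporally
    low-pass-filtered ("trace") activity, wiring temporally contiguous
    features to one complex-like unit."""
    w = [0.0] * n_features
    for seq in sequences:
        trace = 0.0
        for active in seq:                  # 'active' = index of the active feature
            trace = decay * trace + (1 - decay)   # update the activity trace
            w[active] += eta * trace              # strengthen co-active input
    return w

# A bar sweeping left->right activates position-specific features 0..3 in turn:
w = trace_learning([[0, 1, 2, 3]] * 5, n_features=6)
```

After training, the swept positions (features 0–3) all project with positive weight to the unit while unseen positions stay at zero — position invariance learned purely from temporal continuity.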

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos and testing videos, each video ~4 s, 50–100 frames.

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.

Dataset from (Blank et al. 2005). See also (Casile & Giese 2005; Sigala et al. 2005).

Previous work: recognizing biological motion using a model of the dorsal stream.

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset       Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10–20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter Notes: Our method has 8% better performance on average.

What is next? Beyond feedforward models: limitations

(Serre, Oliva, Poggio 2007; Zoccolan, Kouh, Poggio, DiCarlo 2007; Reynolds, Chelazzi & Desimone 1999)

Psychophysics / V4 / IT

What is next? Beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions: model; 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition.

(Serre, Oliva and Poggio, PNAS 2007)

Presenter Notes: False-alarm rates are ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

• Model implementation of Wolfe's guided search (1994)
• Parallel (feature-based, top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

(Chikkerur, Serre, Walther, Koch & Poggio, in prep.)

Example results: face detection, scanning vs. attention [figure: detection accuracy vs. number of items]

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next? Image inference: backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream — neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

(Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

• Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
• Comparison w/ humans: A. Oliva
• Action recognition: H. Jhuang
• Attention: S. Chikkerur, C. Koch, D. Walther
• Computer vision: S. Bileschi, L. Wolf
• Learning invariances: T. Masquelier, S. Thorpe

  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 3: NIPS2007: visual recognition in primates and machines

This tutorial: using a class of models to summarize/interpret experimental results

• Models are cartoons of reality, e.g. Bohr's model of the hydrogen atom
• All models are "wrong"
• Some models can be useful summaries of data, and some can be a good starting point for more complete theories

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

The problem: recognition in natural images (e.g. "is there an animal in the image?")

Presenter Notes: Any task can be made arbitrarily easy or difficult…

How does visual cortex solve this problem? How can computers solve this problem?

Desimone & Ungerleider 1989: dorsal stream ("where"), ventral stream ("what")

Presenter Notes: In first approximation, two streams, different tasks. Dorsal: where (spatial). Ventral: what (identity). We are interested in the ventral stream: how can the brain recognize 3D objects even under different viewpoints? --> look at what is happening in the ventral stream. How do we know all this? What do neurons do, and how can we analyze them?

A "feedforward" version of the problem: rapid categorization

[SHOW RSVP MOVIE] Movie courtesy of Jim DiCarlo
Biederman 1972; Potter 1975; Thorpe et al. 1996

A model of the ventral stream, which is also an algorithm

[software available online]
Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007. Modified from (Gross 1998).

Presenter Notes: For the past several years we have developed a computational theory of object recognition in the ventral stream of the visual cortex. The model closely follows the organization of the visual cortex (color encodes correspondence between stages/layers of the model and cortical areas). Units in the model mimic as closely as possible the tuning properties of cells in corresponding cortical areas; that is, the parameters governing the tuning of the units (e.g. frequency bandwidth, RF sizes, complexity of the preferred stimulus, etc.) were constrained by neural data.

…solves the problem (if mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.
Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter Notes: We tried simpler computer vision systems and they could not do the job.

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Object recognition for computer vision: personal historical perspective

[Timeline figure, ~1990-2000+, spanning face detection, face identification, car detection, pedestrian detection, multi-class / multi-object recognition, and digit recognition. Citations along the timeline:] Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues 1996-now; Osuna & Girosi 1997; Rowley, Baluja & Kanade 1998; Schneiderman & Kanade 1998, 2000; LeCun et al. 1998; Mohan, Papageorgiou & Poggio 1999; Amit and Geman 1999; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004; … Many more excellent algorithms in the past few years…

Best CVPR'07 paper: 10 yrs ago

Examples: Learning Object Detection: Finding Frontal Faces

• Training database: 1,000+ real and 3,000+ virtual face patterns
• 500,000+ non-face patterns

Sung & Poggio 1995

~10-year-old CBCL computer vision work: pedestrian detection system in a Mercedes test car, now becoming a product (MobilEye)

Object recognition in cortex: historical perspective

[Timeline figure, 1960-1990, spanning V1 (cat), V1 (monkey), extrastriate cortex, and IT-STS. Citations along the timeline:] Hubel & Wiesel 1959, 1962, 1965, 1977; Gross et al. 1969; Zeki 1973; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995. … Much progress in the past 10 yrs.

Some personal history. First step in developing a model: learning to recognize 3D objects in IT cortex (Poggio & Edelman 1990). [Examples of visual stimuli]

An idea for a module for view-invariant identification

• Architecture that accounts for invariances to 3D effects (>1 view needed to learn)
• Regularization Network (GRBF) with Gaussian kernels: units tuned to training views (plotted against view angle) feed a view-invariant, object-specific unit
• Prediction: neurons become view-tuned through learning

Poggio & Edelman 1990
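The module above can be sketched as a small Gaussian radial-basis-function network: each hidden unit is tuned to one stored training view, and a view-invariant, object-specific output unit sums them. A minimal sketch; the 2-D "view" features, weights and width are illustrative, not values from the experiments:

```python
import numpy as np

def rbf_view_module(stored_views, weights, sigma):
    """Build a view-invariant output unit from view-tuned hidden units,
    in the spirit of a GRBF network (all parameters illustrative)."""
    stored = np.asarray(stored_views, dtype=float)
    w = np.asarray(weights, dtype=float)

    def respond(x):
        x = np.asarray(x, dtype=float)
        d2 = np.sum((stored - x) ** 2, axis=1)     # squared distance to each stored view
        hidden = np.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian view-tuned units
        return float(w @ hidden)                   # view-invariant combination

    return respond

# Two stored "views" of the same object in a toy 2-D feature space
module = rbf_view_module([[0.0, 0.0], [1.0, 0.0]], weights=[1.0, 1.0], sigma=0.5)
```

Presenting either trained view drives the output unit strongly; a distant (untrained) input does not, which is the sense in which the output is object-specific yet view-invariant.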

Learning to Recognize 3D Objects in IT Cortex

Logothetis, Pauls & Poggio 1995. [Examples of visual stimuli.] After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units … physiology.

Recording Sites in Anterior IT

[Anatomical figure; sulci labeled LUN, LAT, IOS, STS, AMTS; Ho = 0.] Logothetis, Pauls & Poggio 1995. …neurons tuned to faces are intermingled nearby…

Neurons tuned to object views, as predicted by the model

Logothetis, Pauls & Poggio 1995

A "View-Tuned" IT Cell

[Figure: firing rate (up to 60 spikes/s, 800-ms trials) as a function of view angle, in 12-deg steps from -168 to +168 deg, for target views and for distractors; the cell responds selectively around one trained view.]

Logothetis, Pauls & Poggio 1995

Presenter Notes: A view-tuned cell found in the experiment: preferred-stimulus-specific view tuning.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale-Invariant Responses of an IT Neuron

[Figure: spike-rate histograms (0-76 spikes/s, 0-3000 ms) for the same object presented at 1.0, 1.75, 2.5, 3.25, 4.0, 4.75, 5.5 and 6.25 deg, i.e. 0.4x to 2.5x the 2.5-deg training size; the selective response is preserved across scales.]

View-tuned cells: scale invariance (one training view only) motivates the present model

Logothetis, Pauls & Poggio 1995

Presenter Notes: pretty amazing

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

• INPUT = pulses (spikes) or graded potentials
• COMPUTATION = analog
• OUTPUT = chemical

Some numbers

• Human brain
  - 10^11-10^12 neurons (~1 million in flies!)
  - 10^14-10^15 synapses
• Neuron
  - Fundamental space dimensions: fine dendrites ~0.1 micron in diameter; lipid bilayer membrane ~5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  - Fundamental time length: 1 msec

The cerebral cortex

                                  Human                    Macaque
Thickness                         3-4 mm                   1-2 mm
Total surface area (both sides)   ~1600 cm2 (~50 cm diam)  ~160 cm2 (~15 cm diam)
Neurons / mm2                     ~10^5                    ~10^5
Total cortical neurons            ~2 x 10^10               ~2 x 10^9
Visual cortex                     300-500 cm2              80+ cm2
Visual neurons                    ~4 x 10^9                ~10^9

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision.

Presenter Notes: In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too): vision, audition, somatosensory, motor… Learning & adaptation are everywhere. What are the computational mechanisms? Why can we hope that there are common computational mechanisms? --> common hardware: CORTEX.

The Visual System

[Van Essen & Anderson 1990]

Presenter Notes: In our studies of learning & information processing in the brain we focus on the visual system, the most extensively researched modality: areas and connections.

V1: hierarchy of simple and complex cells

LGN-type cells --> Simple cells --> Complex cells

(Hubel & Wiesel 1959)

Presenter Notes: LGN-type cells respond to a dot of light: a bright dot on a dark background, or vice versa. From dots of light to "oriented bars": simple cells have slightly larger RFs, respond to either a white bar on a black background or vice versa, and are sensitive to the location of the bar within the RF. Complex cells are bar detectors insensitive to the polarity of contrast (white on black or black on white), with larger RFs, insensitive to the location of the bar within the RF. How does that work?
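The simple-cell receptive field in this scheme is commonly caricatured as a Gabor patch: an oriented sinusoid under a Gaussian envelope. A sketch with illustrative parameters, showing that a vertically tuned filter responds more to a vertical bar than a horizontally tuned one does:

```python
import numpy as np

def gabor(size, theta, wavelength, sigma, phase=0.0):
    """2-D Gabor patch: the standard caricature of a V1 simple-cell RF
    (oriented sinusoid under a Gaussian envelope; parameters illustrative)."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    Xr = X * np.cos(theta) + Y * np.sin(theta)    # rotate coordinates by theta
    Yr = -X * np.sin(theta) + Y * np.cos(theta)
    envelope = np.exp(-(Xr ** 2 + Yr ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * Xr / wavelength + phase)

# A vertical bar stimulus through the RF center
img = np.zeros((11, 11))
img[:, 5] = 1.0
vert = abs(np.sum(gabor(11, 0.0, 6.0, 3.0) * img))        # vertically tuned filter
horiz = abs(np.sum(gabor(11, np.pi / 2, 6.0, 3.0) * img)) # horizontally tuned filter
```

The dot product of filter and image stands in for the simple cell's (linear) response; orientation selectivity falls out of the mismatch between the bar and the rotated carrier.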

V1: Orientation selectivity

[Hubel & Wiesel movie]

V1: Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994] and [Rolls 2004].

Beyond V1: a gradual increase in RF size. Reproduced from (Kobatake & Tanaka 1994).

Beyond V1: a gradual increase in the complexity of the preferred stimulus.

AIT: face cells. Reproduced from (Desimone et al. 1984).

AIT: immediate recognition. Hung, Kreiman, Poggio & DiCarlo 2005.

identification / categorization

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider the feedforward architecture only.

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                       Pooling operation
Simple      selectivity / template matching   Gaussian-like tuning (AND-like)
Complex     invariance                        soft-max (OR-like)

Simple units: Gaussian-like tuning operation (AND-like). Complex units: max-like operation (OR-like).
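In code, the two operations are just a Gaussian template match and a max. A hypothetical minimal sketch (real units operate on vectors of afferent activities; the toy inputs below are illustrative):

```python
import numpy as np

def simple_tune(x, template, sigma):
    """Gaussian (bell-shaped) tuning of a simple unit around a template:
    AND-like, since the response is maximal only when all afferents
    match the template simultaneously."""
    x = np.asarray(x, dtype=float)
    template = np.asarray(template, dtype=float)
    return float(np.exp(-np.sum((x - template) ** 2) / (2 * sigma ** 2)))

def complex_pool(responses):
    """MAX pooling of a complex unit over its afferents: OR-like, since
    the unit is driven by its single strongest input, which buys
    invariance to which afferent carried the signal."""
    return float(np.max(responses))
```

An exact template match yields the peak response of 1; pooling over shifted/scaled copies of the same template is what the C layers below do with `complex_pool`.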

Gaussian tuning

• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)
• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Max-like operation

• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophysical implementation

• Max- and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition
• Of the same form as a model of MT (Rust et al., Nature Neuroscience 2007)
• Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002)
• Adelson and Bergen (see also Hassenstein and Reichardt 1956)
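One way to see how a single divisive-normalization (shunting-like) circuit can yield both operations is the two-regime form below, a sketch in the spirit of Kouh & Poggio 2007 (the constant K and the exponents are illustrative): a high-exponent ratio of power sums approaches a max, while a normalized dot product peaks for inputs parallel to a stored template.

```python
import numpy as np

K = 1e-9  # small constant standing in for a shunting leak term

def softmax_pool(x, q=8.0):
    """y = sum_j x_j^(q+1) / (K + sum_j x_j^q): divisive normalization in
    its soft-max regime; approaches max(x) as the exponent q grows."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** (q + 1)) / (K + np.sum(x ** q)))

def tuning(x, w):
    """y = (w . x) / (K + ||x||): the same normalization motif in its
    tuning regime; peaks when the input pattern x is parallel to the
    template w, mimicking Gaussian-like selectivity."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.dot(w, x) / (K + np.linalg.norm(x)))
```

The point of the sketch is that the same numerator/denominator motif, with different exponents and weights, covers both of the "two key computations" above.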

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  - Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  - Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

[Figure legend: stronger facilitation vs. stronger suppression]

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen. Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007, doi:10.1038/nn1975.

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter Notes: Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project, so the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.
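The S/C alternation can be sketched end-to-end in a few lines. This is a toy stand-in, not the model's implementation: patch sizes are hypothetical, and the prototype here is hand-built, whereas in the model prototypes are sampled from natural images during the unsupervised developmental-like stage:

```python
import numpy as np

def c1_pool(s1_maps):
    """C-layer: local MAX over S units tuned to the same feature at
    slightly different positions/scales -> shift/scale tolerance."""
    return np.max(np.stack(s1_maps), axis=0)

def s2_response(c1_patch, prototype, sigma=1.0):
    """S-layer: Gaussian tuning to a stored prototype of afferent
    activity (AND-like template match)."""
    d2 = float(np.sum((np.asarray(c1_patch, dtype=float) - prototype) ** 2))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def c2_response(c1_patches, prototype):
    """C2: global MAX of the prototype's S2 match over all positions ->
    responds if the shape appears anywhere in the image."""
    return float(max(s2_response(p, prototype) for p in c1_patches))

proto = np.array([[1.0, 0.0], [0.0, 1.0]])        # a stored 2x2 "shape"
patches_a = [np.zeros((2, 2)), proto.copy()]      # shape present at position 2
patches_b = [proto.copy(), np.zeros((2, 2))]      # same shape at position 1
```

Because C2 takes the max over positions, `patches_a` and `patches_b` produce identical C2 responses: selectivity from the S stage, invariance from the C stage.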

A loose hierarchy

• Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter Notes: In an even more extreme version. Later I am going to show that this is important.

Comparison with neural data

• V1
  - Simple and complex cell tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  - MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  - Rapid categorization (Serre, Oliva, Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are three colors for three types of data: yellow for data used to constrain the model (here, V1 data used to build a population of V1 simple and complex units that mimics the tuning properties of cortical cells); green for data already predicted by the original model and still consistent with the new model; and red for data consistent with the new model — in some cases data that already existed and were provided by the authors of the corresponding studies, in others data collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am going to show a few key examples.

Comparison with V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001): a V4 neuron tuned to boundary conformations vs. the most similar model C2 unit, ρ = 0.78, with no parameter fitting (see also Pasupathy & Connor 1999).

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes: Finally, I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level).

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio. J Neurophysiol 98, 1733-1750, 2007. First published June 27, 2007.

Presenter Notes: More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but fit to single cells with a greedy learning algorithm).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. model C2 units. A reference (fixed) stimulus and a probe (varying) stimulus are presented alone or as a pair; plotting resp(pair) - resp(reference) against resp(probe) - resp(reference), the prediction is that the response to the pair falls between the responses elicited by the two stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: The biased-competition model predicts that with attention directed away, the line should pass through the origin with a positive coefficient.

Agreement with IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes: Here is an example: the read-out data shown earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: from the very first spikes once the visual signal reaches IT (where they recorded in the study), one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
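The linear read-out stage can be sketched as a regularized least-squares classifier trained on a "population response" matrix. All data below are synthetic and illustrative; the read-out experiments trained classifiers of this general kind on recorded firing rates:

```python
import numpy as np

def train_linear_readout(X, y, lam=1e-3):
    """Ridge-regularized least-squares 'readout' of a binary label from a
    population response matrix X (n_samples x n_units)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Xb @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # toy "IT population" responses
w_true = rng.normal(size=20)
y = np.sign(X @ w_true)               # linearly separable toy categories
w = train_linear_readout(X, y)
acc = float(np.mean(predict(w, X) == y))
```

The remark's point survives the toy setting: if the feature dictionary upstream is selective and invariant enough, a plain linear stage suffices for categorization.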

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]
Database collected by Oliva & Torralba

Presenter Notes: Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test the feedforward model)

Animal present or not? Image for 20 ms; image-mask interval (ISI) of 30 ms; 1/f-noise mask for 80 ms.

Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Mace et al. 2005

Presenter Notes: We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. The model is feedforward (you can ask me during question time); the mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.
Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter Notes: We tried simpler computer vision systems and they could not do the job.
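d′ combines hit and false-alarm rates into a single criterion-free sensitivity measure, d′ = z(hit rate) - z(false-alarm rate), where z is the inverse normal CDF. A small helper; the rates below are illustrative, not the study's exact numbers:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity d' = z(hits) - z(false alarms):
    a criterion-free way to compare model and human observers."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(fa_rate)

# Toy observer: 80% hits, 16% false alarms (illustrative numbers)
dp = d_prime(0.80, 0.16)
```

Because d′ factors out the response criterion, two observers with the same percent correct but different guessing biases can still be compared fairly.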

Further comparisons

• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project

Source: Bileschi & Wolf

Presenter Notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, classified as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples  5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.
http://cbcl.mit.edu/software-datasets/streetscenes

Examples [four slides of sample detections]

Comparison systems:
• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  - Supervised learning

Layers of cortical processing units

Learning invariance from temporal continuity

with T. Masquelier & S. Thorpe (CNRS, France)
Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

• Simple cells learn correlations in space (at the same time)
• Complex cells learn correlations in time

[SHOW MOVIE]

Presenter Notes: ~19 hours of video, ~10^6 frames; a developmental-like learning stage.
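Learning "correlations in time" is often sketched with a Foldiak-style trace rule: a slowly decaying trace of the postsynaptic activity gates the Hebbian update, so inputs that follow each other in time (e.g. successive views of a transforming object) get wired onto the same unit. A toy sketch with illustrative parameters, not the actual learning rule used in this work:

```python
import numpy as np

def trace_rule_update(w, x, y_trace, eta=0.05, decay=0.8):
    """One step of a trace rule (sketch): the activity trace decays
    slowly, so temporally adjacent inputs strengthen onto one unit."""
    y = float(w @ x)                         # instantaneous response
    y_trace = decay * y_trace + (1 - decay) * y
    w = w + eta * y_trace * x                # Hebbian term gated by the trace
    return w / np.linalg.norm(w), y_trace    # renormalize to keep w bounded

# Two "views" of one object appear in alternation (temporal continuity)
view_a = np.array([1.0, 0.0, 0.0, 0.0])
view_b = np.array([0.0, 1.0, 0.0, 0.0])
w = np.full(4, 0.5)                          # unit-norm, unbiased start
y_trace = 0.0
for t in range(200):
    w, y_trace = trace_rule_update(w, view_a if t % 2 == 0 else view_b, y_trace)
```

After training, the unit responds to both views: view invariance has been read out of temporal continuity rather than supervised labels.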

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos and testing videos, each video ~4 s, 50-100 frames. Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005).

Previous work: recognizing biological motion using a model of the dorsal stream. Adapted from (Giese & Poggio 2003).

Multi-class recognition accuracy

              Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10-20%)
H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter Notes: Our method has 8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, the model is not good enough

[Model vs. humans across masking conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and no-mask condition.]
(Serre, Oliva and Poggio, PNAS 2007)

Presenter Notes: False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

• Model implementation of Wolfe's Guided Search (1994)
• Parallel (feature-based top-down attention) and serial (spatial attention) processing to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Example Results: face detection, scanning vs. attention [plot of detection accuracy vs. number of items]
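The parallel stage of Guided Search can be sketched as a priority map: bottom-up feature maps weighted by top-down gains for the target's features, after which attention visits peaks serially. A hypothetical toy sketch; the feature names, gains and map sizes are illustrative, not the implementation above:

```python
import numpy as np

def guided_search_priority(feature_maps, target_gains):
    """Parallel stage: weight each bottom-up feature map by the top-down
    gain for that feature (0 if the target lacks it) and sum them into a
    single priority map (Guided Search-style sketch)."""
    maps = np.stack(list(feature_maps.values()))
    gains = np.array([target_gains.get(name, 0.0) for name in feature_maps])
    return np.tensordot(gains, maps, axes=1)   # weighted sum of maps

def first_fixation(priority):
    """Serial stage: attention goes first to the priority-map peak."""
    return np.unravel_index(np.argmax(priority), priority.shape)

red = np.zeros((4, 4)); red[1, 2] = 1.0        # a red item at (1, 2)
vert = np.zeros((4, 4)); vert[3, 0] = 1.0      # a vertical item at (3, 0)
pri = guided_search_priority({"red": red, "vertical": vert},
                             {"red": 1.0})     # searching for something red
```

With the "red" gain turned up, the distractor's location contributes nothing to the priority map, so the first fixation lands on the red item: clutter is suppressed before any serial scan.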

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… Caponnetto, Smale and Poggio, in preparation.

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison with humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

The problem: recognition in natural images (e.g. "is there an animal in the image?")

Presenter Notes:
Any task can be made arbitrarily easy or difficult...

How does visual cortex solve this problem? How can computers solve this problem?

Desimone & Ungerleider 1989

dorsal stream: "where"

ventral stream: "what"

Presenter Notes:
To a first approximation: two streams, different tasks. Dorsal: where (spatial). Ventral: what (identity). We are interested in the ventral stream: how can the brain recognize 3D objects even under different viewpoints? --> look at what's happening in the ventral stream. How do we know all this? What do neurons do & how can we analyze them?

A "feedforward" version of the problem: rapid categorization

Movie courtesy of Jim DiCarlo

Biederman 1972; Potter 1975; Thorpe et al. 1996

SHOW RSVP MOVIE

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

Modified from (Gross 1998)

A model of the ventral stream, which is also an algorithm

[software available online]

Presenter Notes:
For the past several years we have developed a computational theory of object recognition in the ventral stream of the visual cortex. The model closely follows the organization of the visual cortex (color encodes correspondence between stages/layers of the model and cortical areas). Units in the model mimic as closely as possible the tuning properties of cells in corresponding cortical areas. That is, the parameters governing the tuning properties of the units (e.g. frequency bandwidth, RF sizes, complexity of the preferred stimulus, etc.) were constrained by neural data.

...solves the problem (if mask forces feedforward processing)...

Human observers (n = 24): 80%; Model: 82%

Serre, Oliva & Poggio 2007

- d′ ~ standardized error rate
- the higher the d′, the better the performance

Presenter Notes:
We tried simpler computer vision systems and they could not do the job.

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Object recognition for computer vision: personal historical perspective

[Timeline figure, 1990–2005, by task: face detection, face identification, car detection, pedestrian detection, digit recognition, multi-class / multi-objects]

Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues 1996–now; Osuna & Girosi 1997 (best CVPR'07 paper 10 yrs ago); LeCun et al. 1998; Rowley, Baluja & Kanade 1998; Schneiderman & Kanade 1998, 2000; Amit and Geman 1999; Mohan, Papageorgiou & Poggio 1999; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004; ...

... Many more excellent algorithms in the past few years...

Examples: Learning Object Detection: Finding Frontal Faces

- Training Database:
  - 1,000+ real, 3,000+ virtual face patterns
  - 500,000+ non-face patterns

Sung & Poggio 1995

~10-year-old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)

Object recognition in cortex: Historical perspective

[Timeline figure, 1960–1990: V1 cat, V1 monkey, extrastriate cortex, IT–STS]

Hubel & Wiesel 1959; Hubel & Wiesel 1962; Hubel & Wiesel 1965; Gross et al. 1969; Zeki 1973; Hubel & Wiesel 1977; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995

... Much progress in the past 10 yrs

Some personal history. First step in developing a model: learning to recognize 3D objects in IT cortex

Poggio & Edelman 1990

Examples of Visual Stimuli

An idea for a module for view-invariant identification

Architecture that accounts for invariances to 3D effects (>1 view needed to learn)

Regularization Network (GRBF) with Gaussian kernels

View angle → view-invariant, object-specific unit

Prediction: neurons become view-tuned through learning

Poggio & Edelman 1990
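The GRBF scheme can be sketched in a few lines (a toy one-dimensional "view angle" version; the angles, sigma, and function names below are illustrative, not the paper's actual parameters): each hidden unit is a Gaussian RBF centered on a stored training view, and the view-invariant, object-specific output unit is a weighted sum of these view-tuned units, with the weights solved by the regularization network.

```python
import numpy as np

def rbf_features(view, centers, sigma=30.0):
    """View-tuned units: Gaussian receptive fields over view angle,
    each centered on one of the stored training views (degrees)."""
    return np.exp(-(view - centers) ** 2 / (2 * sigma ** 2))

# Train: store a few views of the object and solve for the output
# weights so the object unit responds 1.0 at every training view.
train_views = np.array([-120.0, -60.0, 0.0, 60.0, 120.0])
G = np.array([rbf_features(v, train_views) for v in train_views])
w = np.linalg.solve(G, np.ones(len(train_views)))

def object_unit(view):
    """View-invariant, object-specific output unit."""
    return float(w @ rbf_features(view, train_views))

r_trained = object_unit(60.0)  # a stored view -> ~1.0
r_novel = object_unit(30.0)    # an intermediate, unseen view
```

The point of the architecture is visible in `r_novel`: the network interpolates between the stored view-tuned units, so an unseen intermediate view still drives the object unit strongly.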

Learning to Recognize 3D Objects in IT Cortex

Logothetis, Pauls & Poggio 1995

Examples of Visual Stimuli

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, ...), which supports models based on view-tuned units...

... physiology

Recording Sites in Anterior IT

[Figure: recording sites; sulci labeled LUN, LAT, IOS, STS, AMTS; Ho = 0]

Logothetis, Pauls & Poggio 1995

...neurons tuned to faces are intermingled nearby...

Neurons tuned to object views, as predicted by model

Logothetis, Pauls & Poggio 1995

A "View-Tuned" IT Cell

[Figure: responses (spikes/sec; 800 msec trials) to target views from −168° to +168° in 12° steps, and to distractors]

Logothetis, Pauls & Poggio 1995

Presenter Notes:
VT cell found in the experiment; preferred stimulus; view-specific tuning.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale Invariant Responses of an IT Neuron

[Figure: spike rates (0–76 spikes/sec) over 3000 msec trials for the preferred object shown at 1.0° (×0.4), 1.75° (×0.7), 2.5° (×1.0), 3.25° (×1.3), 4.0° (×1.6), 4.75° (×1.9), 5.5° (×2.2) and 6.25° (×2.5) of visual angle]

View-tuned cells: scale invariance (one training view only) motivates present model

Logothetis, Pauls & Poggio 1995

Presenter Notes:
Pretty amazing.

From "HMAX" to the model now...

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: Modified from Jody Culham's web slides

Neuron basics

[Figure: neuron, spikes]

INPUT = pulses or graded potentials
COMPUTATION = analog
OUTPUT = chemical

Some numbers

- Human Brain
  - 10^11–10^12 neurons (flies: ~1 million)
  - 10^14–10^15 synapses
- Neuron
  - Fundamental space dimension: fine dendrites, 0.1 µm diameter; lipid bilayer membrane, 5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  - Fundamental time length: 1 msec

The cerebral cortex

                              Human                      Macaque
Thickness                     3–4 mm                     1–2 mm
Total surface area            ~1600 cm² (~50 cm diam)    ~160 cm² (~15 cm diam)
  (both sides)
Neurons / mm²                 ~10⁵                       ~10⁵
Total cortical neurons        ~2 × 10¹⁰                  ~2 × 10⁹
Visual cortex                 300–500 cm²                80+ cm²
Visual neurons                ~4 × 10⁹                   ~10⁹

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision

Presenter Notes:
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too...): vision, audition, somatosensation, motor... Learning & adaptation everywhere --> Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? --> common hardware: CORTEX.

The Visual System

[Van Essen & Anderson, 1990]

Presenter Notes:
In our studies of learning & information processing in the brain we focus on the visual system --- the most extensively researched modality. Areas, connected.

V1: hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel & Wiesel 1959)

Presenter Notes:
LGN-type cells: dot of light, bright dot on a dark background and vice-versa. From dots of light to "oriented bars": slightly larger RFs, either white bar on black background or vice-versa, sensitive to the location of the bar within the RF. Then a bar detector insensitive to polarity of contrast (white on black or black on white), with larger RF, insensitive to the location of the bar within the RF. How does that work?

V1: Orientation selectivity

Hubel & Wiesel movie

V1: Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994] and [Rolls 2004]

Beyond V1: A gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: A gradual increase in the complexity of the preferred stimulus

AIT: Face cells

Reproduced from (Desimone et al. 1984)

AIT: Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification / categorization

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie; Maunsell; Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

- It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; ...)
- As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                        Operation
Simple      Selectivity / template matching    Gaussian tuning (AND-like)
Complex     Invariance                         Soft-max (OR-like)

Simple units: Gaussian-like tuning operation (AND-like)
Complex units: max-like operation (OR-like)
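The two operations can be sketched in a few lines of NumPy (a toy sketch; the function names and the toy vectors are ours, not the model's): a simple unit computes Gaussian tuning of its input to a stored template (AND-like), and a complex unit max-pools over afferent simple units (OR-like).

```python
import numpy as np

def simple_unit(x, template, sigma=1.0):
    """Gaussian (bell-shaped) tuning: maximal response when the
    input pattern x matches the stored template (AND-like)."""
    d2 = np.sum((np.asarray(x) - np.asarray(template)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def complex_unit(afferents):
    """Max pooling over afferent simple units with the same
    selectivity but different positions/scales (OR-like)."""
    return max(afferents)

# A simple unit responds most strongly to its preferred pattern...
template = [1.0, 0.0, 1.0]
r_pref = simple_unit([1.0, 0.0, 1.0], template)   # = 1.0
r_other = simple_unit([0.0, 1.0, 0.0], template)  # < 1.0

# ...and a complex unit inherits that selectivity while gaining
# tolerance: it fires if ANY of its afferents fires.
r_complex = complex_unit([r_pref, r_other, 0.1])
```

Stacking these two steps in alternation, tuning for selectivity, max for invariance, is the basic motif the model repeats from V1 up to IT.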

Gaussian tuning

- Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
- Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

- Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
- Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys. implementation

- Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter Notes:
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)
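A sketch of why one circuit can do both (following the divisive-normalization form in Kouh & Poggio 2007; the exponents and toy input below are illustrative, not fitted values): a weighted sum of inputs raised to a power, divided by the pooled input activity, behaves like a soft-max for large exponents and like normalized template matching for small ones.

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-6):
    """Divisive normalization: weighted inputs raised to power p,
    divided by pooled activity raised to power r (a static stand-in
    for shunting inhibition). One circuit, two behaviours."""
    x = np.asarray(x, dtype=float)
    num = np.dot(w, x ** p)
    den = k + np.sum(x ** q) ** r
    return num / den

x = np.array([0.2, 0.9, 0.4])

# Soft-max regime: p = q + 1, r = 1, uniform weights -> the output
# approaches the strongest input as q grows.
softmax_like = canonical_circuit(x, w=np.ones(3), p=6, q=5, r=1)

# Tuning regime: p = 1, normalize by the input energy -> the response
# depends on the match between x and the weight vector w, and is
# maximal (about 1) when w points along x.
tuned = canonical_circuit(x, w=x / np.linalg.norm(x), p=1, q=2, r=0.5)
```

With the numbers above, `softmax_like` lands within ~1% of max(x) = 0.9, while `tuned` is essentially the cosine match between input and template; only the exponents differ between the two regimes.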

- Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  - Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
- Task-specific circuits (from IT to PFC)
  - Supervised learning ~ Gaussian RBF

S2 units

- Features of moderate complexity (n ~ 1000 types)
- Combination of V1-like complex units at different orientations
- Synaptic weights w learned from natural images
- 5–10 subunits chosen at random from all possible afferents (~100–1000)

[Figure legend: stronger facilitation / stronger suppression]

Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

- Same selectivity as S2 units, but increased tolerance to position and size of preferred stimulus
- Local pooling over S2 units with same selectivity but slightly different positions and scales
- A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter Notes:
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.
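The S2/C2 stage can be caricatured in a few lines (patch size, template count, and sigma below are illustrative, not the model's actual parameters, and random arrays stand in for C1 activity maps): S2 units compare local neighborhoods of activity against templates sampled from natural images during an unsupervised "developmental" stage; C2 units then max-pool each template's response over all positions, yielding a position-tolerant feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_templates(images, n_templates=50, size=4):
    """'Developmental' stage: store patches of C1-like activity
    sampled at random positions from natural images."""
    templates = []
    for _ in range(n_templates):
        img = images[rng.integers(len(images))]
        i = rng.integers(img.shape[0] - size)
        j = rng.integers(img.shape[1] - size)
        templates.append(img[i:i + size, j:j + size].copy())
    return templates

def s2_c2(image, templates, sigma=1.0):
    """S2: Gaussian tuning of every image patch to every template.
    C2: global max over positions -> one number per template."""
    size = templates[0].shape[0]
    c2 = np.empty(len(templates))
    for t, tmpl in enumerate(templates):
        best = -np.inf
        for i in range(image.shape[0] - size + 1):
            for j in range(image.shape[1] - size + 1):
                patch = image[i:i + size, j:j + size]
                d2 = np.sum((patch - tmpl) ** 2)
                best = max(best, np.exp(-d2 / (2 * sigma ** 2)))
        c2[t] = best
    return c2

images = [rng.random((16, 16)) for _ in range(3)]
templates = sample_templates(images)
templates[0] = images[0][:4, :4].copy()  # ensure one template is from the test image

features = s2_c2(images[0], templates)   # feed this vector to a classifier
```

Because C2 takes the max over positions, a template's response is the same wherever its preferred pattern appears in the image; the template drawn from the test image itself scores a perfect 1.0.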

A loose hierarchy

- Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
- "Replication" of simpler selectivities from lower to higher areas
- Richer dictionary of features with various levels of selectivity and invariance

Presenter Notes:
In an even more extreme version. Later I am going to show that this is important.

- V1
  - Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; DeValois et al. 1982)
  - MAX operation in subset of complex cells (Lampl et al. 2004)
- V4
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
- IT
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
- Human
  - Rapid categorization (Serre, Oliva, Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Comparison with neural data

Presenter Notes:
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: V1 data used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red color indicates data which are consistent with the new model; these in some cases already existed and were provided to us by the authors of the corresponding studies, and there are also data which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison with V4

Pasupathy & Connor 2001: Tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (no parameter fitting)

Pasupathy & Connor 1999

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes:
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested...

J. Neurophysiol. 98: 1733–1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter Notes:
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed); probe (varying)

[Figure axes: x = response(probe) − response(reference); y = response(pair) − response(reference)]

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes:
The biased competition model predicts that with attention away the line should pass through the origin and the coefficient should be > 0.

Agreement with IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes:
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in this study), from the very first spikes one can already read out object category and identity.

Remarks

- The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
- In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
- The inputs to IT are a large dictionary of selective and invariant features
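This last stage can be sketched with toy data (the feature statistics, label rule, and regularization constant below are ours; the read-out experiments trained linear classifiers on real IT population responses): a regularized least-squares linear readout over a population of invariant features.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for an IT population: 200 trials of 100-d feature
# vectors, with a binary category carried by a few of the features.
X = rng.normal(size=(200, 100))
y = np.sign(X[:, :5].sum(axis=1))

# Linear readout trained by regularized least squares: one weight
# per feature, solved in closed form.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(100), X.T @ y)

train_acc = float(np.mean(np.sign(X @ w) == y))
```

The point of the remark above is that once the feature dictionary is selective and invariant, this simple linear stage suffices: all the hard work has been done by the hierarchy below it.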

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter Notes:
Any task can be made arbitrarily easy or difficult...

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms → image-mask interval (ISI): 30 ms → mask (1/f noise): 80 ms

Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Mace et al. 2005

Presenter Notes:
We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

...solves the problem (when mask forces feedforward processing)...

Human observers (n = 24): 80%; Model: 82%

Serre, Oliva & Poggio 2007

- d′ ~ standardized error rate
- the higher the d′, the better the performance

Presenter Notes:
We tried simpler computer vision systems and they could not do the job.
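For reference, d′ is computed from hit and false-alarm rates with the standard signal-detection formula d′ = z(H) − z(FA), where z is the inverse CDF of the standard normal. A minimal sketch (the 80% / 16% rates are the approximate human numbers mentioned in the slides; exact values vary by condition):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection sensitivity: d' = z(H) - z(FA), where z is
    the inverse CDF of the standard normal distribution."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Roughly 80% hits with ~16% false alarms, as in the animal /
# non-animal task, gives d' around 1.8.
sensitivity = d_prime(0.80, 0.16)
```

Unlike percent correct, d′ separates sensitivity from response bias, which is why it is the natural yardstick for comparing model and human observers here.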

Further comparisons

- Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
- Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter Notes:
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left and the system would automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter Notes:
TODO: Add 420 True Labels to this slide.

Examples

Comparison systems:

- HoG (Dalal & Triggs 2005)
- Part-based system (Leibe et al. 2004)
- Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

- A challenge for physiology: disprove basic aspects of the architecture
- Extensions to color and stereo
- More sophisticated unsupervised, developmental learning in V1, V4, PIT: how?
- Extension to time and videos
- Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

- Generic dictionary of shape components (from V1 to IT)
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
- Task-specific circuits (from IT to PFC)
  - Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

with T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter Notes:
~19 hours of video, ~10^6 frames; developmental-like learning stage.
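The temporal-continuity idea can be sketched with a toy Foldiak (1991)-style trace rule (the learning rate, trace constant, and normalization below are our illustrative choices): a complex unit strengthens connections to whichever simple units fire while its activity trace is still high, so inputs that follow each other in time, e.g. the same feature at successive positions, become wired to the same complex unit.

```python
import numpy as np

def trace_rule(frames, alpha=0.8, lr=0.5):
    """Foldiak (1991)-style trace rule. `frames` is a (T, N) array of
    simple-unit activities over time. The complex unit keeps a leaky
    trace of its past output; weights grow for inputs that co-occur
    with a recently active trace (Hebbian update with a memory)."""
    n = frames.shape[1]
    w = np.full(n, 1.0 / n)
    trace = 0.0
    for x in frames:
        y = float(w @ x)                    # complex-unit response
        trace = alpha * trace + (1 - alpha) * y
        w += lr * trace * x                 # trace-modulated Hebbian step
        w /= np.linalg.norm(w)              # keep the weights bounded
    return w

# A bar sweeping across 4 positions, twice: units 0..3 fire in turn,
# so successive frames activate neighbouring simple units.
sweep = np.eye(4)
frames = np.vstack([sweep, sweep])
w = trace_rule(frames)
# After training, all four positions carry substantial weight: the
# unit has become tolerant to the position of the bar.
```

Substituting temporal continuity for a supervision signal is the point: nobody labels the frames, yet the pooling structure of a complex cell falls out of the statistics of the movie.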

What is next?

- A challenge for physiology: disprove basic aspects of the architecture
- Extensions to color and stereo
- More sophisticated unsupervised, developmental learning in V1, V4, PIT: how?
- Extension to time and videos
- Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50–100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset        Baseline   Our system
KTH Human      81.3       91.6
UCSD Mice      75.6       79.0
Weiz. Human    86.7       96.3
Average        81.2       89.6

(chance: 10–20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter Notes:
Our method has ~8% better performance on average.

What is next: beyond feedforward models: limitations

Psychophysics: Serre, Oliva, Poggio 2007
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

- Recognition in clutter is increasingly difficult
- Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
- Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, model not good enough

[Figure: performance of model and humans at 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and the no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Presenter Notes:
FA ~16% for humans (except no mask, with 14%) and the model ~18%.

Ongoing work... Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based, top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Figure: face detection, scanning vs. attention; detection accuracy vs. number of items]

Example Results

What is next?

- Image inference: attentional or Bayesian models
- Why hierarchies? Beyond a model: towards a theory
- Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference: backprojections and attentional mechanisms

- Normal recognition by humans (for long times) is much better
- Normal vision is much more than categorization or identification: image understanding / inference / parsing
- Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
- Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?

- Image inference: attentional or Bayesian models
- Why hierarchies? Beyond a model: towards a theory
- Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006; ...)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003
The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences

...

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision... and the brain.

Collaborators

Comparison with humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: orientation selectivity
  • V1: retinotopy
  • Slide Number 34
  • Beyond V1: a gradual increase in RF size
  • Beyond V1: a gradual increase in the complexity of the preferred stimulus
  • AIT: face cells
  • AIT: immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophysical implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT read-out data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next? Beyond feedforward models: limitations
  • What is next? Beyond feedforward models: limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next? Image inference: backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators

The problem: recognition in natural images (e.g. "is there an animal in the image?")

Presenter note: any task can be made arbitrarily easy or difficult.

How does visual cortex solve this problem? How can computers solve this problem?

Dorsal stream: "where". Ventral stream: "what". (Desimone & Ungerleider 1989)

Presenter note: in first approximation, two streams with different tasks — dorsal: where (spatial); ventral: what (identity). We are interested in the ventral stream: how can the brain recognize 3D objects even under different viewpoints? So we look at what is happening in the ventral stream. How do we know all this? What do neurons do, and how can we analyze them?

A "feedforward" version of the problem: rapid categorization

[RSVP movie, courtesy of Jim DiCarlo]
(Biederman 1972; Potter 1975; Thorpe et al. 1996)
(Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva & Poggio 2007; modified from Gross 1998)

A model of the ventral stream which is also an algorithm [software available online]

Presenter note: for the past several years we have developed a computational theory of object recognition in the ventral stream of the visual cortex. The model closely follows the organization of the visual cortex (color encodes the correspondence between stages/layers of the model and cortical areas). Units in the model mimic as closely as possible the tuning properties of cells in corresponding cortical areas; that is, the parameters governing the tuning properties of the units (e.g. frequency bandwidth, RF sizes, complexity of the preferred stimulus) were constrained by neural data.

…solves the problem (if mask forces feedforward processing)…

• Human observers (n = 24): 80% correct; model: 82% (Serre, Oliva & Poggio 2007)
• d' ~ standardized error rate; the higher the d', the better the performance

Presenter note: we tried simpler computer vision systems and they could not do the job.

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Object recognition for computer vision: personal historical perspective

[Timeline figure, 1990–2000s. Tasks: face detection, face identification, car detection, pedestrian detection, digit recognition, multi-class / multi-objects]
Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues, 1996–now; Osuna & Girosi 1997 (best CVPR'07 paper from 10 yrs ago); LeCun et al. 1998; Rowley, Baluja & Kanade 1998; Schneiderman & Kanade 1998, 2000; Mohan, Papageorgiou & Poggio 1999; Amit & Geman 1999; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004; …
… many more excellent algorithms in the past few years …

Examples: Learning Object Detection — Finding Frontal Faces

• Training database: 1000+ real, 3000+ virtual face patterns
• 500,000+ non-face patterns
(Sung & Poggio 1995)

~10-year-old CBCL computer vision work: a pedestrian detection system in a Mercedes test car, now becoming a product (MobilEye)

Object recognition in cortex: historical perspective

[Timeline figure, 1960–1990. Rows: V1 (cat), V1 (monkey), extrastriate cortex, IT–STS]
Hubel & Wiesel 1959, 1962, 1965, 1977; Gross et al. 1969; Zeki 1973; Perrett, Rolls et al. 1982; Ungerleider & Mishkin 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995
… much progress in the past 10 yrs.

Some personal history: first step in developing a model — learning to recognize 3D objects in IT cortex
(Poggio & Edelman 1990)
[Examples of visual stimuli]

An idea for a module for view-invariant identification

• Architecture that accounts for invariances to 3D effects (>1 view needed to learn)
• Regularization Network (GRBF) with Gaussian kernels: view-tuned units feeding a view-invariant, object-specific unit
[Diagram: response vs. view angle]
• Prediction: neurons become view-tuned through learning

Poggio & Edelman 1990
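The GRBF scheme above can be sketched in a few lines: each view-tuned unit responds with a Gaussian of view angle around its stored view, and a view-invariant, object-specific unit sums them. This is a minimal illustration of the idea, not the original implementation; the angles, tuning width, and weights below are invented for the example.

```python
import math

def view_tuned(view, center, sigma=15.0):
    """Gaussian view-tuned unit: peak response at its stored view (degrees)."""
    return math.exp(-((view - center) ** 2) / (2.0 * sigma ** 2))

def object_unit(view, centers, weights, sigma=15.0):
    """View-invariant, object-specific unit: weighted sum of view-tuned units."""
    return sum(w * view_tuned(view, c, sigma) for w, c in zip(weights, centers))

# Four stored training views of one object (hypothetical angles), equal weights
centers = [-60.0, -20.0, 20.0, 60.0]
weights = [1.0, 1.0, 1.0, 1.0]

# Response stays high for views inside the trained range
# and collapses far outside it
inside = object_unit(0.0, centers, weights)
outside = object_unit(150.0, centers, weights)
```

With more than one training view, the summed response is roughly flat across the trained range — the unit identifies the object while discarding viewpoint.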

Learning to Recognize 3D Objects in IT Cortex
(Logothetis, Pauls & Poggio 1995)
[Examples of visual stimuli]

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units … physiology:

Recording Sites in Anterior IT
[Anatomical figure: LUN, LAT, IOS, STS, AMTS sulci; Ho = 0]
(Logothetis, Pauls & Poggio 1995)
…neurons tuned to faces are intermingled nearby…

Neurons tuned to object views, as predicted by model
(Logothetis, Pauls & Poggio 1995)

A "View-Tuned" IT Cell
[Tuning curve: firing rate (spikes/s, up to 60) for target views from −168° to +168° in 12° steps, 800 ms presentations, compared with responses to distractors]
(Logothetis, Pauls & Poggio 1995)

Presenter note: a view-tuned cell found in the experiment, with specific viewpoint tuning for its preferred stimulus.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)
(Logothetis, Pauls & Poggio 1995)

Scale Invariant Responses of an IT Neuron
[Eight response histograms: spikes/s (0–76) vs. time (0–3000 ms), for stimulus sizes 1.0°, 1.75°, 2.5°, 3.25°, 4.0°, 4.75°, 5.5° and 6.25° (0.4× to 2.5× the training size)]

View-tuned cells: scale invariance (one training view only) motivates the present model
(Logothetis, Pauls & Poggio 1995)

Presenter note: pretty amazing.

From "HMAX" to the model now…
(Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva & Poggio 2007)

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits (source: modified from Jody Culham's web slides)

Neuron basics
[Diagram: spikes] INPUT = pulses or graded potentials; COMPUTATION = analog; OUTPUT = chemical

Some numbers

• Human brain: 10¹¹–10¹² neurons (~1 million in a fly!); 10¹⁴–10¹⁵ synapses
• Neuron:
  – fundamental space dimension: fine dendrites ~0.1 µm in diameter; lipid-bilayer membrane 5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – fundamental time length: 1 msec

The cerebral cortex

                           Human                      Macaque
Thickness                  3–4 mm                     1–2 mm
Total surface area         ~1600 cm² (~50 cm diam)    ~160 cm² (~15 cm diam)
(both sides)
Neurons/mm²                ~10⁵                       ~10⁵
Total cortical neurons     ~2 × 10¹⁰                  ~2 × 10⁹
Visual cortex              300–500 cm²                80+ cm²
Visual neurons             ~4 × 10⁹                   ~10⁹

Gross Brain Anatomy
• A large percentage of the cortex is devoted to vision

Presenter note: in the brain, different regions are involved in different tasks (vision, audition, somatosensation, motor), with learning and adaptation everywhere. Why can we hope that there are common computational mechanisms? Common hardware: cortex.

The Visual System
[Van Essen & Anderson 1990]

Presenter note: in our studies of learning and information processing in the brain we focus on the visual system, the most extensively researched modality: areas and their connections.

V1: hierarchy of simple and complex cells (Hubel & Wiesel 1959)
• LGN-type cells: dots of light (a bright dot on a dark background, and vice versa)
• Simple cells: oriented bars; slightly larger RFs; sensitive to the location of the bar within the RF
• Complex cells: bar detectors insensitive to contrast polarity and to the location of the bar within a larger RF

V1: orientation selectivity [Hubel & Wiesel movie]

V1: retinotopy (Thorpe and Fabre-Thorpe 2001)

Beyond V1: a gradual increase in RF size
(Reproduced from Kobatake & Tanaka 1994 and Rolls 2004)

Beyond V1: a gradual increase in the complexity of the preferred stimulus
(Reproduced from Kobatake & Tanaka 1994)

AIT: face cells
(Reproduced from Desimone et al. 1984)

AIT: immediate recognition — identification and categorization
(Hung, Kreiman, Poggio & DiCarlo 2005; see also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001)

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon
The ventral stream (Thorpe and Fabre-Thorpe 2001)
We consider feedforward architecture only.

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
• It is in the family of "Hubel–Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing & Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                       Operation
Simple      selectivity / template matching   Gaussian-like tuning (and-like)
Complex     invariance                        soft-max (or-like)

Simple units: Gaussian-like tuning operation (and-like). Complex units: max-like operation (or-like).
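The two operations in the table can be sketched with toy vectors (the templates and exponent below are made up for illustration): a simple unit computes Gaussian-like template matching (and-like selectivity), and a complex unit computes a soft-max over its afferents (or-like invariance).

```python
import math

def simple_unit(x, template, sigma=1.0):
    """Selectivity (and-like): Gaussian tuning around a stored template."""
    d2 = sum((xi - ti) ** 2 for xi, ti in zip(x, template))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def complex_unit(afferents, q=8.0):
    """Invariance (or-like): soft-max pooling; approaches max() as q grows."""
    num = sum(a ** (q + 1.0) for a in afferents)
    den = sum(a ** q for a in afferents) + 1e-12
    return num / den

template = [1.0, 0.0, 1.0]
good = simple_unit([1.0, 0.0, 1.0], template)   # perfect match
poor = simple_unit([0.0, 1.0, 0.0], template)   # wrong pattern
pooled = complex_unit([good, poor])             # ~ strongest afferent survives
```

Pooling with a max rather than a sum is what lets the complex unit keep the simple units' selectivity while gaining tolerance to position and scale.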

Gaussian tuning
• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation
• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)
(Knoblich, Koch & Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie & Poggio 2007)

Biophysical implementation
• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition
• Of the same form as a model of MT (Rust et al., Nature Neuroscience 2007)
• Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini & Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller & Troyer 2002)
• Adelson & Bergen (see also Hassenstein & Reichardt 1956)
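The "same canonical circuit" claim can be illustrated with the divisive-normalization form used in this line of work (after Kouh & Poggio 2007): y = (Σ w_j x_j^p) / (k + (Σ_j x_j^q)^r). The exponents and inputs below are illustrative choices, not fitted values — they only show how one form tilts toward max-like pooling or tuning-like selectivity.

```python
def canonical_circuit(x, w, p, q, r, k=1e-9):
    """One divisive-normalization form: y = (sum w_j x_j^p) / (k + (sum_j x_j^q)^r).
    Different exponents approximate the two key operations."""
    num = sum(wj * xj ** p for wj, xj in zip(w, x))
    den = k + sum(xj ** q for xj in x) ** r
    return num / den

x = [0.9, 0.2, 0.1]
ones = [1.0, 1.0, 1.0]

# Max-like: p = q + 1 with a large q -> output tracks the strongest input
softmax_out = canonical_circuit(x, ones, p=9, q=8, r=1.0)

# Tuning-like: p = 1, q = 2, r = 0.5 -> a normalized dot product with the
# weight vector, largest when the input pattern matches the weights
w = [0.9, 0.2, 0.1]
matched = canonical_circuit([0.9, 0.2, 0.1], w, p=1, q=2, r=0.5)
mismatched = canonical_circuit([0.1, 0.2, 0.9], w, p=1, q=2, r=0.5)
```

The denominator is where shunting inhibition enters: it divides the feedforward drive by a pooled measure of input activity.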

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – supervised learning ~ Gaussian RBF

S2 units
• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5–10 subunits chosen at random from all possible afferents (~100–1000)
[stronger facilitation ↔ stronger suppression]

A. Anzai, X. Peng & D. C. Van Essen, "Neurons in monkey visual area V2 encode combinations of orientations," Nature Neuroscience 10, 1313–1321 (2007); published online 16 September 2007, doi:10.1038/nn1975.

C2 units
• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter note: the existence of S2 vs. C2 units is a prediction of the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to isolate each cell's subunits. The S2 units that project onto a C2 unit all share its selectivity, so the only difference between S2 and C2 units is the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.

A loose hierarchy
• Bypass routes along with main routes:
  – from V2 to TEO, bypassing V4 (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – from V4 to TE, bypassing TEO (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter note: later I will show why this is important.

Comparison w/ neural data (Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

• V1:
  – simple and complex cell tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  – MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4:
  – tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – two-spot interaction (Freiwald et al. 2005)
  – tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT:
  – tuning and invariance properties (Logothetis et al. 1995)
  – differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human:
  – rapid categorization (Serre, Oliva & Poggio 2007)
  – face processing, fMRI + psychophysics (Riesenhuber et al. 2004; Jiang et al. 2006)

Presenter note: over the past several years we performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Some data were used to constrain the model (e.g. V1 simple and complex cell tuning), some were predicted by the original model and remain consistent with the new one, and some were collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model.

Comparison w/ V4
• Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)
• V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (Pasupathy & Connor 1999); no parameter fitting
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter note: the learning rule generates S2 units that seem congruent with V4. Charles Cadieu and Minjoon Kouh have shown that the population of S2/C2 units generated is compatible with data from several groups at the population level.

C. Cadieu, M. Kouh, A. Pasupathy, C. E. Connor, M. Riesenhuber and T. Poggio, "A Model of V4 Shape Selectivity and Invariance," J. Neurophysiol. 98, 1733–1750, 2007; first published June 27, 2007.

Presenter note: more recently, Cadieu and Kouh showed that it is possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but fit to single cells with a greedy learning algorithm).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units
• Reference stimulus (fixed), probe (varying)
[Scatter plot: resp(pair) − resp(reference) against response(probe) − response(reference)]
• Prediction: the response to the pair falls between the responses elicited by the stimuli alone
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter note: the biased-competition model predicts that with attention away, the line should pass through the origin with a positive coefficient.

Agreement w/ IT read-out data
(Hung, Kreiman, Poggio & DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter note: the level of invariance to position and scale is predicted by the model. One key result of the read-out study is that the visual system is extremely fast: from the very first spikes after the signal reaches IT, one can already read out object category and identity.

Remarks
• The stage that includes (V4–PIT)–AIT–PFC represents a learning network of the Gaussian RBF type, known from learning theory to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization
[Animal vs. non-animal movie; database collected by Oliva & Torralba]

Presenter note: any task can be made arbitrarily easy or difficult.

Rapid categorization task (with mask to test the feedforward model)
• Animal present or not?
• Image: 20 ms; image–mask interval (ISI): 30 ms; mask (1/f noise): 80 ms
(Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Macé et al. 2005)

Presenter note: we used a 30 ms ISI with a 20 ms presentation, equivalent to an SOA of 50 ms — a fairly long SOA. The model is feedforward; the mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

• Human observers (n = 24): 80% correct; model: 82% (Serre, Oliva & Poggio 2007)
• d' ~ standardized error rate; the higher the d', the better the performance

Presenter note: we tried simpler computer vision systems and they could not do the job.
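For reference, d' combines hit and false-alarm rates through the inverse normal CDF, so it standardizes performance across response biases. A quick sketch — the rates below are illustrative round numbers, not the paper's exact figures:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# ~80% hits with ~16% false alarms gives a d' of about 1.8
sensitivity = d_prime(0.80, 0.16)
```

Because d' subtracts the false-alarm z-score, two observers with the same percent correct but different guessing strategies can still be compared fairly.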

Further comparisons
• Image-by-image correlation with human responses:
  – heads: ρ = 0.71
  – close-body: ρ = 0.84
  – medium-body: ρ = 0.71
  – far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90° and inversion)
(Serre, Oliva & Poggio, PNAS 2007)

Source: Bileschi & Wolf
The street scene project

Presenter note: another application of the model, by Stan Bileschi and Lior Wolf (part of Stan's PhD thesis) — the old AI dream of inputting a street scene image and having the system automatically parse it into its main objects. Features produced by the model are computed over small windows at all positions and scales and classified as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object             car    pedestrian  bicycle  building  tree   road   sky
Labeled examples   5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.
http://cbcl.mit.edu/software-datasets/streetscenes

Examples
[Four slides of labeled street-scene detections]

Comparison systems:
• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)
(Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007)

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?
• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT): unsupervised learning during a developmental-like stage, learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC): supervised learning
Layers of cortical processing units

Learning the invariance from temporal continuity
w/ T. Masquelier & S. Thorpe (CNRS, France)
(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

• Simple cells learn correlation in space (at the same time)
• Complex cells learn correlation in time
[Movie]

Presenter note: ~19 hours of video, ~10^6 frames; a developmental-like learning stage.
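A toy version of "complex cells learn correlation in time", in the spirit of Foldiak's (1991) trace rule — the exact update and parameters below are invented for illustration. A unit's weight to an input grows in proportion to a decaying trace of recent activity, so features that follow each other in time (e.g. the same bar sweeping across positions) get bound onto one unit, while a feature seen only in isolation does not.

```python
def one_hot(i, n):
    v = [0.0] * n
    v[i] = 1.0
    return v

def trace_rule(frames, n, lr=0.2, decay=0.6):
    """Toy trace rule: weight growth is gated by a decaying trace of recent
    activity, binding temporally contiguous inputs onto the same unit."""
    w = [0.0] * n
    trace = 0.0
    for x in frames:
        drive = sum(wi * xi for wi, xi in zip(w, x)) + max(x)  # bottom-up drive
        trace = decay * trace + (1.0 - decay) * drive
        w = [wi + lr * trace * xi for wi, xi in zip(w, x)]
    return w

n = 6
sweep = [one_hot(i, n) for i in range(4)]   # a bar sweeping positions 0..3
blanks = [[0.0] * n for _ in range(5)]      # empty frames let the trace decay
isolated = [one_hot(5, n)]                  # position 5 appears once, alone
movie = sweep * 5 + blanks + isolated

w = trace_rule(movie, n)
# Positions seen in temporal sequence end up much more strongly
# connected than the isolated position.
```

The resulting unit responds to the bar at any of positions 0–3 — position invariance learned purely from temporal continuity, with no labels.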

What is next?
• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos and testing videos, each ~4 s (50–100 frames)
Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2
Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (chance: 10–20%)

               Baseline   Our system
KTH Human      81.3       91.6
UCSD Mice      75.6       79.0
Weiz. Human    86.7       96.3
Average        81.2       89.6

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter note: our method has ~8% better performance on average.

What is next? Beyond feedforward models: limitations
• Psychophysics: Serre, Oliva & Poggio 2007
• V4: Reynolds, Chelazzi & Desimone 1999
• IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, the model is not good enough
[Model vs. human performance at SOA 20 ms (ISI = 0 ms), 50 ms (ISI = 30 ms), 80 ms (ISI = 60 ms), and the no-mask condition]
(Serre, Oliva and Poggio, PNAS 2007)

Presenter note: false-alarm rates are ~16% for humans (14% with no mask) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks
• Model implementation of Wolfe's guided search (1994)
• Parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)
(Chikkerur, Serre, Walther, Koch & Poggio, in prep.)

Example results
[Plot: face detection accuracy vs. number of items, scanning vs. attention]

What is next?
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next? Image inference: backprojections and attentional mechanisms
• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• A feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – which biophysical mechanisms for routing/gating?
  – nature of the routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Bayesian models
(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)
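The computation these models share — each area combining a top-down prior over hypotheses with bottom-up likelihoods and passing the posterior on — can be sketched in a few lines. The hypotheses and all numbers below are made up for illustration; real models of this kind (e.g. Lee & Mumford's) run such message passing over image variables at every level.

```python
def posterior(prior, likelihoods):
    """One Bayes step at a cortical 'area': combine a top-down prior over
    hypotheses with bottom-up likelihoods P(evidence | hypothesis)."""
    unnorm = {h: prior[h] * likelihoods[h] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# Hypothetical top-level hypotheses, uniform prior
prior = {"animal": 0.5, "non-animal": 0.5}

# Bottom-up evidence from one area weakly favors 'animal'...
belief = posterior(prior, {"animal": 0.7, "non-animal": 0.4})
# ...and the updated belief is fed back down as the next stage's prior
belief = posterior(belief, {"animal": 0.6, "non-animal": 0.3})
```

Iterating this loop is what "converging to globally consistent values" means: each pass tightens the agreement between top-down hypotheses and bottom-up evidence.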

What is next?
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

"How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples. A comparison with real brains offers another related challenge to learning theory. The 'learning algorithms' we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory? Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex."

T. Poggio and S. Smale, "The Mathematics of Learning: Dealing with Data," Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003.

Formalizing the hierarchy: towards a theory
• Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie, "Derived Distance: towards a mathematical theory of visual cortex," CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.
• From a model to a theory: mathematical results on unsupervised learning of invariances and of a dictionary of shapes from image sequences … (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision…

and the brain

Collaborators

Comparison with humans: A. Oliva

Action recognition: H. Jhuang

Attention: S. Chikkerur, C. Koch, D. Walther

Computer vision: S. Bileschi, L. Wolf

Learning invariances: T. Masquelier, S. Thorpe

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (e.g. “is there an animal in the image?”)
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A “feedforward” version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A “View-Tuned” IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From “HMAX” to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for “immediate recognition”
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 6: NIPS2007: visual recognition in primates and machines

How does visual cortex solve this problem? How can computers solve this problem?

Desimone amp Ungerleider

1989

dorsal stream: “where”

ventral stream: “what”

Presenter
Presentation Notes
In first approximation: two streams, different tasks. Dorsal: where (spatial). Ventral: what (identity). We are interested in the ventral stream (triangle: theory <-> applications <-> neurons; and it's also sexy). How can the brain recognize 3D objects even under different viewpoints? -> Look at what's happening in the ventral stream. How do we know all this? -> Jon: what do neurons do & how can we analyze it?

A ldquofeedforwardrdquo version of the problem rapid categorization

Movie courtesy of Jim DiCarlo

Biederman 1972; Potter 1975; Thorpe et al. 1996

SHOW RSVP MOVIE

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

Modified from (Gross 1998)

A model of the ventral stream which is also an algorithm

[software available online]

Presenter
Presentation Notes
For the past several years we have developed a computational theory of object recognition in the ventral stream of the visual cortex. The model closely follows the organization of the visual cortex (color encodes correspondence between stages/layers of the model and cortical areas). Units in the model mimic as closely as possible the tuning properties of cells in corresponding cortical areas. That is, the parameters governing the tuning properties of the units (e.g. frequency bandwidth, RF sizes, complexity of the preferred stimulus, etc.) were constrained by neural data.

…solves the problem (if mask forces feedforward processing)…

Human observers (n = 24): 80%
Model: 82%

- d' ~ standardized error rate
- the higher the d', the better the performance

Serre, Oliva & Poggio 2007

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
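The d' measure referred to above can be computed from a hit rate and a false-alarm rate via the inverse normal CDF. A minimal sketch (the rates below are illustrative placeholders; the slide reports only overall accuracy, not the exact hit/false-alarm rates):

```python
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """d' = z(hit rate) - z(false-alarm rate): a criterion-free
    sensitivity index; higher d' means better discrimination."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Illustrative rates (not the actual experimental values):
print(round(d_prime(0.80, 0.16), 2))  # → 1.84
```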

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Object recognition for computer vision: personal historical perspective

[Timeline figure, ~1990–2004, grouped by task: face detection, face identification, car detection, pedestrian detection, digit recognition, multi-class/multi-object recognition]

Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues 1996–now; Osuna & Girosi 1997; LeCun et al. 1998; Rowley, Baluja & Kanade 1998; Schneiderman & Kanade 1998, 2000; Amit and Geman 1999; Mohan, Papageorgiou & Poggio 1999; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004 ("best CVPR'07 paper 10 yrs ago" marked on the timeline)

… Many more excellent algorithms in the past few years…

Examples: Learning Object Detection: Finding Frontal Faces

- Training database: 1,000+ real and 3,000+ virtual face patterns
- 500,000+ non-face patterns

Sung & Poggio 1995

~10-year-old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)

Object recognition in cortex: Historical perspective

[Timeline figure, 1960–1990s: V1 cat, V1 monkey, extrastriate cortex, IT-STS]

Hubel & Wiesel 1959, 1962, 1965, 1977; Gross et al. 1969; Zeki 1973; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995

… Much progress in the past 10 yrs

Some personal history: first step in developing a model, learning to recognize 3D objects in IT cortex

Poggio & Edelman 1990

Examples of Visual Stimuli

An idea for a module for view-invariant identification

Architecture that accounts for invariance to 3D effects (>1 view needed to learn): a Regularization Network (GRBF) with Gaussian kernels.

View-invariant, object-specific unit (tuned across view angle).

Prediction: neurons become view-tuned through learning.

Poggio & Edelman 1990
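The GRBF module can be sketched in a few lines: view-tuned units are Gaussian kernels centered on stored example views, and the view-invariant, object-specific unit is their weighted sum. A minimal sketch (the vector encoding of a "view" and the parameter values are placeholders, not the original model's):

```python
import numpy as np

def grbf_response(x, views, sigma=1.0, weights=None):
    """View-invariant, object-specific unit: weighted sum of Gaussian
    (view-tuned) units centered on stored example views."""
    x = np.asarray(x, dtype=float)
    views = np.asarray(views, dtype=float)
    if weights is None:
        weights = np.ones(len(views))
    d2 = ((views - x) ** 2).sum(axis=1)      # squared distance to each stored view
    tuned = np.exp(-d2 / (2 * sigma ** 2))   # view-tuned (Gaussian) unit responses
    return float(weights @ tuned)            # pooled, roughly view-invariant output
```

The response stays high for inputs near any of the stored views and falls off away from all of them, which is the qualitative behavior the view-tuned IT cells in the following slides are compared against.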

Learning to Recognize 3D Objects in IT Cortex

Logothetis, Pauls & Poggio 1995

Examples of Visual Stimuli

After human psychophysics (Bülthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units

… physiology

Recording Sites in Anterior IT

[Figure: recording sites; sulci labeled LUN, LAT, IOS, STS, AMTS; Ho = 0]

Logothetis, Pauls & Poggio 1995

…neurons tuned to faces are intermingled nearby…

Neurons tuned to object views as predicted by model

Logothetis, Pauls & Poggio 1995

A “View-Tuned” IT Cell

[Figure: responses (spikes/s, 800 ms trials) to target views from -168° to +168° in 12° steps, and to distractors; the cell responds selectively around one trained view]

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
VT cell found in the experiment. <show> preferred stimulus; specific viewpoint tuning.

But also view-invariant object-specific neurons (5 of them over 1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale Invariant Responses of an IT Neuron

[Figure: spike histograms (spikes/s vs. time, 0–3000 ms) for stimulus sizes 1.0°, 1.75°, 2.5°, 3.25°, 4.0°, 4.75°, 5.5°, 6.25° (×0.4 to ×2.5 of the training size); the response is preserved across scales]

View-tuned cells: scale invariance (one training view only) motivates present model

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
pretty amazing

From “HMAX” to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

spikes

INPUT = pulses or graded potentials

COMPUTATION = analog

OUTPUT = chemical

Some numbers

- Human brain: 10¹¹–10¹² neurons (1 million in flies!), 10¹⁴–10¹⁵ synapses
- Neuron:
  - fundamental space dimension: fine dendrites ~0.1 μm in diameter; lipid bilayer membrane 5 nm thick; specific proteins (pumps, channels, receptors, enzymes)
  - fundamental time length: 1 msec

The cerebral cortex

                          Human                      Macaque
Thickness                 3–4 mm                     1–2 mm
Total surface area        ~1600 cm² (~50 cm diam)    ~160 cm² (~15 cm diam)   (both sides)
Neurons per mm²           ~10⁵                       ~10⁵
Total cortical neurons    ~2 × 10¹⁰                  ~2 × 10⁹
Visual cortex             300–500 cm²                80+ cm²
Visual neurons            ~4 × 10⁹                   ~10⁹

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation everywhere. -> Q: what are the computational mechanisms? Why can we hope that there are common computational mechanisms? -> Common hardware: CORTEX.

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system --- the most extensively researched modality. Areas connected.

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: dot of light, bright dot on a dark background and vice-versa. From dots of light to “oriented bars”: slightly larger RFs, either white bar on black background or vice-versa, sensitive to the location of the bar within the RF. Bar detector insensitive to polarity of contrast (white on black or black on white); larger RF; insensitive to the location of the bar within the RF. How does that work?

V1 Orientation selectivity

Hubel & Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994] and [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification

categorization

See also: Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for “immediate recognition”

- It is in the family of “Hubel-Wiesel” models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006, …)

- As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type  | Computation                      | Pooling operation
Simple     | selectivity / template matching  | Gaussian-like tuning (and-like)
Complex    | invariance                       | soft-max (or-like)

Simple units: Gaussian-like tuning operation (and-like)
Complex units: max-like operation (or-like)
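The two operations above can be sketched in a few lines. This is a minimal illustration, not the full model: the Gaussian width sigma and the soft-max exponent q are placeholder values, and the soft-max form y = Σx^(q+1) / Σx^q approaches a hard max as q grows:

```python
import numpy as np

def simple_unit(x, w, sigma=0.5):
    """Selectivity: Gaussian template matching (and-like).
    Peaks at 1 when the input x equals the stored template w."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    return float(np.exp(-np.sum((x - w) ** 2) / (2 * sigma ** 2)))

def complex_unit(afferents, q=8.0):
    """Invariance: soft-max pooling over afferents (or-like).
    y = sum(x^(q+1)) / sum(x^q); large q approximates a hard max."""
    a = np.asarray(afferents, float)
    return float(np.sum(a ** (q + 1)) / (np.sum(a ** q) + 1e-12))
```

With a large q the complex unit's output is dominated by its strongest afferent, which is how the model trades a little selectivity for tolerance to position and scale.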

Gaussian tuning

- Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
- Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

- Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
- Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophysical implementation

- Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)

- Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  - unsupervised learning (from ~10,000 natural images) during a developmental-like stage

- Task-specific circuits (from IT to PFC)
  - supervised learning ~ Gaussian RBF

S2 units

- Features of moderate complexity (n ~ 1000 types)
- Combination of V1-like complex units at different orientations
- Synaptic weights w learned from natural images
- 5–10 subunits chosen at random from all possible afferents (~100–1000)

[stronger facilitation / stronger suppression]

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen. Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007, doi:10.1038/nn1975.

C2 units

- Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
- Local pooling over S2 units with the same selectivity but slightly different positions and scales
- A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.
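The S2/C2 stage can be sketched as follows. This is a toy illustration under stated assumptions: the array shapes, the Gaussian sigma, and the prototype are placeholders; in the model the prototypes are patches of C1 activity sampled from natural images, and C2 also pools across scale bands:

```python
import numpy as np

def s2_map(c1, prototype, sigma=1.0):
    """S2: Gaussian tuning of each local C1 patch to a stored prototype.
    c1: (H, W, n_orient) complex-cell responses;
    prototype: (k, k, n_orient) template patch."""
    k = prototype.shape[0]
    H, W = c1.shape[0] - k + 1, c1.shape[1] - k + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            d2 = np.sum((c1[i:i + k, j:j + k] - prototype) ** 2)
            out[i, j] = np.exp(-d2 / (2 * sigma ** 2))  # 1 at an exact match
    return out

def c2_unit(s2_maps):
    """C2: global max over positions (and scales) of S2 maps sharing the
    same prototype -> a position/scale-tolerant feature value."""
    return max(float(m.max()) for m in s2_maps)
```

Wherever the prototype pattern occurs in the C1 map, the S2 response peaks there, and the C2 max carries that peak regardless of where (or at which scale) it occurred.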

A loose hierarchy

- Bypass routes along with main routes:
  - from V2 to TEO, bypassing V4 (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - from V4 to TE, bypassing TEO (Desimone et al. 1980; Saleem et al. 1992)
- “Replication” of simpler selectivities from lower to higher areas
- Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison with neural data

- V1:
  - simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  - MAX operation in a subset of complex cells (Lampl et al. 2004)
- V4:
  - tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - two-spot interaction (Freiwald et al. 2005)
  - tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
- IT:
  - tuning and invariance properties (Logothetis et al. 1995)
  - differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
- Human:
  - rapid categorization (Serre, Oliva, Poggio 2007)
  - face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model; here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red color indicates data which are consistent with the new model; these are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison with V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (no parameter fitting; stimuli from Pasupathy & Connor 1999)

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio. J. Neurophysiol. 98: 1733-1750, 2007; first published June 27, 2007.

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed), probe (varying); x-axis: response(probe) - response(reference); y-axis: response(pair) - response(reference).

Prediction: the response to the pair falls between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin, and we should get a coefficient > 0.

Agreement with IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example, for instance the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in the study), from the very first spikes one can already read out object category and identity.

Remarks

- The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
- In the theory, the stage between IT and “PFC” is a linear classifier, like the one used in the read-out experiments
- The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms; image–mask interval (ISI): 30 ms; mask (1/f noise): 80 ms

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Macé et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%
Model: 82%

- d' ~ standardized error rate
- the higher the d', the better the performance

Serre, Oliva & Poggio 2007

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

- Image-by-image correlation:
  - heads: ρ = 0.71
  - close-body: ρ = 0.84
  - medium-body: ρ = 0.71
  - far-body: ρ = 0.60
- Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse the image into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree  road  sky
Labeled examples:  5799   1449        209      5067      4932  3400  2562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

Baselines:

- HoG (Dalal & Triggs 2005)
- part-based system (Leibe et al. 2004)
- local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

- A challenge for physiology: disprove basic aspects of the architecture
- Extensions to color and stereo
- More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
- Extension to time and videos
- Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?

- A challenge for physiology: disprove basic aspects of the architecture
- Extensions to color and stereo
- More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
- Extension to time and videos
- Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

- Generic dictionary of shape components (from V1 to IT)
  - unsupervised learning during a developmental-like stage: learning dictionaries of “templates” at different S levels

- Task-specific circuits (from IT to PFC)
  - supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

with T. Masquelier & S. Thorpe (CNRS, France)

(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

Simple cells learn correlations in space (at the same time); complex cells learn correlations in time.

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
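The temporal-continuity idea can be sketched with a Foldiak-style trace rule: a unit's weights move toward inputs that co-occur with a slowly decaying trace of its own activity, so features appearing in consecutive frames get grouped onto the same unit. A minimal sketch (the frame encoding, learning rate, and trace decay are placeholder choices, not the Masquelier & Thorpe implementation):

```python
import numpy as np

def trace_rule(frames, n_complex, eta=0.05, decay=0.8, seed=0):
    """Trace learning over a frame sequence.
    frames: (T, D) array, one input vector per video frame;
    returns the (n_complex, D) learned weight matrix."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_complex, frames.shape[1]))
    trace = np.zeros(n_complex)
    for x in frames:
        y = W @ x                                  # instantaneous responses
        trace = decay * trace + (1 - decay) * y    # slowly decaying activity trace
        W += eta * np.outer(trace, x)              # Hebbian update gated by the trace
        W /= np.linalg.norm(W, axis=1, keepdims=True)  # keep weights bounded
    return W
```

Because the trace bridges successive frames, inputs that follow each other in time (e.g. the same feature at shifted positions) strengthen the same unit, which is one way a complex-cell-like invariance could be learned without supervision.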

What is next?

- A challenge for physiology: disprove basic aspects of the architecture
- Extensions to color and stereo
- More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
- Extension to time and videos
- Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem

Training videos and testing videos; each video ~4 s, 50–100 frames.

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (chance: 10–20%)

Dataset       Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models, limitations

- Recognition in clutter is increasingly difficult
- Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
- Notice: this is a “novel” justification for the need of attention

[Figure: model vs. human performance at 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and in the no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms the model is not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Figure: face detection, scanning vs. attention; detection accuracy vs. number of items]

Example Results

What is next?

- Image inference: attentional or Bayesian models

- Why hierarchies? Beyond a model, towards a theory

- Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

- Normal recognition by humans (for long times) is much better
- Normal vision is much more than categorization or identification: image understanding/inference/parsing
- Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
- Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  -- Which biophysical mechanisms for routing/gating?
  -- Nature of routines in higher areas (e.g. PFC)?


  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 7: NIPS2007: visual recognition in primates and machines

A "feedforward" version of the problem: rapid categorization

Movie courtesy of Jim DiCarlo

Biederman 1972; Potter 1975; Thorpe et al. 1996

SHOW RSVP MOVIE

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

Modified from (Gross 1998)

A model of the ventral stream, which is also an algorithm

[software available online]

Presenter
Presentation Notes
For the past several years we have developed a computational theory of object recognition in the ventral stream of the visual cortex. The model closely follows the organization of the visual cortex (color encodes correspondence between stages/layers of the model and cortical areas). Units in the model mimic as closely as possible the tuning properties of cells in corresponding cortical areas; that is, the parameters governing the tuning properties of the units (e.g. frequency bandwidth, RF sizes, complexity of the preferred stimulus, etc.) were constrained by neural data.

…solves the problem (if mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.

Serre, Oliva & Poggio 2007

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
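The d' measure on the slide can be computed from hit and false-alarm rates with the inverse normal CDF. A minimal sketch of the standard signal-detection formula (not code from the tutorial; the example rates are invented):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate).
    Higher d' means better discrimination, independent of response bias."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(fa_rate)

# e.g. 80% hits with 20% false alarms:
sensitivity = d_prime(0.80, 0.20)
```

This is why d' rather than raw percent correct is used to compare humans and model: it separates sensitivity from any bias toward answering "animal".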

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Object recognition for computer vision: personal historical perspective

[Timeline figure, 1990-2000+, of detection/recognition work by task (face detection, face identification, car detection, pedestrian detection, multi-class/multi-objects, digit recognition): Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues 1996-now; Osuna & Girosi 1997; LeCun et al. 1998; Rowley, Baluja & Kanade 1998; Schneiderman & Kanade 1998, 2000; Mohan, Papageorgiou & Poggio 1999; Amit and Geman 1999; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003 (best CVPR'07 paper, 10 yrs ago); Torralba et al. 2004.]

… Many more excellent algorithms in the past few years…

Examples: Learning Object Detection, Finding Frontal Faces

• Training Database
• 1000+ Real, 3000+ VIRTUAL
• 500,000+ Non-Face Patterns

Sung & Poggio 1995

~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)

Object recognition in cortex: Historical perspective

[Timeline figure, 1960-1990, spanning V1 (cat), V1 (monkey), extrastriate cortex, and IT-STS: Hubel & Wiesel 1959, 1962, 1965, 1977; Gross et al. 1969; Zeki 1973; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995.]

… Much progress in the past 10 yrs

Some personal history. First step in developing a model: learning to recognize 3D objects in IT cortex

Poggio & Edelman 1990

Examples of Visual Stimuli

An idea for a module for view-invariant identification

Architecture that accounts for invariances to 3D effects (>1 view needed to learn)

Regularization Network (GRBF) with Gaussian kernels

[Figure: view-tuned units over View Angle feeding a VIEW-INVARIANT, OBJECT-SPECIFIC UNIT]

Prediction: neurons become view-tuned through learning

Poggio & Edelman 1990
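The module above can be written down in a few lines. A hedged sketch of the Poggio-Edelman idea (the uniform weights and Gaussian width are illustrative choices of mine, not the published network, whose combination weights are learned from several views):

```python
import numpy as np

def view_tuned_responses(x, stored_views, sigma=1.0):
    """Gaussian RBF units, each centered on one stored 2D view of the object."""
    v = np.asarray(stored_views, dtype=float)
    return np.exp(-np.sum((v - x) ** 2, axis=1) / (2 * sigma ** 2))

def view_invariant_unit(x, stored_views, weights=None, sigma=1.0):
    """View-invariant, object-specific unit: a linear combination of the
    view-tuned units (in the real model these weights are learned)."""
    r = view_tuned_responses(x, stored_views, sigma)
    if weights is None:
        weights = np.ones(len(r))  # uniform weights, for illustration only
    return float(weights @ r)
```

With a few stored views the output unit responds to any nearby view of that object but stays quiet for distant (distractor) inputs, which is the behavior later sought in IT.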

Learning to Recognize 3D Objects in IT Cortex

Logothetis, Pauls & Poggio 1995

Examples of Visual Stimuli

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units: … physiology

hellip physiology

Recording Sites in Anterior IT

LUNLAT

IOS

STS

AMTSLATSTS

AMTS

Ho=0

Logothetis Pauls

amp Poggio 1995

hellipneurons tuned to faces are intermingled

nearbyhellip

Neurons tuned to object views, as predicted by model

Logothetis, Pauls & Poggio 1995

A "View-Tuned" IT Cell

[Figure: responses (spikes/sec, 800 msec trials) of a view-tuned IT cell to target views from -168° to +168° in 12° steps, and to distractors]

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
VT cell found in the experiment. [show] preferred-stimulus-specific viewpoint tuning

But also view-invariant, object-specific neurons (5 of them over 1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale Invariant Responses of an IT Neuron

[Figure: responses (spikes/sec, 0-76) over 0-3000 msec of one IT neuron to its preferred object at eight sizes: 1.0 deg (×0.4), 1.75 deg (×0.7), 2.5 deg (×1.0), 3.25 deg (×1.3), 4.0 deg (×1.6), 4.75 deg (×1.9), 5.5 deg (×2.2), 6.25 deg (×2.5)]

View-tuned cells: scale invariance (one training view only) motivates present model

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
pretty amazing

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

spikes
INPUT = pulses or graded potentials
COMPUTATION = analog
OUTPUT = chemical

Some numbers

• Human Brain
  – 10^11-10^12 neurons (1 million: flies!)
  – 10^14-10^15 synapses

• Neuron
  – Fundamental space dimension: fine dendrites, 0.1 µm diameter; lipid bilayer membrane, 5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – Fundamental time length: 1 msec

The cerebral cortex

                         Human            Macaque
Thickness                3-4 mm           1-2 mm
Total surface area       ~1600 cm2        ~160 cm2
(both sides)             (~50 cm diam)    (~15 cm diam)
Neurons/mm2              ~10^5            ~10^5
Total cortical neurons   ~2 x 10^10       ~2 x 10^9
Visual cortex            300-500 cm2      80+ cm2
Visual neurons           ~4 x 10^9        ~10^9

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation everywhere. --> Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? --> common hardware: CORTEX

The Visual System

[Van Essen & Anderson 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system, the most extensively researched modality. Areas, connections.

V1: hierarchy of simple and complex cells

LGN-type cells
Simple cells
Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: dot of light, bright dot on a dark background and vice-versa. From dots of light to "oriented bars": slightly larger RFs, either white bar on black background or vice-versa, sensitive to the location of the bar within the RF. Complex cells: bar detector insensitive to polarity of contrast (white on black or black on white), larger RF, insensitive to the location of the bar within the RF. How does that work?

V1: Orientation selectivity

Hubel & Wiesel movie

V1: Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]; reproduced from [Rolls 2004]

Beyond V1: A gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: A gradual increase in the complexity of the preferred stimulus

AIT: Face cells

Reproduced from (Desimone et al. 1984)

AIT: Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification / categorization

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006…)

• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type | Computation                     | Operation
Simple    | Selectivity (template matching) | Gaussian-like tuning (and-like)
Complex   | Invariance (pooling)            | Soft-max (or-like)

Simple units: Gaussian-like tuning operation (and-like)
Complex units: max-like operation (or-like)
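The two operations can be sketched in a few lines. A toy illustration (the tuning width and softmax sharpness are my own parameter choices, not the model's):

```python
import numpy as np

def simple_unit(x, template, sigma=1.0):
    """Selectivity: Gaussian ('and-like') tuning of an S unit around a
    stored template; response is 1 when the input matches the template."""
    return float(np.exp(-np.sum((x - template) ** 2) / (2 * sigma ** 2)))

def complex_unit(afferents, beta=8.0):
    """Invariance: soft-max ('or-like') pooling of a C unit over its
    afferents; large beta approaches a hard max over the inputs."""
    r = np.asarray(afferents, dtype=float)
    w = np.exp(beta * r)
    return float(np.sum(r * w) / np.sum(w))
```

Stacking these two stages in alternation is what builds up selectivity (S layers) and position/scale tolerance (C layers) along the hierarchy.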

Gaussian tuning

Gaussian tuning in IT around 3D views: Logothetis, Pauls & Poggio 1995

Gaussian tuning in V1 for orientation: Hubel & Wiesel 1958

Max-like operation

Max-like behavior in V1: Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007

Max-like behavior in V4: Gawne & Martin 2002

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophysical implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as a model of MT (Rust et al., Nature Neuroscience 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)
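The one-circuit claim can be illustrated with the divisive-normalization form y = (Σ_j w_j x_j^p) / (k + (Σ_j x_j^q)^r) discussed by Kouh & Poggio (2007); the particular exponents and inputs below are illustrative, not taken from the paper:

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-9):
    """Divisive normalization (shunting-inhibition-like):
    y = (sum_j w_j * x_j**p) / (k + (sum_j x_j**q)**r)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(w * x ** p) / (k + np.sum(x ** q) ** r))

x = np.array([0.2, 1.0])
# Tuning regime (p=1, q=2, r=1/2): a normalized dot product with the weights,
# largest when the input pattern is proportional to w.
tuned = canonical_circuit(x, np.array([0.2, 1.0]), p=1, q=2, r=0.5)
# Max-like regime (uniform weights, p=q+1, r=1): output ~ max_j x_j for large q.
maxish = canonical_circuit(x, np.ones(2), p=4, q=3, r=1.0)
```

Changing only the exponents thus moves the same circuit between the "and-like" tuning and "or-like" max operations of the previous slides.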

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation / stronger suppression

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
Neurons in monkey visual area V2 encode combinations of orientations
Akiyuki Anzai, Xinmiao Peng & David C. Van Essen
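The "weights learned from natural images" step is, in the 2005 model, essentially an imprinting of C1 patches. A simplified sketch (the array shapes, patch size, and random sampling scheme are my own assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def imprint_s2_templates(c1_maps, n_templates=5, patch=3):
    """Store patches of C1 responses (across orientations) sampled at random
    image positions; each stored patch becomes one S2 template, i.e. one
    'feature of moderate complexity' combining several orientations."""
    n_orient, h, w = c1_maps.shape
    templates = []
    for _ in range(n_templates):
        r = int(rng.integers(0, h - patch + 1))
        c = int(rng.integers(0, w - patch + 1))
        templates.append(c1_maps[:, r:r + patch, c:c + patch].copy())
    return templates
```

Run over thousands of natural images during the developmental-like stage, this yields the generic shared dictionary; at run time an S2 unit responds by Gaussian tuning around its stored template.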

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data

• V1
  • Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  • MAX operation in subset of complex cells (Lampl et al. 2004)
• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al. 2002)
  • Two-spot interaction (Freiwald et al. 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  • Tuning and invariance properties (Logothetis et al. 1995)
  • Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  • Rapid categorization (Serre, Oliva, Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are three colors for three types of data. Yellow is for data that were used to actually constrain the model; here those are V1 data used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and are still consistent with the new model. Finally, red indicates data consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and also data (e.g. with Miller, DiCarlo) which were actually collected in collaboration with experimentalists following a prediction from the model. I am not going to be able to guide you through all of these data, but I will show a few key examples.

Comparison w| V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (no parameter fitting)

Pasupathy & Connor 1999

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance
Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. model C2 units

Reference (fixed); probe (varying)

[Scatter figure: resp(pair) - resp(reference) plotted against resp(probe) - resp(reference)]

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin, with a > 0 coefficient

Agreement w| IT Read-out data

Hung, Kreiman, Poggio, DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in this study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and 'PFC' is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
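The "linear classifier on IT" point can be made concrete with a toy population read-out. Everything here is illustrative: the population size, noise level, and the least-squares fit (the actual read-out experiments used regularized/SVM-style classifiers) are my own stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for an IT population: two object classes evoke distinct mean
# response patterns across units, plus trial-to-trial noise.
n_trials, n_units = 200, 50
mu_a = rng.normal(size=n_units)
mu_b = rng.normal(size=n_units)
X = np.vstack([mu_a + 0.5 * rng.normal(size=(n_trials, n_units)),
               mu_b + 0.5 * rng.normal(size=(n_trials, n_units))])
labels = np.array([-1.0] * n_trials + [1.0] * n_trials)

# Least-squares linear read-out (weight vector plus bias column).
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, labels, rcond=None)
accuracy = float((np.sign(Xb @ w) == labels).mean())
```

The point of the remark is that if IT already provides selective yet invariant features, even such a simple linear stage suffices to categorize.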

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms; image-mask interval (ISI): 30 ms; mask (1/f noise): 80 ms

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.

Serre, Oliva & Poggio 2007

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse the image into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and roads.

The StreetScenes Database

Object             car    pedestrian  bicycle  building  tree   road   sky
Labeled examples   5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide
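The presenter note above describes classifying model features over windows "at all positions and scales". A minimal enumeration of such a dense scan (window size, stride, and pyramid factor are arbitrary choices of mine, not the system's settings):

```python
def sliding_windows(height, width, win, stride):
    """Top-left corners of all win x win crops visited by a dense scan."""
    return [(r, c)
            for r in range(0, height - win + 1, stride)
            for c in range(0, width - win + 1, stride)]

def scale_pyramid(height, width, factor=0.5, min_size=32):
    """Image sizes for a coarse-to-fine scan over scales; scanning the same
    fixed window on each resized image covers objects of different sizes."""
    sizes = []
    h, w = height, width
    while min(h, w) >= min_size:
        sizes.append((h, w))
        h, w = int(h * factor), int(w * factor)
    return sizes
```

Each (position, scale) crop would then be fed to the feature extractor and a per-class classifier, which is what makes the labeling of every object in the scene possible.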

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels

• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time); complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage
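A toy version of the temporal-continuity idea, in the spirit of Foldiak's (1991) trace rule rather than Masquelier & Thorpe's actual spike-based implementation; all parameters and the test "movie" are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def trace_rule(frames, eta=0.1, delta=0.9):
    """A unit's weights follow a Hebbian update driven by a temporally
    low-passed activity trace, so inputs that persist across consecutive
    frames (e.g. an object moving through the visual field) are grouped
    onto the same, invariance-learning unit."""
    w = rng.normal(size=frames.shape[1])
    w /= np.linalg.norm(w)
    y_bar = 0.0
    for x in frames:
        y = float(w @ x)
        y_bar = delta * y_bar + (1 - delta) * y      # activity trace
        w += eta * y_bar * (x - y_bar * w)           # Oja-style normalized Hebb
        w /= np.linalg.norm(w)
    return w

# A 'movie': one pattern persisting over many frames, plus pixel noise.
v = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
v /= np.linalg.norm(v)
frames = np.array([v + 0.05 * rng.normal(size=8) for _ in range(300)])
w = trace_rule(frames)
```

Because the trace decays slowly relative to frame changes, whatever stays correlated across time ends up driving (and wiring into) the same unit, which is exactly the invariance-from-continuity intuition on the slide.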

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50-100 frames

bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10-20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva, Poggio 2007
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

[Figure: model vs. human performance at 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and in the no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and the model ~18%

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Figure: face detection, scanning vs. attention; detection accuracy vs. number of items]

Example Results
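A cartoon of the guided-search scheme being implemented: a parallel stage applies top-down feature gains to bottom-up feature maps, and a serial stage visits the resulting priority peaks with inhibition of return. All maps, gains, and sizes here are invented for illustration:

```python
import numpy as np

def priority_map(feature_maps, gains):
    """Parallel stage: bottom-up feature maps weighted by top-down (task)
    gains and summed into a single priority map."""
    return np.tensordot(gains, feature_maps, axes=1)

def serial_fixations(pmap, n=2, radius=1):
    """Serial stage: spatial attention visits priority peaks one at a time,
    suppressing a neighborhood around each visited location."""
    p = pmap.astype(float).copy()
    fixations = []
    for _ in range(n):
        r, c = np.unravel_index(np.argmax(p), p.shape)
        fixations.append((int(r), int(c)))
        p[max(0, r - radius):r + radius + 1,
          max(0, c - radius):c + radius + 1] = -np.inf  # inhibition of return
    return fixations

fmaps = np.zeros((2, 5, 5))
fmaps[0, 1, 1] = 1.0   # weak response in one feature channel
fmaps[1, 3, 3] = 2.0   # strong response in another channel
fixations = serial_fixations(priority_map(fmaps, np.array([1.0, 1.0])))
```

Raising the gain on task-relevant channels moves their locations up the fixation order, which is how such a model suppresses clutter without scanning every window.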

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
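The core of these proposals is Bayes' rule applied, per hypothesis, at each level of the hierarchy. A two-line toy version (the numbers are invented; the cited models iterate such updates between areas until beliefs are globally consistent):

```python
import numpy as np

def posterior(prior, likelihood):
    """Combine a top-down prior over hypotheses with bottom-up likelihoods
    p(input | hypothesis). In the Lee-Mumford picture, each area repeats
    such updates with its neighbors until the hierarchy settles."""
    p = np.asarray(prior, dtype=float) * np.asarray(likelihood, dtype=float)
    return p / p.sum()

# Flat prior, bottom-up evidence strongly favoring hypothesis 0:
post = posterior([0.5, 0.5], [0.9, 0.1])
```

A top-down signal enters simply as a non-flat prior: the same evidence yields a different posterior when higher areas already expect one hypothesis.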

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003
The Mathematics of Learning: Dealing with Data
Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences

… Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream, which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection (Finding Frontal Faces)
  • ~10-year-old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model, learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: a gradual increase in RF size
  • Beyond V1: a gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophysical implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT read-out data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models, limitations
  • What is next: beyond feedforward models, limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 8: NIPS2007: visual recognition in primates and machines

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007. Modified from (Gross 1998).

A model of the ventral stream, which is also an algorithm [software available online]

Presenter Notes
For the past several years we have developed a computational theory of object recognition in the ventral stream of the visual cortex. The model closely follows the organization of the visual cortex (color encodes correspondence between stages/layers of the model and cortical areas). Units in the model mimic as closely as possible the tuning properties of cells in corresponding cortical areas; that is, the parameters governing the tuning properties of the units (e.g. frequency bandwidth, RF sizes, complexity of the preferred stimulus, etc.) were constrained by neural data.

…solves the problem (if mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%. (Serre, Oliva & Poggio 2007)

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter Notes
We tried simpler computer vision systems and they could not do the job.

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Object recognition for computer vision: personal historical perspective

[Timeline figure, 1990–2000s, by task: face detection, face identification, car detection, pedestrian detection, digit recognition, multi-class / multi-object recognition. Citations: Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues 1996–now; Osuna & Girosi 1997 (noted on the slide as "Best CVPR'07 paper 10 yrs ago"); LeCun et al. 1998; Rowley, Baluja & Kanade 1998; Schneiderman & Kanade 1998, 2000; Mohan, Papageorgiou & Poggio 1999; Amit and Geman 1999; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004; …]

… Many more excellent algorithms in the past few years…

Examples: Learning Object Detection (Finding Frontal Faces)

• Training database: 1,000+ real and 3,000+ virtual face patterns
• 500,000+ non-face patterns

Sung & Poggio 1995

~10-year-old CBCL computer vision work: a pedestrian detection system in a Mercedes test car, now becoming a product (MobilEye).

Object recognition in cortex: historical perspective

[Timeline figure, 1960–1990: V1 (cat), V1 (monkey), extrastriate cortex, IT–STS. Citations: Hubel & Wiesel 1959, 1962, 1965, 1977; Gross et al. 1969; Zeki 1973; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995]

… Much progress in the past 10 yrs.

Some personal history. First step in developing a model: learning to recognize 3D objects in IT cortex (Poggio & Edelman 1990).

Examples of Visual Stimuli

An idea for a module for view-invariant identification

An architecture that accounts for invariances to 3D effects (more than one view is needed to learn): a Regularization Network (GRBF) with Gaussian kernels. View-tuned, object-specific units, tuned along the view angle, feed a view-invariant, object-specific output unit.

Prediction: neurons become view-tuned through learning.

Poggio & Edelman 1990
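The Poggio–Edelman module is a Gaussian radial basis function network: each hidden unit is tuned to one stored training view, and the output unit sums the view-tuned activities. A minimal numerical sketch; the 2-D "view features", centers and weights are invented for illustration:

```python
import numpy as np

def view_invariant_unit(x, stored_views, c, sigma=1.0):
    """GRBF network: view-tuned hidden units (Gaussian kernels centered on
    stored training views) feed a view-invariant output unit."""
    d2 = ((stored_views - x) ** 2).sum(axis=1)   # squared distance to each view
    view_tuned = np.exp(-d2 / (2 * sigma ** 2))  # hidden-unit activations
    return c @ view_tuned

# One object seen from three training viewpoints (hypothetical features).
views = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
c = np.ones(3)
near = view_invariant_unit(np.array([0.5, 0.0]), views, c)  # novel nearby view
far = view_invariant_unit(np.array([8.0, 8.0]), views, c)   # unrelated input
assert near > far
```

With more than one training view, the same output unit responds to novel views of the object, matching the view-tuned to view-invariant progression the model predicts for IT.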

Learning to Recognize 3D Objects in IT Cortex (Logothetis, Pauls & Poggio 1995)

Examples of Visual Stimuli

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units … physiology.

Recording Sites in Anterior IT

[Figure: recording sites; sulci labeled LUN, LAT, IOS, STS, AMTS; Ho = 0]

Logothetis, Pauls & Poggio 1995

…neurons tuned to faces are intermingled nearby…

Neurons tuned to object views, as predicted by the model (Logothetis, Pauls & Poggio 1995)

A "View-Tuned" IT Cell

[Figure: firing rate (spikes/sec, 800-msec window) versus view angle, from −168° to +168° in 12° steps, for target views and for distractors]

Logothetis, Pauls & Poggio 1995

Presenter Notes
VT cell found in the experiment; shows preferred-stimulus-specific view tuning.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings). (Logothetis, Pauls & Poggio 1995)

Scale-Invariant Responses of an IT Neuron

[Figure: spike histograms (spikes/sec, 0–3000 msec) for stimulus sizes from 1.0° (×0.4 of the training size) to 6.25° (×2.5): 1.0°, 1.75°, 2.5°, 3.25°, 4.0°, 4.75°, 5.5°, 6.25°]

View-tuned cells: scale invariance (one training view only) motivates the present model.

Logothetis, Pauls & Poggio 1995

Presenter Notes
Pretty amazing.

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics
• INPUT = pulses (spikes) or graded potentials
• COMPUTATION = analog
• OUTPUT = chemical

Some numbers

• Human brain: 10^11–10^12 neurons (flies: ~1 million!); 10^14–10^15 synapses
• Neuron, fundamental space dimensions: fine dendrites ~0.1 μm in diameter; lipid-bilayer membrane 5 nm thick; specific proteins (pumps, channels, receptors, enzymes)
• Fundamental time scale: 1 msec

The cerebral cortex

                         Human                Macaque
Thickness                3–4 mm               1–2 mm
Total surface area       ~1600 cm²            ~160 cm²
(both sides)             (~50 cm diam)        (~15 cm diam)
Neurons/mm²              ~10⁵                 ~10⁵
Total cortical neurons   ~2 × 10¹⁰            ~2 × 10⁹
Visual cortex            300–500 cm²          80+ cm²
Visual neurons           ~4 × 10⁹             ~10⁹

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision.

Presenter Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning and adaptation are everywhere. → Q: what are the computational mechanisms? Why can we hope that there are common computational mechanisms? → common hardware: CORTEX.

The Visual System

[Van Essen & Anderson 1990]

Presenter Notes
In our studies of learning and information processing in the brain we focus on the visual system, the most extensively researched modality. [Figure: areas and their connections.]

V1: hierarchy of simple and complex cells

LGN-type cells → Simple cells → Complex cells

(Hubel & Wiesel 1959)

Presenter Notes
LGN-type cells: a dot of light, a bright dot on a dark background and vice versa. Simple cells: from dots of light to "oriented bars"; slightly larger RFs; either a white bar on a black background or vice versa; sensitive to the location of the bar within the RF. Complex cells: bar detectors insensitive to the polarity of contrast (white on black or black on white); larger RF; insensitive to the location of the bar within the RF. How does that work?

V1: Orientation selectivity

[Hubel & Wiesel movie]

V1: Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994] and [Rolls 2004]

Beyond V1: a gradual increase in RF size. Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: a gradual increase in the complexity of the preferred stimulus

AIT: Face cells. Reproduced from (Desimone et al. 1984)

AIT: Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005: identification and categorization.

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001.

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only.

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel–Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)

• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to the known biology (though many details/facts are unknown or still to be incorporated).

Two key computations

Unit type | Computation                     | Pooling operation
Simple    | Selectivity (template matching) | Gaussian-like tuning (AND-like)
Complex   | Invariance                      | Soft-max (OR-like)

Simple units: Gaussian-like tuning operation (AND-like). Complex units: max-like operation (OR-like).
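The two key computations can be written in a few lines. This sketch uses Euclidean Gaussian tuning and a hard max; the model itself uses normalized dot products and softmax-like pooling, so treat it as a simplified illustration:

```python
import numpy as np

def simple_unit(x, template, sigma=0.5):
    """AND-like selectivity: Gaussian tuning around a stored template;
    the response is maximal only when all inputs match the template."""
    return np.exp(-np.sum((x - template) ** 2) / (2 * sigma ** 2))

def complex_unit(afferent_responses):
    """OR-like invariance: max pooling over afferents that share a
    selectivity but differ in position or scale."""
    return np.max(afferent_responses)

template = np.array([1.0, 0.0, 1.0])
match = simple_unit(np.array([1.0, 0.0, 1.0]), template)    # perfect match
partial = simple_unit(np.array([1.0, 1.0, 0.0]), template)  # partial match
assert match == 1.0 and partial < match

# The complex unit inherits the best response over its afferents.
assert complex_unit(np.array([0.1, match, 0.3])) == match
```

Alternating the two operations builds selectivity (S layers) and invariance (C layers) gradually up the hierarchy.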

Gaussian tuning

• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophysical implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit, using shunting inhibition.

Presenter Notes
Add cable [model] and cite relevant people.

Of the same form as the model of MT (Rust et al., Nature Neuroscience, 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
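One concrete form of such a canonical circuit is divisive normalization: a ratio of a weighted sum to a pooled input, where the exponents decide whether the circuit behaves like a tuning operation or like a softmax that approaches the max. The exponents and inputs below are illustrative only:

```python
import numpy as np

def softmax_circuit(x, q=2.0, k=1e-9):
    """Gain-control ratio sum(x^(q+1)) / (k + sum(x^q)); as q grows,
    the largest input dominates and the output approaches max(x)."""
    return np.sum(x ** (q + 1)) / (k + np.sum(x ** q))

def tuning_circuit(x, w, k=1e-9):
    """Normalized dot product with a template w; a Gaussian-like tuning
    curve can be obtained by passing this through a sigmoid."""
    return np.dot(w, x) / (k + np.linalg.norm(x))

x = np.array([0.2, 0.9, 0.4])
assert abs(softmax_circuit(x, q=2.0) - x.max()) < 0.15  # already max-like
assert abs(softmax_circuit(x, q=8.0) - x.max()) < 0.01  # sharper for larger q

w = np.array([1.0, 0.0])
assert tuning_circuit(np.array([2.0, 0.0]), w) > tuning_circuit(np.array([1.0, 1.0]), w)
```

The point of the slide is that one circuit with shunting (divisive) inhibition can implement both operations by changing parameters rather than wiring.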

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  - Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  - Supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5–10 subunits chosen at random from all possible afferents (~100–1000)

[Figure: stronger facilitation / stronger suppression]

Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975. "Neurons in monkey visual area V2 encode combinations of orientations," Akiyuki Anzai, Xinmiao Peng & David C. Van Essen.

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of that C2 unit; so the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features, with various levels of selectivity and invariance

Presenter Notes
In an even more extreme version. Later I am going to show that this is important.

• V1
  - Simple- and complex-cell tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  - MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  - Rapid categorization (Serre, Oliva, Poggio 2007)
  - Face processing, fMRI + psychophysics (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Comparison w/ neural data

Presenter Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are three colors for three types of data. Yellow is for data used to actually constrain the model; here those are V1 data used to obtain a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Green marks data that were already predicted by the original model and are still consistent with the new model. Red indicates data consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w/ V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001). A V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78, with no parameter fitting. (Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures, etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J. Neurophysiol. 98: 1733–1750, 2007. First published June 27, 2007. "A Model of V4 Shape Selectivity and Invariance," Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio.

Presenter Notes
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed) and probe (varying) stimuli. Selectivity = response(probe) − response(reference); sensory interaction = response(pair) − response(reference). Prediction: the response to the pair falls between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes
The biased-competition model predicts that with attention away, the line should pass through the origin and the coefficient should be > 0.

Agreement w/ IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT, which is where they recorded from in the study, object category and identity can already be read out from the very first spikes.

Remarks

• The stage that includes (V4–PIT)–AIT–PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well.
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments.
• The inputs to IT are a large dictionary of selective and invariant features.
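The read-out claim is easy to state in code: object category is obtained from the IT-like population by a plain linear classifier. The "population responses" below are synthetic stand-ins (Gaussian clusters with a few category-selective dimensions), not real recordings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 50                 # trials per category, "neurons"
mu = np.zeros(d)
mu[:5] = 3.0                   # a few strongly category-selective neurons
cat_a = rng.normal(size=(n, d)) + mu
cat_b = rng.normal(size=(n, d)) - mu
X = np.vstack([cat_a, cat_b])
y = np.concatenate([np.ones(n), -np.ones(n)])

# Linear read-out: nearest class mean, i.e. a fixed linear decision rule.
w = cat_a.mean(axis=0) - cat_b.mean(axis=0)
b = -0.5 * (cat_a.mean(axis=0) + cat_b.mean(axis=0)) @ w
accuracy = np.mean(np.sign(X @ w + b) == y)
assert accuracy > 0.95
```

This mirrors the claim that a simple, biologically plausible, even linear read-out suffices once the feature dictionary feeding it is selective and invariant.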

Rapid categorization

[Show animal / non-animal movie] Database collected by Oliva & Torralba

Presenter Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test the feedforward model): animal present, or not?

Image (20 ms) → interval image–mask (ISI 30 ms) → mask, 1/f noise (80 ms)

Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Macé et al. 2005

Presenter Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, a fairly long SOA. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%. (Serre, Oliva & Poggio 2007)

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter Notes
We tried simpler computer vision systems and they could not do the job.
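For reference, d′ combines the hit rate and the false-alarm rate through the inverse normal CDF, which is why it behaves like a standardized, criterion-free error measure. The rates below are illustrative, in the range reported (~80% correct, false alarms around 16%):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Hypothetical animal/non-animal rates, for illustration only.
print(round(d_prime(0.80, 0.16), 2))  # -> 1.84
```

Because d′ discounts response bias, it allows a fair comparison between human observers and the model even when their hit and false-alarm rates differ.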

Further comparisons

• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
• Model predicts the level of performance on rotated images (90° and inversion)

Serre, Oliva & Poggio, PNAS, 2007

Source: Bileschi & Wolf

The street scene project

Presenter Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse the image into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, classified as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object:           car   pedestrian  bicycle  building  tree  road  sky
Labeled examples: 5799  1449        209      5067      4932  3400  2562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter Notes
TODO: add 420 true labels to this slide.

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI, 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised, developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised, developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  - Supervised learning

Layers of cortical processing units

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage
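A toy version of this idea is Foldiak's trace rule: Hebbian learning gated by a decaying trace of the unit's activity, so that views appearing in close temporal succession (frames of the same moving object) become wired onto the same unit. Everything below (the one-hot "views", the seeding, the constants) is invented for illustration:

```python
import numpy as np

def trace_rule(sequence, n_inputs, eta=0.2, delta=0.6, epochs=20):
    """Foldiak-style trace rule: Hebbian weight growth driven by a decaying
    trace of postsynaptic activity, linking temporally adjacent inputs."""
    w = np.zeros(n_inputs)
    w[sequence[0]] = 1.0          # seed: the unit initially prefers view 0
    for _ in range(epochs):
        trace = 0.0
        for pos in sequence:
            x = np.zeros(n_inputs)
            x[pos] = 1.0          # one-hot "view" at this time step
            y = w @ x             # postsynaptic response
            trace = delta * trace + (1 - delta) * y
            w += eta * trace * x  # Hebbian update, gated by the trace
        w /= np.linalg.norm(w)    # keep the weights bounded
    return w

# A bar sweeping over positions 0 -> 1 -> 2; positions 3, 4 never follow it.
w = trace_rule([0, 1, 2], n_inputs=5)
assert w[1] > 0.1 and w[2] > 0.05   # pooled the temporally adjacent views
assert w[3] == 0.0 and w[4] == 0.0  # untouched: no temporal continuity
```

The resulting unit pools exactly the afferents a C (complex) unit should pool, without any supervision: temporal continuity alone provides the teaching signal.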

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised, developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem

Training videos and testing videos; each video ~4 s, 50–100 frames. Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2. Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005).

Previous work: recognizing biological motion using a model of the dorsal stream. Adapted from (Giese & Poggio 2003).

Multi-class recognition accuracy:

Dataset       Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10–20%) H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models, limitations

[Data panels: psychophysics (Serre, Oliva, Poggio 2007); IT (Zoccolan, Kouh, Poggio, DiCarlo 2007); V4 (Reynolds, Chelazzi & Desimone 1999)]

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, the model is not good enough. Model vs. humans at 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and in the no-mask condition. (Serre, Oliva and Poggio, PNAS, 2007)

Presenter Notes
FA ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based, top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.).

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Example Results

[Figure: face detection, scanning vs. attention; detection accuracy vs. number of items]

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10-year-old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: orientation selectivity
  • V1: retinotopy
  • Slide Number 34
  • Beyond V1: a gradual increase in RF size
  • Beyond V1: a gradual increase in the complexity of the preferred stimulus
  • AIT: face cells
  • AIT: immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT read-out data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next? Beyond feedforward models: limitations
  • What is next? Beyond feedforward models: limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next? Image inference: backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 9: NIPS2007: visual recognition in primates and machines

…solves the problem (if mask forces feedforward processing)…

Human observers (n = 24): 80% correct. Model: 82% correct. (Serre, Oliva & Poggio 2007)
• d' ~ standardized error rate
• the higher the d', the better the performance

[Speaker notes: We tried simpler computer vision systems and they could not do the job.]

Outline: 1. Problem of visual recognition, visual cortex; 2. Historical background; 3. Neurons and areas in the visual system; 4. Data and feedforward hierarchical models; 5. What is next?

Object recognition for computer vision: personal historical perspective

A timeline, 1990-2000+ (tasks: face detection, face identification, car detection, pedestrian detection, multi-class / multi-object recognition, digit recognition):

Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues 1996-now; Osuna & Girosi 1997 (best CVPR'07 paper from 10 yrs ago); LeCun et al. 1998; Schneiderman & Kanade 1998, 2000; Rowley, Baluja & Kanade 1998; Mohan, Papageorgiou & Poggio 1999; Amit and Geman 1999; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004; … Many more excellent algorithms in the past few years…

Examples: Learning Object Detection, Finding Frontal Faces

• Training database: 1,000+ real and 3,000+ virtual face patterns
• 500,000+ non-face patterns

Sung & Poggio, 1995

~10-year-old CBCL computer vision work: a pedestrian detection system in a Mercedes test car, now becoming a product (MobilEye).

Object recognition in cortex: historical perspective

A timeline, 1960-1990+ (V1 cat, V1 monkey, extrastriate cortex, IT-STS):

Hubel & Wiesel 1959, 1962, 1965, 1977; Gross et al. 1969; Zeki 1973; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995; … Much progress in the past 10 yrs.

Some personal history. First step in developing a model: learning to recognize 3D objects in IT cortex (Poggio & Edelman 1990). Examples of visual stimuli.

An idea for a module for view-invariant identification: an architecture that accounts for invariance to 3D effects (>1 view needed to learn). A Regularization Network (GRBF) with Gaussian kernels: view-tuned units, each centered on a training view angle, feed a view-invariant, object-specific unit. Prediction: neurons become view-tuned through learning. (Poggio & Edelman 1990)
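The module can be sketched concretely. Below is a minimal Python illustration (my own toy, with invented feature vectors, not the 1990 implementation): each view-tuned unit is a Gaussian centered on a stored training view, and their weighted sum yields a view-invariant, object-specific response.

```python
import numpy as np

# Sketch of the Poggio & Edelman (1990) view-invariance module (GRBF).
# Each hidden unit is "view-tuned" (a Gaussian centered on one stored
# training view); their weighted sum is a view-invariant unit.

def view_tuned_response(x, center, sigma=1.0):
    """Gaussian tuning of one view-tuned unit to input feature vector x."""
    return np.exp(-np.sum((x - center) ** 2) / (2.0 * sigma ** 2))

def view_invariant_unit(x, stored_views, weights, sigma=1.0):
    """Weighted sum of view-tuned units -> view-invariant response."""
    return sum(w * view_tuned_response(x, v, sigma)
               for w, v in zip(weights, stored_views))

# Toy example: three stored training views of one object (random features).
rng = np.random.default_rng(0)
views = [rng.normal(size=8) for _ in range(3)]
weights = [1.0, 1.0, 1.0]

# A test input near a stored view responds strongly...
near = views[0] + 0.05 * rng.normal(size=8)
# ...while an unrelated pattern responds weakly.
far = 3.0 * rng.normal(size=8)

print(view_invariant_unit(near, views, weights) >
      view_invariant_unit(far, views, weights))  # -> True
```

The point of the sketch is the architecture, not the numbers: more than one stored view is needed before the summed unit tolerates novel viewpoints.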

Learning to Recognize 3D Objects in IT Cortex (Logothetis, Pauls & Poggio 1995). Examples of visual stimuli. After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units, … physiology.

Recording sites in anterior IT (landmarks: LUN, LAT, IOS, STS, AMTS; Ho = 0). (Logothetis, Pauls & Poggio 1995) …neurons tuned to faces are intermingled nearby…

Neurons tuned to object views, as predicted by the model (Logothetis, Pauls & Poggio 1995).

A "view-tuned" IT cell: [Figure: firing rate (spikes/sec, 800 msec window) of one IT neuron for target views spanning -168° to +168° in 12° steps, and for distractors; the cell responds selectively around one preferred view.] (Logothetis, Pauls & Poggio 1995)

[Speaker notes: a view-tuned cell found in the experiment; view tuning specific to the preferred stimulus.]

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings) (Logothetis, Pauls & Poggio 1995).

Scale-invariant responses of an IT neuron: [Figure: spike rates (spikes/sec, 0-3000 msec) for the preferred object shown at 1.0, 1.75, 2.5, 3.25, 4.0, 4.75, 5.5 and 6.25 deg (×0.4 to ×2.5 of the training size); the response is largely preserved across scales.]

View-tuned cells: scale invariance (one training view only) motivates the present model (Logothetis, Pauls & Poggio 1995).

[Speaker notes: pretty amazing.]

From "HMAX" to the model now… (Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva & Poggio 2007)

Outline: 1. Problem of visual recognition, visual cortex; 2. Historical background; 3. Neurons and areas in the visual system; 4. Data and feedforward hierarchical models; 5. What is next?

Neural Circuits (source: modified from Jody Culham's web slides)

Neuron basics: INPUT = pulses (spikes) or graded potentials; COMPUTATION = analog; OUTPUT = chemical.

Some numbers:
• Human brain: 10^11-10^12 neurons (~10^6 in a fly); 10^14-10^15 synapses
• Neuron, fundamental spatial dimensions: fine dendrites ~0.1 μm diameter; lipid bilayer membrane ~5 nm thick, with specific proteins (pumps, channels, receptors, enzymes)
• Fundamental time length: ~1 msec

The cerebral cortex:

                            Human                      Macaque
  Thickness                 3-4 mm                     1-2 mm
  Total surface area        ~1600 cm² (~50 cm diam)    ~160 cm² (~15 cm diam)
  (both sides)
  Neurons/mm²               ~10⁵                       ~10⁵
  Total cortical neurons    ~2 × 10¹⁰                  ~2 × 10⁹
  Visual cortex             300-500 cm²                80+ cm²
  Visual neurons            ~4 × 10⁹                   ~10⁹

Gross Brain Anatomy: a large percentage of the cortex is devoted to vision.

[Speaker notes: Different brain regions are involved in different tasks (vision, audition, somatosensation, motor…), yet learning and adaptation are everywhere. Why can we hope that there are common computational mechanisms? Because there is common hardware: cortex.]

The Visual System [Van Essen & Anderson, 1990]

[Speaker notes: In our studies of learning and information processing in the brain we focus on the visual system, the most extensively researched modality: areas and connections.]

V1: hierarchy of simple and complex cells (Hubel & Wiesel 1959): LGN-type cells → simple cells → complex cells.

[Speaker notes: LGN-type cells respond to a dot of light, a bright dot on a dark background or vice versa. Simple cells go from dots to oriented bars: slightly larger receptive fields, selective for contrast polarity (white bar on black or vice versa) and for the location of the bar within the RF. Complex cells are bar detectors insensitive to polarity of contrast, with larger RFs, insensitive to the location of the bar within the RF. How does that work?]

V1: orientation selectivity (Hubel & Wiesel movie).

V1: retinotopy (Thorpe and Fabre-Thorpe 2001).

Beyond V1: a gradual increase in RF size (reproduced from [Kobatake & Tanaka 1994] and [Rolls 2004]).

Beyond V1: a gradual increase in the complexity of the preferred stimulus (reproduced from Kobatake & Tanaka 1994).

AIT: face cells (reproduced from Desimone et al. 1984).

AIT: immediate recognition; both identification and categorization can be read out rapidly (Hung, Kreiman, Poggio & DiCarlo 2005; see also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001).

Outline: 1. Problem of visual recognition, visual cortex; 2. Historical background; 3. Neurons and areas in the visual system; 4. Data and feedforward hierarchical models; 5. What is next?

The ventral stream (source: Lennie, Maunsell, Movshon; Thorpe and Fabre-Thorpe 2001). We consider the feedforward architecture only.

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition".

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations:

• Simple units: selectivity / template matching, via a Gaussian-like tuning operation (AND-like)
• Complex units: invariance, via a soft-max / max-like pooling operation (OR-like)
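The two operations can be sketched in a few lines of Python (an illustrative toy, not the actual model code; the template and input vectors are invented):

```python
import numpy as np

# The model's two key computations (in the spirit of Riesenhuber & Poggio
# 1999 and later variants):
# Simple (S) units: Gaussian template matching over afferents (AND-like).
# Complex (C) units: max pooling over S units with the same selectivity
# but different positions/scales (OR-like), building invariance.

def s_unit(x, template, sigma=1.0):
    """Gaussian tuning: peaks when the input matches the stored template."""
    return np.exp(-np.linalg.norm(x - template) ** 2 / (2 * sigma ** 2))

def c_unit(s_responses):
    """Max pooling: takes the strongest S-unit response."""
    return max(s_responses)

template = np.array([1.0, 0.0, 1.0])
# S responses to the same feature at three positions in the visual field:
inputs = [np.array([1.0, 0.0, 1.0]),   # feature present here (exact match)
          np.array([0.2, 0.9, 0.1]),   # not here
          np.array([0.0, 0.0, 0.0])]   # not here
responses = [s_unit(x, template) for x in inputs]
print(c_unit(responses))  # -> 1.0: the C unit fires wherever the feature appears
```

The AND-like operation gives selectivity (all afferents must match the template); the OR-like operation gives invariance (any matching position suffices).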

Gaussian tuning:
• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation:
• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)
(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophysical implementation:
• Max- and Gaussian-like tuning can be approximated with the same canonical circuit, using shunting inhibition
• Of the same form as a model of MT (Rust et al., Nature Neuroscience, 2007)
• Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002); cf. Adelson and Bergen (see also Hassenstein and Reichardt 1956)

[Speaker notes: add cable model and cite relevant people.]

• A generic, overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation; unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC); supervised learning, ~ Gaussian RBF

S2 units:
• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000), ranging from stronger facilitation to stronger suppression

Cf. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen, "Neurons in monkey visual area V2 encode combinations of orientations", Nature Neuroscience 10, 1313-1321 (2007), published online 16 September 2007, doi:10.1038/nn1975.

C2 units:
• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

[Speaker notes: the existence of simple vs. complex units at this level is a model prediction that still needs to be shown experimentally. One difficulty: it would be very hard to tell apart, say, an S2 from a C2 unit; you would have to isolate each cell's subunits. The S2 units projecting onto a C2 unit share its selectivity, so the only difference between S2 and C2 units is the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.]

A loose hierarchy:
• Bypass routes along with main routes:
  - From V2 to TEO, bypassing V4 (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE, bypassing TEO (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• A richer dictionary of features, with various levels of selectivity and invariance

[Speaker notes: there is an even more extreme version; later I will show why this is important.]

Comparison w/ neural data (Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005):

• V1: simple- and complex-cell tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982); MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4: tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999); MAX operation (Gawne et al. 2002); two-spot interaction (Freiwald et al. 2005); tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007); tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT: tuning and invariance properties (Logothetis et al. 1995); differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003); read-out data (Hung, Kreiman, Poggio & DiCarlo 2005); pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human: rapid categorization (Serre, Oliva, Poggio 2007); face processing, fMRI + psychophysics (Riesenhuber et al. 2004; Jiang et al. 2006)

[Speaker notes: For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Three types of data: data used to constrain the model (V1 data used to build a population of model simple and complex units mimicking cortical tuning properties); data already predicted by the original model and still consistent with the new one; and data consistent with the new model, in some cases provided by the authors of the corresponding studies, in others collected with experimentalists (e.g. Miller, DiCarlo) following a model prediction.]

Comparison w/ V4: tuning for curvature and boundary conformations (Pasupathy & Connor 2001). A V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (stimuli from Pasupathy & Connor 1999), with no parameter fitting (Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005).

[Speaker notes: the learning rule generates S2 units that seem congruent with V4; Cadieu and Kouh showed that the population of S2/C2 units generated is compatible, at the population level, with data from several groups.]

Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio, "A Model of V4 Shape Selectivity and Invariance", J. Neurophysiol. 98: 1733-1750, 2007 (first published June 27, 2007).

[Speaker notes: more recently Cadieu and Kouh showed that model C2 units can be fit to single cells in V4; the weights are then no longer learned from natural images but fit to each cell with a greedy learning algorithm.]

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units. A reference (fixed) and a probe (varying) stimulus are presented; plotting resp(pair) - resp(reference) against resp(probe) - resp(reference), the prediction is that the response to the pair falls between the responses elicited by the two stimuli alone (Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005).

[Speaker notes: the biased-competition model predicts that with attention away, the line should pass through the origin with a positive slope coefficient.]

Agreement w/ IT read-out data (Hung, Kreiman, Poggio, DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005).

[Speaker notes: the level of invariance to position and scale is actually predicted by the model. One key result of the read-out study: the visual system is extremely fast; from the very first spikes once the signal reaches IT, one can already read out object category and identity.]

Remarks:
• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
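The linear read-out idea can be illustrated with a toy simulation (my own sketch with synthetic "IT responses", not the recorded data from Hung et al. 2005):

```python
import numpy as np

# Toy linear read-out: object category is decoded from a population of
# IT-like feature responses with a plain linear classifier.
rng = np.random.default_rng(0)
n_units, n_trials = 50, 200

# Two categories produce different mean population responses plus noise.
proto_a = rng.normal(size=n_units)
proto_b = rng.normal(size=n_units)
X = np.vstack([proto_a + 0.5 * rng.normal(size=(n_trials, n_units)),
               proto_b + 0.5 * rng.normal(size=(n_trials, n_units))])
y = np.array([1] * n_trials + [-1] * n_trials)

# Least-squares linear classifier via the pseudo-inverse.
w = np.linalg.pinv(X) @ y
acc = np.mean(np.sign(X @ w) == y)
print(acc > 0.95)  # -> True: a linear read-out separates the categories easily
```

The design point: if the features feeding the classifier are already selective and invariant, a simple linear stage suffices, which is exactly the claim about the IT-to-"PFC" stage above.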

Rapid categorization [animal / non-animal movie; database collected by Oliva & Torralba].

[Speaker notes: any task can be made arbitrarily easy or difficult.]

Rapid categorization task (with mask to test the feedforward model): animal present or not? Image shown for 20 ms; 30 ms inter-stimulus interval; then a 1/f-noise mask for 80 ms (Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Mace et al. 2005).

[Speaker notes: 20 ms presentation with 30 ms ISI is equivalent to an SOA of 50 ms, which is fairly long. The model is feedforward; the mask tries to block feedback in visual cortex.]

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80% correct. Model: 82% correct. (Serre, Oliva & Poggio 2007)
• d' ~ standardized error rate
• the higher the d', the better the performance

[Speaker notes: we tried simpler computer vision systems and they could not do the job.]
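For reference, d' is computed from hit and false-alarm rates as d' = z(H) - z(F), with z the inverse CDF of the standard normal (a standard signal-detection formula; the rates below are illustrative, chosen to match the ~80% correct and ~16% false-alarm regime mentioned in the notes, not the paper's raw data):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)

# Illustrative values: ~80% hits, ~16% false alarms.
print(round(d_prime(0.80, 0.16), 2))  # -> 1.84
```

Unlike percent correct, d' separates sensitivity from response bias, which is why it is used to compare humans and the model on equal footing.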

Further comparisons:
• Image-by-image correlation between model and humans: heads ρ = 0.71; close-body ρ = 0.84; medium-body ρ = 0.71; far-body ρ = 0.60
• The model predicts the level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS, 2007

The street scene project (source: Bileschi & Wolf).

[Speaker notes: work by Stan Bileschi and Lior Wolf (part of Stan's PhD thesis; Lior was a postdoc in the lab). This is the old AI dream: input an image such as a street scene and have the system automatically parse it into its main objects. They used features produced by the model, computed over small windows at all positions and scales in the image, classified as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.]

The StreetScenes Database: 3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules. Labeled examples: car 5,799; pedestrian 1,449; bicycle 209; building 5,067; tree 4,932; road 3,400; sky 2,562. http://cbcl.mit.edu/software-datasets/streetscenes

Examples [four slides of labeled StreetScenes results].

Baseline systems compared against:
• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI, 2007

Outline: 1. Problem of visual recognition, visual cortex; 2. Historical background; 3. Neurons and areas in the visual system; 4. Data and feedforward hierarchical models; 5. What is next?

What is next?
• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

Layers of cortical processing units:
• Generic dictionary of shape components (from V1 to IT): unsupervised learning during a developmental-like stage, learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC): supervised learning

Learning the invariance from temporal continuity (w/ T. Masquelier & S. Thorpe, CNRS, France; cf. Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005):
• Simple cells learn correlations in space (at the same time)
• Complex cells learn correlations in time
[Movie]

[Speaker notes: ~19 hours of video, ~10^6 frames; a developmental-like learning stage.]
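A toy sketch of this kind of temporal-continuity ("trace rule") learning, in the spirit of Foldiak (1991) rather than the actual Masquelier & Thorpe implementation (all constants invented):

```python
import numpy as np

# Trace-rule sketch: a complex unit strengthens connections to simple units
# that fire in close temporal succession, because the world usually shows
# the same feature translating smoothly over time.

rng = np.random.default_rng(1)
n_simple = 5            # simple units = the same feature at 5 positions
w = np.zeros(n_simple)  # complex-unit weights onto the simple units
trace = 0.0             # low-pass-filtered (remembered) complex activity
eta, decay = 0.5, 0.6

# A tiny "video": the feature sweeps across positions 0..4, then back.
sequence = [0, 1, 2, 3, 4, 3, 2, 1, 0]
for pos in sequence:
    s = np.zeros(n_simple)
    s[pos] = 1.0                       # simple unit at current position fires
    y = w @ s + 0.1                    # complex activity (small baseline drive)
    trace = decay * trace + (1 - decay) * y
    w += eta * trace * s               # Hebbian update gated by the trace
    w = np.clip(w, 0.0, 1.0)

# After the sweep, the complex unit is wired to the feature at all positions:
print(np.all(w > 0))  # -> True
```

The trace is what links frames: because it decays slowly, a simple unit active *now* gets credit for complex activity driven by its neighbors a moment *ago*, so position-invariant pooling emerges without supervision.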


The problem: training videos and testing videos, each ~4 s, 50-100 frames; actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2. Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005).

Previous work: recognizing biological motion using a model of the dorsal stream (adapted from Giese & Poggio 2003).

Multi-class recognition accuracy (chance: 10-20%):

  Dataset       Baseline   Our system
  KTH Human     81.3%      91.6%
  UCSD Mice     75.6%      79.0%
  Weiz. Human   86.7%      96.3%
  Average       81.2%      89.6%

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007.

[Speaker notes: our method has 8% better performance on average.]

What is next? Beyond feedforward models: limitations (psychophysics: Serre, Oliva, Poggio 2007; IT: Zoccolan, Kouh, Poggio, DiCarlo 2007; V4: Reynolds, Chelazzi & Desimone 1999).

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice this is a "novel" justification for the need of attention

Limitations: beyond ~50 ms the model is not good enough. Conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and a no-mask condition (Serre, Oliva and Poggio, PNAS, 2007).

[Speaker notes: false alarms ~16% for humans (except ~14% with no mask) and ~18% for the model.]

Ongoing work… attention and cortical feedbacks: a model implementation of Wolfe's guided search (1994), with parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (cf. Tsotsos et al.). Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Example results: face detection, scanning vs. attention; detection accuracy as a function of the number of items.

What is next?
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next? Image inference: backprojections and attentional mechanisms.
• Normal recognition by humans (for long presentation times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• The feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions: which biophysical mechanisms for routing/gating? What is the nature of the routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines. Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005.
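As a toy illustration of the idea (my own, not from any of the cited papers; all numbers invented): a single Bayesian update combining a top-down prior over hypotheses with a bottom-up feature likelihood.

```python
# Toy Bayesian update: a high-level "animal present" hypothesis combined
# with the likelihood of a low-level feature under each hypothesis.

prior = {"animal": 0.5, "non-animal": 0.5}   # top-down expectation
likelihood = {                                # P(feature | hypothesis)
    "animal": 0.8,       # e.g. a curved-contour feature, common on animals
    "non-animal": 0.3,
}

# Bottom-up evidence: the feature was detected. Bayes' rule:
unnorm = {h: prior[h] * likelihood[h] for h in prior}
Z = sum(unnorm.values())
posterior = {h: p / Z for h, p in unnorm.items()}
print(round(posterior["animal"], 3))  # -> 0.727
```

In the hierarchical versions cited above, updates like this are chained across areas, with feedback carrying the top-down terms and feedforward signals carrying the likelihoods.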

What is next?
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How, then, do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.



1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Object recognition for computer vision: personal historical perspective

[Timeline figure, 1990-2000s, of detection/recognition systems. Tasks covered: face detection, face identification, car detection, pedestrian detection, digit recognition, multi-class/multi-objects.]

Systems on the timeline: Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues, 1996-now; Osuna & Girosi 1997 (best CVPR'07 paper from 10 yrs ago); LeCun et al. 1998; Schneiderman & Kanade 1998; Rowley, Baluja & Kanade 1998; Mohan, Papageorgiou & Poggio 1999; Amit and Geman 1999; Schneiderman & Kanade 2000; Viola & Jones 2001; Belongie & Malik 2002; Argawal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004; …

… Many more excellent algorithms in the past few years…

Examples: Learning Object Detection: Finding Frontal Faces

• Training database:
  – 1,000+ real and 3,000+ virtual face patterns
  – 500,000+ non-face patterns

Sung & Poggio 1995

~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)

Object recognition in cortex: historical perspective

[Timeline figure, 1960-1990, by area: V1 cat, V1 monkey, extrastriate cortex, IT-STS.]

Studies on the timeline: Hubel & Wiesel 1959, 1962, 1965, 1977; Gross et al. 1969; Zeki 1973; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995

… Much progress in the past 10 yrs.

Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex

Poggio & Edelman 1990

Examples of Visual Stimuli

An idea for a module for view-invariant identification

Architecture that accounts for invariances to 3D effects (>1 view needed to learn): a Regularization Network (GRBF) with Gaussian kernels. Units tuned to particular view angles feed a VIEW-INVARIANT, OBJECT-SPECIFIC output unit.

Prediction: neurons become view-tuned through learning.

Poggio & Edelman 1990
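The GRBF module can be sketched in a few lines. This is an illustrative sketch only, not the published implementation: the 4-dimensional "views", the unit weights, and the Gaussian width are invented for the example, and NumPy is assumed.

```python
import numpy as np

def grbf_output(x, centers, weights, sigma=1.0):
    """View-invariant output unit: a weighted sum of Gaussian
    view-tuned units (regularization network / GRBF form)."""
    # Each row of `centers` is one stored 2D view (template) of the object.
    d2 = ((centers - x) ** 2).sum(axis=1)          # squared distance to each stored view
    view_tuned = np.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian view-tuned responses
    return float(weights @ view_tuned)             # learned linear combination

# Toy example: an "object" represented by 3 stored views in a 4-D feature space.
views = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
w = np.ones(3)

# A stored view drives the output unit strongly; a distant pattern does not.
on_view = grbf_output(views[0], views, w)
off_view = grbf_output(np.array([0.0, 0.0, 0.0, 5.0]), views, w)
assert on_view > off_view
```

With more than one training view, the same unit responds across viewpoints, which is the sense in which the module is view-invariant yet object-specific.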

Learning to Recognize 3D Objects in IT Cortex

Logothetis, Pauls & Poggio 1995

Examples of Visual Stimuli

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units … physiology:

Recording Sites in Anterior IT

[Figure: recording sites on the macaque brain; sulci labeled LUN, LAT, IOS, STS, AMTS; Ho = 0 reference line.]

Logothetis, Pauls & Poggio 1995

…neurons tuned to faces are intermingled nearby…

Neurons tuned to object views, as predicted by model

Logothetis, Pauls & Poggio 1995

A "View-Tuned" IT Cell

[Figure: responses (spikes/sec over an 800 msec window) of one IT neuron to target views from -168° to +168° in 12° steps, and to 120 distractors; the response peaks around a preferred view.]

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
View-tuned cell found in the experiment; stimulus-specific view tuning.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale-Invariant Responses of an IT Neuron

[Figure: spike-rate histograms (0-76 spikes/sec, 0-3000 msec) for the preferred stimulus shown at sizes from 1.0° to 6.25° (scale factors ×0.4 to ×2.5); the response is largely preserved across scale.]

View-tuned cells: scale invariance (one training view only) motivates present model

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
Pretty amazing.

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva & Poggio 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

• INPUT = pulses (spikes) or graded potentials
• COMPUTATION = analog
• OUTPUT = chemical

Some numbers

• Human brain
  – 10^11-10^12 neurons (1 million in flies!)
  – 10^14-10^15 synapses
• Neuron
  – Fundamental space dimensions: fine dendrites ~0.1 µm diameter; lipid-bilayer membrane 5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – Fundamental time length: 1 msec

The cerebral cortex

                         Human                     Macaque
Thickness                3-4 mm                    1-2 mm
Total surface area       ~1600 cm² (both sides;    ~160 cm²
(both sides)             ~50 cm diam)              (~15 cm diam)
Neurons / mm²            ~10^5                     ~10^5
Total cortical neurons   ~2 × 10^10                ~2 × 10^9
Visual cortex            300-500 cm²               80+ cm²
Visual neurons           ~4 × 10^9                 ~10^9

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision.

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation are everywhere. → Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? → common hardware: CORTEX.

The Visual System

[Van Essen & Anderson, 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system, the most extensively researched modality. Areas are connected.

V1: hierarchy of simple and complex cells

LGN-type cells → simple cells → complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: a dot of light, a bright dot on a dark background or vice-versa. From dots of light to "oriented bars": simple cells have slightly larger RFs, respond to a white bar on a black background or vice-versa, and are sensitive to the location of the bar within the RF. Complex cells are bar detectors insensitive to polarity of contrast (white on black or black on white), with larger RFs, insensitive to the location of the bar within the RF. How does that work?

V1: orientation selectivity

Hubel & Wiesel movie

V1: retinotopy

(Thorpe and Fabre-Thorpe, 2001)

Reproduced from [Kobatake & Tanaka 1994]; reproduced from [Rolls 2004]

Beyond V1: a gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: a gradual increase in the complexity of the preferred stimulus

AIT: face cells

Reproduced from (Desimone et al. 1984)

AIT: immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification / categorization

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe, 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)

• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                       Operation
Simple      Selectivity (template matching)   Gaussian-like tuning (AND-like)
Complex     Invariance                        Soft-max (OR-like)

Simple units: Gaussian-like tuning operation (AND-like)
Complex units: max-like operation (OR-like)
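The two computations above can be written down in a few lines. This is a minimal sketch assuming NumPy; the softmax exponent q, the Gaussian width, and the toy inputs are illustrative choices, not fitted model parameters.

```python
import numpy as np

def simple_unit(x, w, sigma=0.5):
    """Selectivity (AND-like): Gaussian tuning around a stored template w."""
    return float(np.exp(-np.sum((x - w) ** 2) / (2.0 * sigma ** 2)))

def complex_unit(afferents, q=8.0):
    """Invariance (OR-like): soft-max pooling over afferent responses.
    As q grows, the output approaches a hard max."""
    a = np.asarray(afferents, dtype=float)
    return float(np.sum(a ** (q + 1)) / np.sum(a ** q))

template = np.array([1.0, 0.0])
assert simple_unit(np.array([1.0, 0.0]), template) == 1.0   # perfect match
assert simple_unit(np.array([0.0, 1.0]), template) < 0.1    # poor match

# Pooling is dominated by the strongest afferent, wherever it is:
pooled = complex_unit([0.9, 0.2, 0.1])
assert 0.85 < pooled < 0.91
```

Stacking these two unit types in alternating layers is the core of the Hubel-Wiesel family of models: selectivity grows through the simple layers, invariance through the complex layers.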

Gaussian tuning

• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)
• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Max-like operation

• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Prieber & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)

Biophys. implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition (Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)
• Of the same form as a model of MT (Rust et al., Nature Neuroscience, 2007)
• Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002)
• Adelson and Bergen (see also Hassenstein and Reichardt 1956)

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE
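The canonical-circuit idea can be illustrated with one divisive-normalization formula whose exponents select the regime. The form below, y = Σ w·x^p / (k + (Σ x^q)^(p/q)), and the parameter values are illustrative, loosely following the normalization family discussed by Kouh & Poggio (2007); it is a sketch, not the fitted circuit.

```python
import numpy as np

def canonical_circuit(x, w, p, q, k=1e-9):
    """Divisive normalization: the shunting-inhibition pool provides the
    denominator. Different exponents (p, q) and weights w make the same
    circuit behave either max-like or tuning-like."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(w, x ** p) / (k + np.sum(x ** q) ** (p / q)))

# Max-like pooling: uniform weights, p > q, output tracks the largest input.
x = np.array([0.1, 0.2, 0.9])
max_like = canonical_circuit(x, np.ones(3), p=3.0, q=2.0)
assert abs(max_like - 0.9) < 0.1

# Tuning-like: p = 1, q = 2 gives a normalized dot product with template w,
# peaking for inputs aligned with w regardless of their contrast.
w = np.array([1.0, 0.0, 0.0])
on = canonical_circuit(np.array([2.0, 0.0, 0.0]), w, p=1.0, q=2.0)
off = canonical_circuit(np.array([0.0, 2.0, 0.0]), w, p=1.0, q=2.0)
assert on > off
```

The appeal of this formulation is that one biophysically plausible mechanism (a normalizing inhibitory pool) suffices for both of the key computations.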

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation / stronger suppression

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen. Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007; doi:10.1038/nn1975.

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO, bypassing V4 (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – From V4 to TE, bypassing TEO (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data

• V1
  – Simple and complex cells: tuning (Schiller et al. 1976; Hubel & Wiesel 1965; DeValois et al. 1982)
  – MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – Two-spot interaction (Freiwald et al. 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  – Tuning and invariance properties (Logothetis et al. 1995)
  – Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva, Poggio 2007)
  – Face processing, fMRI + psychophysics (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow: data used to constrain the model, i.e. V1 data used to build a population of V1 simple and complex units mimicking the tuning properties of cortical cells. Green: data already predicted by the original model and still consistent with the new model. Red: data consistent with the new model; in some cases these already existed and were provided by the authors of the corresponding studies, in others they were collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to guide you through all of these data, but I will show a few key examples.

Comparison w| V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001): a V4 neuron tuned to boundary conformations vs. the most similar model C2 unit, ρ = 0.78, with no parameter fitting (Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005).

Presenter
Presentation Notes
Finally, I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio. J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007.

Presenter
Presentation Notes
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units. A reference stimulus (fixed) is paired with a probe (varying); the x-axis plots response(probe) - response(reference), the y-axis response(pair) - response(reference). Prediction: the response to the pair falls between the responses elicited by the two stimuli alone. (Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased-competition model predicts that with attention away the line should pass through the origin, with a coefficient > 0.

Agreement w| IT readout data

Hung, Kreiman, Poggio & DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (where they recorded from in this study), object category and identity can already be read out from the very first spikes.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Sequence: image (20 ms); blank interval between image and mask (30 ms ISI); mask of 1/f noise (80 ms).

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with 20 ms presentation, equivalent to an SOA of 50 ms. This is a fairly long SOA. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80% correct. Model: 82% correct.

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job.
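For readers unfamiliar with d′: it is computed from hit and false-alarm rates with the standard normal inverse CDF. A stdlib-only sketch; the example rates (80% hits, 16% false alarms) are illustrative ballpark figures, not the exact experimental values.

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Roughly the regime of the animal / no-animal task:
d = d_prime(0.80, 0.16)
assert 1.5 < d < 2.1  # a d' around 1.8
```

Unlike raw percent correct, d′ separates sensitivity from response bias, which is why it is the natural measure for comparing humans and model on the same task.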

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS, 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is the old AI dream: input an image such as the street scene on the left and have the system automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object             car    pedestrian  bicycle  building  tree   road   sky
Labeled examples   5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO: Add 420 True Labels to this slide.

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI, 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time); complex cells learn correlation in time.

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
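One way to make the temporal-continuity idea concrete is a Foldiak-style trace rule: the postsynaptic activity trace mixes past and present, so inputs that follow each other in time (different views of the same object) wire onto the same unit. This is a toy sketch under invented assumptions (a two-view "object", illustrative learning and decay rates), not the Masquelier & Thorpe implementation.

```python
import numpy as np

def trace_rule_update(w, x, trace, eta=0.05, delta=0.8):
    """One step of a trace rule: Hebbian learning on a temporally
    low-passed version of the unit's own activity."""
    y = float(w @ x)                           # unit response to current input
    trace = delta * trace + (1.0 - delta) * y  # temporal low-pass of activity
    w = w + eta * trace * (x - w)              # Hebb on the trace, with decay
    return w, trace

# Toy sequence: two "views" of one object alternate in time.
view_a = np.array([1.0, 0.0])
view_b = np.array([0.0, 1.0])
w, tr = np.array([0.6, 0.4]), 0.0
for _ in range(200):
    for x in (view_a, view_b):
        w, tr = trace_rule_update(w, x, tr)

# The unit comes to respond to both views: an invariant, complex-like unit.
assert w[0] > 0.2 and w[1] > 0.2
```

The point of the rule is that invariance is learned from the statistics of natural image sequences rather than wired in by hand.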


The problem: training videos and testing videos, each video ~4 s, 50-100 frames.

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset       Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10-20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva, Poggio 2007. IT: Zoccolan, Kouh, Poggio, DiCarlo 2007. V4: Reynolds, Chelazzi & Desimone 1999.

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, model not good enough

Conditions: 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition. Model vs. humans.

(Serre, Oliva and Poggio, PNAS, 2007)

Presenter
Presentation Notes
False-alarm rates: ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

• Model implementation of Wolfe's guided search (1994)
• Parallel (feature-based, top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Example Results

Face detection: scanning vs. attention (detection accuracy as a function of the number of items).

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream; neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity: our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences … (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

Page 11: NIPS2007: visual recognition in primates and machines

Object recognition for computer vision: personal historical perspective

[Timeline figure, ~1990-2004. Tasks covered: face detection, face identification, car detection, pedestrian detection, digit recognition, multi-class/multi-objects. Citations on the timeline: Kanade 1974; Turk & Pentland 1991; Brunelli & Poggio 1993; Sung & Poggio 1994; Beymer & Poggio 1995; Perona and colleagues 1996-now; Osuna & Girosi 1997; Rowley, Baluja & Kanade 1998; Schneiderman & Kanade 1998, 2000; LeCun et al. 1998; Mohan, Papageorgiou & Poggio 1999; Amit and Geman 1999; Viola & Jones 2001; Belongie & Malik 2002; Agarwal & Roth 2002; Ullman et al. 2002; Fergus et al. 2003; Torralba et al. 2004]

Best CVPR'07 paper: 10 yrs ago

… Many more excellent algorithms in the past few years…

Examples: Learning Object Detection: Finding Frontal Faces

• Training Database
  - 1,000+ Real, 3,000+ VIRTUAL
  - 500,000+ Non-Face Pattern

Sung & Poggio 1995

~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)

Object recognition in cortex: Historical perspective

[Timeline figure, 1960-1995, physiology of V1 (cat), V1 (monkey), extrastriate cortex, and IT-STS: Hubel & Wiesel 1959, 1962, 1965, 1977; Gross et al. 1969; Zeki 1973; Ungerleider & Mishkin 1982; Perrett, Rolls et al. 1982; Schwartz et al. 1983; Desimone et al. 1984; Schiller & Lee 1991; Kobatake & Tanaka 1994; Logothetis et al. 1995]

… Much progress in the past 10 yrs

Some personal history: first step in developing a model, learning to recognize 3D objects in IT cortex

Poggio & Edelman 1990

Examples of Visual Stimuli

An idea for a module for view-invariant identification

Architecture that accounts for invariances to 3D effects (>1 view needed to learn)

Regularization Network (GRBF) with Gaussian kernels

[Figure: view-tuned units along the view-angle axis feed a VIEW-INVARIANT, OBJECT-SPECIFIC UNIT]

Prediction: neurons become view-tuned through learning

Poggio & Edelman 1990
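The GRBF module above can be sketched in a few lines: a small set of Gaussian units tuned to stored example views feeds a view-invariant object unit. The stored views, tuning width, and uniform weights below are illustrative choices, not values from Poggio & Edelman.

```python
import numpy as np

# Illustrative sketch of the view-invariant module: Gaussian view-tuned
# units, centered on stored example views, summed by an object unit.
def view_tuned(view, center, sigma=20.0):
    # Gaussian tuning around a stored example view (degrees).
    return np.exp(-(view - center) ** 2 / (2 * sigma ** 2))

def object_unit(view, stored_views):
    # View-invariant unit: (here uniform) sum of the view-tuned units.
    return sum(view_tuned(view, c) for c in stored_views)

stored = [-60.0, 0.0, 60.0]   # example views used for learning
```

With a few stored views, a novel intermediate view (e.g. 30°) still drives the object unit strongly, while a far-off view does not — the behavior the module was designed to predict.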

Learning to Recognize 3D Objects in IT Cortex

Logothetis, Pauls & Poggio 1995

Examples of Visual Stimuli

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units,

… physiology

Recording Sites in Anterior IT

[Figure: recording sites; sulci labeled LUN, LAT, IOS, STS, AMTS; Ho=0]

Logothetis, Pauls & Poggio 1995

…neurons tuned to faces are intermingled nearby…

Neurons tuned to object views as predicted by model

Logothetis, Pauls & Poggio 1995

A "View-Tuned" IT Cell

[Figure: responses (spikes/sec, 800 msec trials) of a view-tuned neuron to target views from -168° to +132° in 12° steps and to 120 distractors; firing peaks (up to ~60 spikes/sec) around the preferred view]

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
VT cell found in the experiment. <show> preferred stimulus: specific viewpoint tuning.

But also view-invariant object-specific neurons (5 of them over 1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale Invariant Responses of an IT Neuron

[Figure: peristimulus time histograms (0-3000 msec, 0-76 spikes/sec) for the same object shown at 8 sizes: 1.0° (x0.4), 1.75° (x0.7), 2.5° (x1.0), 3.25° (x1.3), 4.0° (x1.6), 4.75° (x1.9), 5.5° (x2.2), 6.25° (x2.5)]

View-tuned cells: scale invariance (one training view only) motivates present model

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
pretty amazing

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

Neural Circuits

Source: Modified from Jody Culham's web slides

Neuron basics

spikes

INPUT = pulses or graded potentials

COMPUTATION = Analog

OUTPUT = Chemical

Some numbers

• Human Brain
  - 10^11-10^12 neurons (1 million flies!)
  - 10^14-10^15 synapses
• Neuron
  - Fundamental space dimension:
    fine dendrites: 0.1 µm diameter; lipid bilayer membrane: 5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  - Fundamental time length: 1 msec

The cerebral cortex

                          Human            Macaque
Thickness                 3-4 mm           1-2 mm
Total surface area        ~1600 cm2        ~160 cm2
(both sides)              (~50 cm diam)    (~15 cm diam)
Neurons/mm2               ~10^5            ~10^5
Total cortical neurons    ~2 x 10^10       ~2 x 10^9
Visual cortex             300-500 cm2      80+ cm2
Visual neurons            ~4 x 10^9        ~10^9

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensation, motor… Learning & adaptation everywhere. --> Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? --> common hardware: CORTEX.

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system --- the most extensively researched modality. Areas: connected.

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: dot of light, bright dot on a dark bgd and vice-versa. From dots of light to "oriented bars": slightly larger RFs, either white bar on black bgd or vice-versa, sensitive to the location of the bar within the RF. Bar detector insensitive to polarity of contrast (white on black or black on white); larger RF; insensitive to the location of the bar within the RF. How does that work?

V1 Orientation selectivity

Hubel & Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]. Reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification

categorization

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)

• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                        Pooling operation
Simple      Selectivity / template matching    Gaussian-like tuning (and-like)
Complex     Invariance                         Soft-max (or-like)

[Figure: simple units perform a Gaussian-like tuning operation (and-like); complex units perform a max-like operation (or-like)]
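A minimal sketch of the two operations, with assumed functional forms in the spirit of the model: simple units compute a Gaussian-like template match, complex units max-pool their afferents.

```python
import numpy as np

# Sketch of the model's two canonical operations; the exact functional
# forms below (Gaussian tuning, hard max) are idealizations.
def s_unit(x, template, sigma=1.0):
    # Simple ("S") unit: Gaussian tuning, peaks when input matches template.
    return np.exp(-np.sum((x - template) ** 2) / (2 * sigma ** 2))

def c_unit(afferent_responses):
    # Complex ("C") unit: max over S units tuned to the same feature at
    # nearby positions/scales -> keeps selectivity, gains invariance.
    return max(afferent_responses)
```

Alternating these two steps up the hierarchy trades off selectivity (S layers) against position/scale tolerance (C layers).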

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys. implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  - Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
  - Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

[Figure: learned S2 weights; stronger facilitation vs. stronger suppression]

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

• V1
  - Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; DeValois et al. 1982)
  - MAX operation in subset of complex cells (Lampl et al. 2004)
• V4
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  - Rapid categorization (Serre, Oliva, Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model; those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model. These are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (no parameter fitting)

Pasupathy & Connor 1999

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.)… Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed); probe (varying)

[Figure: resp(pair) - resp(reference), plotted against response(probe) - response(reference)]

Prediction: response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away the line should pass through the origin and should get a > 0 coefficient.
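The prediction above follows directly from max pooling: the pair response equals the larger single-stimulus response, so it is always bounded by the two individual responses. A toy check (the response values are made up):

```python
import numpy as np

# If a C2 unit max-pools its afferents, its response to a reference+probe
# pair is the larger of the two single-stimulus responses, hence always
# between them -- the prediction compared against Reynolds et al. 1999.
def pair_response(r_probe, r_reference):
    return np.maximum(r_probe, r_reference)
```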

Agreement w| IT Readout data

Hung, Kreiman, Poggio, DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: for instance the read out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study. From the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
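The linear-classifier stage can be sketched as a least-squares readout from a simulated population, in the spirit of the read-out experiments. The population size, noise level, and ridge term below are hypothetical, not taken from the data.

```python
import numpy as np

# Hypothetical sketch: linear readout of object category from a simulated
# "IT population" (sizes and noise are illustrative).
def readout_accuracy(n_units=64, n_per_class=100, noise=0.5, seed=0):
    rng = np.random.default_rng(seed)
    proto_a = rng.normal(size=n_units)      # mean population response, category A
    proto_b = rng.normal(size=n_units)      # mean population response, category B
    X = np.vstack([proto_a + noise * rng.normal(size=(n_per_class, n_units)),
                   proto_b + noise * rng.normal(size=(n_per_class, n_units))])
    y = np.array([1.0] * n_per_class + [-1.0] * n_per_class)
    # Ridge-regularized least-squares linear classifier.
    w = np.linalg.solve(X.T @ X + 1e-2 * np.eye(n_units), X.T @ y)
    return np.mean(np.sign(X @ w) == y)
```

With a selective, invariant feature dictionary as input, even this simple linear stage separates the categories almost perfectly.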

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

[Figure: 20 ms image, 30 ms interval (ISI), then 1/f noise mask for 80 ms]

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms. This is a fairly long SOA. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80% correct. Model: 82%.

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
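For reference, d′ can be computed from hit and false-alarm rates via inverse-normal transforms; the rates in the example below are made up, not the experiment's numbers.

```python
from statistics import NormalDist

# d' (sensitivity): distance between signal and noise distributions in
# z units; unlike raw percent correct, it corrects for response bias.
def d_prime(hit_rate, fa_rate):
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

For example, hit rate 0.8 with false-alarm rate 0.2 gives d′ ≈ 1.68.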

Further comparisons

• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene image on the left, and you would like the system to automatically parse the image into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled Examples:  5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  - Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

[SHOW MOVIE]

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
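The temporal-continuity idea can be sketched with a Foldiak-style trace rule: the Hebbian update is gated by a decaying trace of the unit's recent activity, so temporally adjacent inputs (successive views of one object) come to drive the same unit. All parameters below are illustrative, not those of the Masquelier & Thorpe work.

```python
import numpy as np

# Hypothetical sketch of a Foldiak (1991)-style trace rule: the weight
# update is gated by a temporal trace of postsynaptic activity, linking
# inputs that occur close together in time.
def trace_learning(sequence, eta=0.5, delta=0.8, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n = sequence.shape[1]
    w = rng.uniform(0.1, 0.2, n)             # small random initial weights
    for _ in range(epochs):
        trace = 0.0
        for x in sequence:
            y = w @ x                               # postsynaptic response
            trace = delta * trace + (1 - delta) * y # decaying activity trace
            w += eta * trace * x                    # trace-gated Hebbian update
            w /= max(np.linalg.norm(w), 1.0)        # keep weights bounded
    return w

# Temporally contiguous "views": positions 0..3 of a 6-d input space;
# positions 4 and 5 are never shown.
views = np.eye(6)[:4]
```

After training on the sequence, the unit responds to all temporally contiguous views but not to inputs it never saw in the sequence, i.e. it has learned an invariant response from temporal continuity.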

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses: towards the neural code

The problem

Training videos / testing videos: each video ~4 s, 50~100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005). See also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human       81.3%      91.6%
UCSD Mice       75.6%      79.0%
Weiz. Human     86.7%      96.3%
Average         81.2%      89.6%

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average.

What is next: beyond feedforward models: limitations

Psychophysics: Serre, Oliva, Poggio 2007
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Figure: model vs. human performance at 20 ms SOA (ISI=0 ms), 50 ms SOA (ISI=30 ms), 80 ms SOA (ISI=60 ms), and the no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, model not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and the model ~18%.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep

[Figure: face detection, scanning vs attention: detection accuracy vs. number of items]

Example Results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models: e.g., probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003
The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

T. Serre

Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 12: NIPS2007: visual recognition in primates and machines

Examples Learning Object Detection Finding Frontal Faces

bull

Training Databasebull

1000+ Real 3000+ VIRTUAL

bull

500000+ Non-Face Pattern

Sung amp Poggio 1995

~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now

becoming a product (MobilEye)

Object recognition in cortex Historical perspective

198019701960

V1 catV1 monkeyExstrastriate

cortexIT-STS

Schille

r amp Le

e 199

1

Gross

et al

1969

1990

Hubel

amp Wies

el 19

62

Hubel

amp Wies

el 19

65

Hubel

amp Wies

el 19

59

Hubel

amp Wies

el 19

77

Kobata

keamp T

anak

a 199

4

Unger

leide

r amp M

Ishkin

1982

Per

rett R

olls e

t al 1

982

Zeki

1973

Schwar

tz et

al 19

83

Desim

one e

t al 1

984

Logo

thetis

et al

1995

hellip Much progress in the past 10 yrs

Some personal history First step in developing a model

learning to recognize 3D objects in IT cortex

Poggio amp Edelman 1990

Examples of Visual Stimuli

An idea for a module for view-invariant identification

Architecture that accounts for invariances to 3D effects (gt1 view needed to learn)

Regularization Network (GRBF)with Gaussian kernels

View Angle

VIEW- INVARIANT

OBJECT- SPECIFIC

UNIT

Prediction neurons become

view-tuned through learning

Poggio amp Edelman 1990

Learning to Recognize 3D Objects in IT Cortex

Logothetis

Pauls

amp Poggio 1995

Examples of Visual Stimuli

After human psychophysics (Buelthoff Edelman Tarr Sinha hellip) which supports models based on view-tuned units

hellip physiology

Recording Sites in Anterior IT

LUNLAT

IOS

STS

AMTSLATSTS

AMTS

Ho=0

Logothetis Pauls

amp Poggio 1995

hellipneurons tuned to faces are intermingled

nearbyhellip

Neurons tuned to object views, as predicted by the model

Logothetis, Pauls & Poggio 1995

A "View-Tuned" IT Cell

[Figure: responses (up to 60 spikes/sec, 800-msec window) to target views from -168° to +168° in 12° steps, and to distractor objects]

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
View-tuned cell found in the experiment; stimulus-specific viewpoint tuning.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale-Invariant Responses of an IT Neuron

[Figure: PSTHs (0-76 spikes/sec, 0-3000 msec) for stimulus sizes 1.0° (×0.4), 1.75° (×0.7), 2.5° (×1.0), 3.25° (×1.3), 4.0° (×1.6), 4.75° (×1.9), 5.5° (×2.2), 6.25° (×2.5)]

View-tuned cells: scale invariance (one training view only) motivates the present model

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
Pretty amazing.

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva & Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

spikes

INPUT = pulses or graded potentials
COMPUTATION = analog
OUTPUT = chemical

Some numbers

• Human brain
  – 10¹¹-10¹² neurons (1 million flies!)
  – 10¹⁴-10¹⁵ synapses
• Neuron
  – Fundamental space dimension: fine dendrites ~0.1 µm diameter; lipid bilayer membrane ~5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – Fundamental time length: 1 msec

The cerebral cortex

                              Human                       Macaque
Thickness                     3-4 mm                      1-2 mm
Total surface area            ~1600 cm² (~50 cm diam)     ~160 cm² (~15 cm diam)
  (both sides)
Neurons / mm²                 ~10⁵                        ~10⁵
Total cortical neurons        ~2 × 10¹⁰                   ~2 × 10⁹
Visual cortex                 300-500 cm²                 80+ cm²
Visual neurons                ~4 × 10⁹                    ~10⁹

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision

Presenter
Presentation Notes
Different brain regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation everywhere. Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? Common hardware: CORTEX.

The Visual System

[Van Essen & Anderson, 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system, the most extensively researched modality. Areas and their connections.

V1: hierarchy of simple and complex cells

LGN-type cells → Simple cells → Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: a dot of light, a bright dot on a dark background and vice versa. From dots of light to "oriented bars": slightly larger RFs, either a white bar on a black background or vice versa, sensitive to the location of the bar within the RF. Complex cells: bar detectors insensitive to the polarity of contrast (white on black or black on white), with larger RFs and insensitive to the location of the bar within the RF. How does that work?

V1: Orientation selectivity

Hubel & Wiesel movie

V1: Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]; reproduced from [Rolls 2004]

Beyond V1: a gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: a gradual increase in the complexity of the preferred stimulus

AIT: Face cells

Reproduced from (Desimone et al. 1984)

AIT: Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification / categorization

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider the feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                        Pooling operation
Simple      Selectivity / template matching    Gaussian-like tuning (and-like)
Complex     Invariance                         Soft-max (or-like)

Simple units: Gaussian-like tuning operation (and-like). Complex units: max-like operation (or-like).
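The two operations, written out as a minimal sketch (tuning width and softmax exponent are illustrative parameters, and `x` stands in for a vector of afferent responses):

```python
import numpy as np

def simple_unit(x, w, sigma=1.0):
    # AND-like selectivity: Gaussian tuning around a stored template w
    return np.exp(-np.sum((x - w) ** 2) / (2.0 * sigma ** 2))

def complex_unit(x, q=8.0):
    # OR-like invariance: soft-max over afferents; large q approaches a hard max
    xq = x ** q
    return np.sum(x * xq) / (np.sum(xq) + 1e-12)

afferents = np.array([0.1, 0.9, 0.2])
print(round(complex_unit(afferents), 2))   # close to max(afferents) = 0.9
```

The simple unit fires only when the afferent pattern matches its template (conjunction); the complex unit inherits the strongest afferent response regardless of which afferent carried it (disjunction), which is where invariance comes from.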

Gaussian tuning

• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophysical implementation

• Max- and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
Add cable model and cite relevant people.

Of the same form as the model of MT (Rust et al., Nature Neuroscience, 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
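One way to see how a single circuit can yield both behaviors is the divisive-normalization form used in this line of work (cf. Kouh & Poggio): a weighted sum divided by a shunting-inhibition pool, with the operation selected by the exponents. The exponents below are illustrative choices, not fitted values.

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-9):
    # Weighted sum of afferents divided by a normalization (shunting) pool
    return np.dot(w, x ** p) / (k + np.sum(x ** q) ** r)

x = np.array([0.2, 1.0, 0.3])

# Max-like regime (p = q + 1, r = 1): output approaches max(x)
y_max = canonical_circuit(x, np.ones(3), p=5, q=4, r=1)

# Tuning regime (p = 1, q = 2, r = 0.5): normalized dot product,
# maximal when x points in the template direction w
y_tuned = canonical_circuit(x, x / np.linalg.norm(x), p=1, q=2, r=0.5)

print(y_max > 0.95, abs(y_tuned - 1.0) < 1e-6)
```

Same wiring, two regimes: that is the sense in which one canonical circuit can implement both key computations.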

• A generic, overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1,000 types)
• Combinations of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1,000)

[Figure legend: stronger facilitation / stronger suppression]

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units, but increased tolerance to the position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project, so the only difference between S2 and C2 units is the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.
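A toy sketch of the S2/C2 step, assuming C1 responses arrive as a (position × feature) array; the array sizes, tuning width, and pooling range are placeholders, not the model's actual parameters:

```python
import numpy as np

def s2_responses(c1, template, sigma=0.5):
    # S2: Gaussian template matching against C1 afferents at every position
    d2 = ((c1 - template) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def c2_response(s2):
    # C2: max over positions -> same selectivity, more invariance
    return s2.max()

rng = np.random.default_rng(1)
c1 = rng.random((16, 4))     # 16 positions x 4 orientation channels
template = c1[5].copy()      # a template "learned" from an image patch
s2 = s2_responses(c1, template)
print(c2_response(s2) == 1.0, c2_response(np.roll(s2, 3)) == 1.0)
```

Shifting the S2 activity (a stand-in for shifting the stimulus) leaves the C2 response unchanged, which is exactly the tolerance the slide describes.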

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data

• V1
  – Simple and complex cells: tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  – MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – Two-spot interaction (Freiwald et al. 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  – Tuning and invariance properties (Logothetis et al. 1995)
  – Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva, Poggio 2007)
  – Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in the corresponding cortical areas. There are three colors for three types of data. Yellow is for data used to constrain the model: V1 data used to build a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Green data were already predicted by the original model and are still consistent with the new model. Red indicates data consistent with the new model: in some cases these already existed and were provided by the authors of the corresponding studies; others (e.g. with Miller, DiCarlo) were collected in collaboration with experimentalists, following a prediction from the model. I am not going to guide you through all of these data, but I will show a few key examples.

Comparison w| V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

A V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78, with no parameter fitting

Pasupathy & Connor 1999
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally, I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. model C2 units

Reference stimulus (fixed); probe (varying). Plot: resp(pair) - resp(reference) against response(probe) - response(reference).

Prediction: the response to the pair falls between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased-competition model predicts that with attention away the line should pass through the origin, with a coefficient > 0.

Agreement w| IT Readout data

Hung, Kreiman, Poggio, DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (where they recorded in this study), object category and identity can already be read out from the very first spikes.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
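The read-out idea in miniature: train a linear classifier on a population of feature responses. A hedged sketch with synthetic "IT-like" features standing in for real recordings (population size, noise level, and the ridge regularizer are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "population responses": two categories, 128 units, noisy class means
means = rng.normal(size=(2, 128))
X = np.vstack([m + 0.5 * rng.normal(size=(200, 128)) for m in means])
y = np.repeat([-1.0, 1.0], 200)

# Regularized least-squares linear classifier (a one-layer readout)
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(128), X.T @ y)

acc = np.mean(np.sign(X @ w) == y)
print(acc > 0.95)
```

The point is architectural, not algorithmic: if the dictionary feeding IT is selective and invariant enough, a plain linear stage on top suffices for categorization, as in the read-out experiments.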

Rapid categorization

[Show animal / non-animal movie]

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with a mask to test the feedforward model)

Animal present or not?

Image: 20 ms; interval image-mask (ISI): 30 ms; mask (1/f noise): 80 ms

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with a 20 ms presentation, equivalent to an SOA of 50 ms. This is a fairly long SOA. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when the mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.

• d' ~ standardized error rate
• the higher the d', the better the performance

Serre, Oliva & Poggio 2007

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job.
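For reference, d' is computed from hit and false-alarm rates via the inverse normal CDF; the rates below are illustrative round numbers in the range mentioned on these slides, not the paper's exact figures:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    # d' = z(hit) - z(false alarm); higher d' = better discrimination
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

print(round(d_prime(0.80, 0.16), 2))   # ~1.84
```

Unlike raw percent correct, d' separates sensitivity from response bias, which is why it is the measure quoted here.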

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• The model predicts the level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is the old AI dream: input an image such as the street scene on the left and have the system automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object             car    pedestrian   bicycle   building   tree   road   sky
Labeled examples   5799   1449         209       5067       4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO: add 420 true labels to this slide.

Examples

Examples

Examples

Examples

Benchmarks:
• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlations in space (at the same time); complex cells learn correlations in time

[Show movie]

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
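The temporal-continuity idea is often formalized as a "trace rule" (Foldiak 1991): the weight update is gated by a temporally smoothed trace of the unit's activity, so inputs that occur close together in time get wired onto the same unit. A toy sketch; the learning rate, trace constant, and stimulus are arbitrary:

```python
import numpy as np

def trace_rule(frames, eta=0.2, decay=0.8):
    # Complex-cell learning: the Hebbian update is gated by a low-pass
    # "trace" of recent activity, tying together patterns seen close in
    # time (e.g. a stimulus drifting across the retina).
    w = np.full(frames.shape[1], 0.2)            # small positive initial weights
    trace = 0.0
    for x in frames:
        y = float(w @ x)                         # unit activity
        trace = decay * trace + (1 - decay) * y  # temporal trace of activity
        w += eta * trace * (x - w)               # trace-gated update
    return w

# The same 2-pixel pattern drifting across a 6-pixel "retina", replayed 50 times
sweep = np.vstack([np.roll([1.0, 1.0, 0, 0, 0, 0], t) for t in range(5)])
w = trace_rule(np.tile(sweep, (50, 1)))
print(np.all(w > 0.05))   # the unit ends up connected to every traversed position
```

Because the trace bridges successive frames, one unit acquires weights at all positions the pattern visited: position invariance learned from time, with no labels.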

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos → testing videos; each video ~4 s, 50-100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz Human    86.7       96.3
Average       81.2       89.6

(chance: 10-20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva, Poggio 2007
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Conditions: 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms SOA the model is not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no-mask, with 14%) and the model ~18%.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based, top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention [plot: detection accuracy vs. number of items]

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• A feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of the routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.
The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: mathematical results on unsupervised learning of invariances and of a dictionary of shapes from image sequences …

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

T. Serre

Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 13: NIPS2007: visual recognition in primates and machines

~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now

becoming a product (MobilEye)

Object recognition in cortex Historical perspective

198019701960

V1 catV1 monkeyExstrastriate

cortexIT-STS

Schille

r amp Le

e 199

1

Gross

et al

1969

1990

Hubel

amp Wies

el 19

62

Hubel

amp Wies

el 19

65

Hubel

amp Wies

el 19

59

Hubel

amp Wies

el 19

77

Kobata

keamp T

anak

a 199

4

Unger

leide

r amp M

Ishkin

1982

Per

rett R

olls e

t al 1

982

Zeki

1973

Schwar

tz et

al 19

83

Desim

one e

t al 1

984

Logo

thetis

et al

1995

hellip Much progress in the past 10 yrs

Some personal history First step in developing a model

learning to recognize 3D objects in IT cortex

Poggio amp Edelman 1990

Examples of Visual Stimuli

An idea for a module for view-invariant identification

Architecture that accounts for invariances to 3D effects (gt1 view needed to learn)

Regularization Network (GRBF)with Gaussian kernels

View Angle

VIEW- INVARIANT

OBJECT- SPECIFIC

UNIT

Prediction neurons become

view-tuned through learning

Poggio amp Edelman 1990

Learning to Recognize 3D Objects in IT Cortex

Logothetis

Pauls

amp Poggio 1995

Examples of Visual Stimuli

After human psychophysics (Buelthoff Edelman Tarr Sinha hellip) which supports models based on view-tuned units

hellip physiology

Recording Sites in Anterior IT

LUNLAT

IOS

STS

AMTSLATSTS

AMTS

Ho=0

Logothetis Pauls

amp Poggio 1995

hellipneurons tuned to faces are intermingled

nearbyhellip

Neurons tuned to object views as predicted by model

Logothetis

Pauls

amp Poggio 1995

A ldquoView-Tunedrdquo IT Cell

12 7224 8448 10860 12036 96

12 24 36 48 60 72 84 96 108 120 132 168o o o o o o o o o o o o

-108 -96 -84 -72 -60 -48 -36 -24 -12 0-168 -120

Distractors

Target Views60

spi

kes

sec

800 msec

-108 -96 -84 -72 -60 -48 -36 -24 -12 0-168 -120 oo o o o o o o o oo o

Logothetis

Pauls

amp Poggio 1995

Presenter
Presentation Notes
VT cell found in the experiment13ltshowgtpref StimspecificVP tuning

But also view-invariant object-specific neurons (5 of them over 1000 recordings)

Logothetis

Pauls

amp Poggio 1995

Scale Invariant Responses of an IT Neuron

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Sik

0

76

0 2000 3000Time (msec)

1000

Sik

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

ik

0

76

40 deg(x 16)

475 deg(x 19)

55 deg(x 22)

625 deg(x 25)

25 deg(x 10)

10 deg(x 04)

175 deg(x 07)

325 deg(x 13)

View-tuned cells scale invariance (one training view only) motivates present model

Logothetis

Pauls

amp Poggio 1995

Presenter
Presentation Notes
pretty amazing

From ldquoHMAXrdquo

to the model nowhellip

Riesenhuber amp Poggio 1999 2000 Serre Kouh

Cadieu

Knoblich

Kreiman amp Poggio 2005 Serre Oliva Poggio 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Neural Circuits

Source Modified from Jody Culhamrsquos

web slides

Neuron basics

spikes

INPUT= pulses or graded potentials

COMPUTATION = Analog

OUTPUT = Chemical

Some numbers

bull

Human Brainndash

1011-1012

neurons (1 million flies )

ndash

1014- 1015

synapses

bull

Neuronndash

Fundamental space dimension

bull

fine dendrites 01 micro

diameter lipid bilayer

membrane 5 nm thick specific proteins pumps channels receptors enzymes

ndash

Fundamental time length 1 msec

The cerebral cortex

Human

MacaqueThickness

3 ndash 4 mm

1 ndash 2 mmTotal surface area

~1600 cm2 ~160 cm2

(both sides)

(~50cm diam)

(~15cm diam)

Neurons mmsup2

~10⁵ mm2 ~ 10⁵ mm2 Total cortical neurons

~2 x 1010

~2 x 109

Visual cortex

300 ndash

500 cm2

80+cm2Visual Neurons

~4 x 109 ~109 neurons

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In brain different regions are involved in different tasks13(and the brain areas are conveniently colored to reflect this toohellip)13vision audition somsens motorhellip Learning amp adaptation everywhere13---gt Q What are the compmech13Why can we hope that there are common computational mech13---gt common hardware CORTEX

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning amp informations processing in the brain we focus on the visual system --- most extensively researched modality1313areas connected

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: dot of light; bright dot on a dark background and vice-versa. From dots of light to "oriented bars": slightly larger RFs; either white bar on black background or vice-versa; sensitive to the location of the bar within the RF. Bar detector insensitive to polarity of contrast (white on black or black on white); larger RF; insensitive to the location of the bar within the RF. How does that work?

V1: orientation selectivity

Hubel & Wiesel movie

V1: retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]; reproduced from [Rolls 2004]

Beyond V1: a gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: a gradual increase in the complexity of the preferred stimulus

AIT: face cells

Reproduced from (Desimone et al 1984)

AIT: immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification

categorization

See also Oram & Perrett 1992; Tovee et al 1993; Celebrini et al 1993; Ringach et al 1997; Rolls et al 1999; Keysers et al 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al 1998; Amit & Mascaro 2003; Deco & Rolls 2006…)

• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                        Pooling operation
Simple      Selectivity / template matching    Gaussian-like tuning (and-like)
Complex     Invariance                         Soft-max (or-like)

Simple units: Gaussian-like tuning operation (and-like)
Complex units: max-like operation (or-like)
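The two operations can be written down in a few lines; the sketch below is illustrative (the function names, the soft-max exponent, and the tuning width are assumptions, not the model's fitted values):

```python
import math

def gaussian_tuning(x, template, sigma=0.5):
    # Simple-unit, "and-like" operation: the response decays with the
    # Euclidean distance between the input pattern and a stored template,
    # peaking at 1.0 for a perfect match.
    d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, template))
    return math.exp(-d2 / (2 * sigma ** 2))

def softmax_pool(x, q=4.0):
    # Complex-unit, "or-like" operation: a soft maximum over afferents.
    # Large q approaches a hard max; small q approaches the mean.
    den = sum(v ** q for v in x)
    return sum(v ** (q + 1) for v in x) / den if den else 0.0

match = gaussian_tuning([1.0, 0.0], [1.0, 0.0])   # perfect match -> 1.0
pooled = softmax_pool([0.1, 0.9, 0.3])            # dominated by the strongest afferent
```

The and-like stage builds selectivity (a conjunction of subunit activities); the or-like stage buys invariance (a disjunction over transformed copies).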

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis, Pauls & Poggio 1995

Gaussian tuning in V1 for orientation

Hubel & Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007

Gawne & Martin 2002

Max-like behavior in V4

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys. implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as the model of MT (Rust et al, Nature Neuroscience 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)
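A steady-state sketch of such a divisive canonical circuit is below; the exponents and the normalization constant are illustrative choices, not the fitted form (see Kouh & Poggio 2007 for the actual circuit):

```python
def canonical_circuit(x, w, p=1.0, q=2.0, k=0.01):
    # A weighted excitatory drive divided by the pooled activity of the
    # afferents: a steady-state description of shunting inhibition.
    # Roughly, p = 1, q = 2 gives normalized-dot-product (tuning-like)
    # behavior, while raising p and q pushes the output toward a max.
    num = sum(wi * xi ** p for wi, xi in zip(w, x))
    den = k + sum(xi ** q for xi in x) ** 0.5
    return num / den

r = canonical_circuit([0.6, 0.8], [0.6, 0.8])  # input aligned with the weights
```

The appeal of this form is that one circuit motif, with different parameter settings, can approximate both of the model's key operations.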

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5–10 subunits chosen at random from all possible afferents (~100–1000)

stronger facilitation / stronger suppression
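The imprinting idea (storing the responses of a few randomly chosen C1 afferents to a natural-image patch as the prototype w) can be sketched as follows; all names and numbers are illustrative, not the model's actual parameters:

```python
import math, random

def imprint_s2(c1_responses, n_subunits=5, seed=0):
    # Developmental-like learning: pick a few C1 afferents at random and
    # store their responses to a natural-image patch as the prototype w.
    rng = random.Random(seed)
    idx = rng.sample(range(len(c1_responses)), n_subunits)
    return idx, [c1_responses[i] for i in idx]

def s2_response(c1_responses, idx, w, sigma=1.0):
    # Gaussian-like tuning of the selected subunits to the stored prototype.
    d2 = sum((c1_responses[i] - wi) ** 2 for i, wi in zip(idx, w))
    return math.exp(-d2 / (2 * sigma ** 2))

rng = random.Random(1)
patch = [rng.random() for _ in range(100)]   # fake C1 activity for one patch
idx, w = imprint_s2(patch)
peak = s2_response(patch, idx, w)            # the imprinted patch itself -> 1.0
```

After imprinting, the unit responds maximally to the patch it was exposed to and more weakly to anything else, which is the sense in which the dictionary is learned rather than designed.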

Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4?

Presenter
Presentation Notes
Obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.
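The S2-to-C2 pooling step described above can be sketched as a hard max over responses of one prototype at neighboring positions and scales (the response values below are toy numbers):

```python
def c2_pool(s2_responses):
    # C2 sketch: max over S2 units that share a prototype but sit at
    # slightly different positions/scales, so the preferred stimulus is
    # detected wherever it appears within the pooling range, with the
    # selectivity of the underlying S2 prototype unchanged.
    return max(s2_responses)

# hypothetical responses of one prototype at 4 positions x 2 scales
responses = [0.10, 0.15, 0.92, 0.20, 0.05, 0.12, 0.40, 0.08]
c2 = c2_pool(responses)   # the best-matching position/scale wins
```

This is why an S2 and a C2 unit with the same prototype differ only in their invariance range, which is exactly what makes them hard to tell apart experimentally.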

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w/ neural data

• V1
  – Simple and complex cells: tuning (Schiller et al 1976; Hubel & Wiesel 1965; De Valois et al 1982)
  – MAX operation in a subset of complex cells (Lampl et al 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al 2002)
  – Two-spot interaction (Freiwald et al 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)
• IT
  – Tuning and invariance properties (Logothetis et al 1995)
  – Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva, Poggio 2007)
  – Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model; those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w/ V4

Pasupathy & Connor 2001

Tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations

ρ = 0.78

Most similar model C2 unit

Pasupathy & Connor 1999

No parameter fitting

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4; explain figures etc… Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J Neurophysiol 98: 1733–1750, 2007. First published June 27, 2007.

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al 1999) vs. C2 units

Reference (fixed); probe (varying)

Δ = response(probe) – response(reference)
Δ = response(pair) – response(reference)

Prediction: the response to the pair should fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away the line should pass through the origin and should get a > 0 coefficient.

Agreement w/ IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier – like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
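A minimal sketch of that supervised stage as a Gaussian RBF network whose sign gives the binary label; the centers, coefficients and σ below are toy values invented for illustration, not fitted ones:

```python
import math

def rbf_readout(x, centers, coeffs, sigma=1.0):
    # Weighted sum of Gaussian units centered on training examples; the
    # sign of the output gives the binary label, much like the linear
    # read-out classifiers used experimentally.
    def k(a, b):
        d2 = sum((u - v) ** 2 for u, v in zip(a, b))
        return math.exp(-d2 / (2 * sigma ** 2))
    return sum(c * k(x, ctr) for ctr, c in zip(centers, coeffs))

centers = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy IT-like features
coeffs = [1, 1, -1, -1]                                     # +1 = target class
score = rbf_readout([0.8, 0.2], centers, coeffs)
label = 1 if score > 0 else -1
```

The point of the remark is that because the IT dictionary is already selective and invariant, this final stage can stay simple and still generalize well.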

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms; interval image-mask (ISI): 30 ms; mask (1/f noise): 80 ms

Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%; model: 82%

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Human: 80%
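For readers unfamiliar with d′: it combines hit and false-alarm rates into a single criterion-free sensitivity number. A quick sketch follows (the 80% / 16% rates are illustrative, not the paper's exact figures):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    # d' = z(hit rate) - z(false-alarm rate), where z is the inverse of
    # the standard normal CDF; higher d' means better discriminability
    # regardless of response bias, which is why it "standardizes" error rates.
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

d = d_prime(0.80, 0.16)   # roughly 1.8 for these example rates
```

Using d′ rather than raw percent correct is what allows a fair human-vs-model comparison when the two may adopt different response criteria.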

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

• Image-by-image correlation
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse it into its main objects. What they did is use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.
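The window-scanning step described in the note can be sketched as follows; window size, stride and image dimensions are made up for illustration, and the real system also scans over scales:

```python
def sliding_windows(width, height, win=64, stride=32):
    # Enumerate the square crops over which model features would be
    # computed and then classified into one of the object categories.
    return [(x, y, win, win)
            for y in range(0, height - win + 1, stride)
            for x in range(0, width - win + 1, stride)]

windows = sliding_windows(128, 96)   # 3 x-offsets times 2 y-offsets
```

Each window's feature vector is fed to a per-category classifier, so the scene parse is just the union of per-window decisions.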

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al 2004)
• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

[SHOW MOVIE]

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
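One classic way to formalize learning from temporal continuity is a trace rule in the spirit of Foldiak 1991; the sketch below uses made-up learning-rate and decay constants and a toy two-position input:

```python
def trace_update(w, x, trace, eta=0.05, decay=0.9):
    # The unit keeps a low-pass "trace" of its recent activity; inputs
    # arriving while the trace is high are potentiated, so features that
    # follow each other in time (the same object, shifted) converge onto
    # one complex unit.
    y = sum(wi * xi for wi, xi in zip(w, x))
    trace = decay * trace + (1 - decay) * y
    w = [wi + eta * trace * xi for wi, xi in zip(w, x)]
    return w, trace

w, tr = [0.5, 0.5], 0.0
for frame in ([1, 0], [0, 1], [1, 0]):   # an edge drifting across two positions
    w, tr = trace_update(w, frame, tr)
```

Because both positions are active while the trace is high, both weights grow, which is the mechanism by which complex cells could learn correlation in time.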

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50~100 frames

bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al 2005)

See also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average.

What is next? Beyond feedforward models: limitations

Serre, Oliva, Poggio 2007 (psychophysics)

Zoccolan, Kouh, Poggio, DiCarlo 2007 (IT)

Reynolds, Chelazzi & Desimone 1999 (V4)

What is next? Beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Figure: model vs. human performance across masking conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms the model is not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and the model ~18%.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Figure: detection accuracy vs. number of items]

Face detection: scanning vs. attention

Example results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences …

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe


Presenter
Presentation Notes
pretty amazing

From ldquoHMAXrdquo

to the model nowhellip

Riesenhuber amp Poggio 1999 2000 Serre Kouh

Cadieu

Knoblich

Kreiman amp Poggio 2005 Serre Oliva Poggio 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Neural Circuits

Source Modified from Jody Culhamrsquos

web slides

Neuron basics

spikes

INPUT= pulses or graded potentials

COMPUTATION = Analog

OUTPUT = Chemical

Some numbers

bull

Human Brainndash

1011-1012

neurons (1 million flies )

ndash

1014- 1015

synapses

bull

Neuronndash

Fundamental space dimension

bull

fine dendrites 01 micro

diameter lipid bilayer

membrane 5 nm thick specific proteins pumps channels receptors enzymes

ndash

Fundamental time length 1 msec

The cerebral cortex

Human

MacaqueThickness

3 ndash 4 mm

1 ndash 2 mmTotal surface area

~1600 cm2 ~160 cm2

(both sides)

(~50cm diam)

(~15cm diam)

Neurons mmsup2

~10⁵ mm2 ~ 10⁵ mm2 Total cortical neurons

~2 x 1010

~2 x 109

Visual cortex

300 ndash

500 cm2

80+cm2Visual Neurons

~4 x 109 ~109 neurons

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In brain different regions are involved in different tasks13(and the brain areas are conveniently colored to reflect this toohellip)13vision audition somsens motorhellip Learning amp adaptation everywhere13---gt Q What are the compmech13Why can we hope that there are common computational mech13---gt common hardware CORTEX

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning amp informations processing in the brain we focus on the visual system --- most extensively researched modality1313areas connected

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel amp Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells dot of light bright dot on a dark bgd and vice-versa1313From dots of light to ldquooriented barsrdquo slightly larger RFs either white bar on black bgd or vice-versa sensitive to the location of the bar within RF1313Bar detector insensitive to polarity of contrast (white on black or black on white)13Larger RF13Insensitive to the location of the bar within RF1313How does that work

V1 Orientation selectivity

Hubel amp Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake amp Tanaka 1994] Reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake

amp Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung Kreiman Poggio amp DiCarlo 2005

identification

categorization

See also Oram

amp Perrett

1992 Tovee

et al 1993 Celebrini

et al 1993 Ringach

et al 1997 Rolls et al 1999 Keysers

et al 2001

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophysical implementation:
- Max- and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition.
- Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007).
- Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini & Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller & Troyer 2002).
- See also Adelson & Bergen, and Hassenstein & Reichardt 1956.
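As a sketch of the "same canonical circuit" point (the exponent settings below are illustrative assumptions, following the normalization form discussed by Kouh & Poggio 2007): one divisive-normalization equation can approximate either operation depending on its parameters.

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-9):
    """One normalization circuit, two regimes:
    - p=1, q=2, r=0.5, tuned weights w: normalized dot product,
      i.e. Gaussian-like tuning around the template direction;
    - p=q+1, r=1, uniform weights: soft-max pooling."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(w, x ** p) / (k + np.sum(x ** q) ** r))
```

The denominator plays the role of the shunting (divisive) inhibition; the exponents select whether the circuit behaves as a tuning or a pooling unit.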

- Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  - Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
- Task-specific circuits (from IT to PFC)
  - Supervised learning ~ Gaussian RBF
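A minimal sketch of such a task-specific circuit, assuming C2-like feature vectors as input (the function name, kernel width and ridge term are illustrative): a Gaussian RBF network with centers at the training examples and linear weights fit by regularized least squares.

```python
import numpy as np

def rbf_readout(x_train, y_train, x_test, sigma=1.0, lam=1e-3):
    """Gaussian RBF classifier: centers at the training examples,
    coefficients from ridge regression; returns real-valued scores
    whose sign gives the predicted class."""
    x_train = np.asarray(x_train, float)
    x_test = np.asarray(x_test, float)
    y_train = np.asarray(y_train, float)

    def K(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    coef = np.linalg.solve(K(x_train, x_train) + lam * np.eye(len(x_train)),
                           y_train)
    return K(x_test, x_train) @ coef
```

In the animal/non-animal task, the sign of the score would play the role of the PFC decision.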

S2 units:
- Features of moderate complexity (n ~ 1,000 types)
- Combination of V1-like complex units at different orientations
- Synaptic weights w learned from natural images
- 5-10 subunits chosen at random from all possible afferents (~100-1,000)

[Figure: subunit maps, from stronger suppression to stronger facilitation]

Anzai, Peng & Van Essen, "Neurons in monkey visual area V2 encode combinations of orientations," Nature Neuroscience 10, 1313-1321 (2007); published online 16 September 2007, doi:10.1038/nn1975.

C2 units:
- Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
- Local pooling over S2 units with the same selectivity but slightly different positions and scales
- A prediction to be tested: S2 units in V2 and C2 units in V4
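In code form (a sketch; the map layout is an assumption), a C2 unit simply takes the max of one template's S2 response maps across positions and scales, so selectivity is inherited while tolerance grows:

```python
import numpy as np

def c2_response(s2_maps):
    """C2 unit: global max over the S2 responses to one template,
    computed at all positions (map entries) and scales (one map
    per scale band)."""
    return max(float(np.max(m)) for m in s2_maps)
```

Shifting or rescaling the preferred stimulus moves the peak within or between maps but leaves the C2 output unchanged.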

Presenter Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project; so the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.

A loose hierarchy:
- Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
- "Replication" of simpler selectivities from lower to higher areas
- Richer dictionary of features with various levels of selectivity and invariance

Presenter Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison with neural data:

- V1:
  - Simple and complex cells: tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  - MAX operation in a subset of complex cells (Lampl et al. 2004)
- V4:
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
- IT:
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
- Human:
  - Rapid categorization (Serre, Oliva & Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red color indicates data which are consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and also data which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison with V4 (Pasupathy & Connor 2001): tuning for curvature and boundary conformations.

V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78, with no parameter fitting (stimuli from Pasupathy & Connor 1999).

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures, etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

Cadieu, Kouh, Pasupathy, Connor, Riesenhuber & Poggio, "A Model of V4 Shape Selectivity and Invariance," J. Neurophysiol. 98: 1733-1750, 2007 (first published June 27, 2007).

Presenter Presentation Notes
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but with a greedy learning algorithm fit to single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. model C2 units.

Paradigm: reference stimulus (fixed), probe (varying).
x-axis: response(probe) - response(reference)
y-axis: response(pair) - response(reference)

Prediction: the response to the pair falls between the responses elicited by the two stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Presentation Notes
The biased-competition model predicts that with attention away the line should pass through the origin with a > 0 coefficient.
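The prediction can be sketched numerically (the exponent value is an illustrative assumption): if a C2-like unit soft-max-pools the reference and the probe, its response to the pair is bracketed by the responses to either stimulus alone.

```python
def pair_response(r_reference, r_probe, q=2.0):
    """Soft-max pooling of the responses to the two stimuli shown
    alone; the predicted pair response lies between them."""
    num = r_reference ** (q + 1) + r_probe ** (q + 1)
    den = r_reference ** q + r_probe ** q
    return num / (den + 1e-12)
```

Sweeping the probe strength then traces out the line with positive slope seen in the Reynolds et al. scatter plots.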

Agreement with IT read-out data (Hung, Kreiman, Poggio & DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005).

Presenter Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT, which is where they recorded in this study, one can already read out object category and identity from the very first spikes.

Remarks:
- The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well.
- In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments.
- The inputs to IT are a large dictionary of selective and invariant features.

Rapid categorization

[Movie: animal vs. non-animal stimuli]

Database collected by Oliva & Torralba

Presenter Presentation Notes
Any task can be made arbitrarily easy or difficult...

Rapid categorization task (with mask, to test the feedforward model): animal present or not?

Image: 20 ms; image-mask interval (ISI): 30 ms; mask (1/f noise): 80 ms.

(Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Mace et al. 2005)

Presenter Presentation Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. The model is feedforward (you can ask me during question time); the mask tries to block feedback in visual cortex.

...solves the problem (when the mask forces feedforward processing)...

Human observers (n = 24): 80%. Model: 82%. (Serre, Oliva & Poggio 2007)

- d' ~ standardized error rate
- the higher the d', the better the performance
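For reference, d' can be computed from hit and false-alarm rates as follows (a standard signal-detection sketch; the rates used in the test are illustrative, not the paper's):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate):
    a criterion-free, standardized measure of performance."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

Unlike raw percent correct, d' separates sensitivity from response bias, which is why it is used to compare humans and the model.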

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons:
- Image-by-image correlation between model and human observers:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
- The model predicts the level of performance on rotated images (90 deg and inversion)

(Serre, Oliva & Poggio, PNAS 2007)

Source: Bileschi & Wolf

The street scene project

Presenter Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/


Examples

- HoG (Dalal & Triggs 2005)
- Part-based system (Leibe et al. 2004)
- Local patch correlation (Torralba et al. 2004)

(Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007)

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?
- A challenge for physiology: disprove basic aspects of the architecture
- Extensions to color and stereo
- More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
- Extension to time and videos
- Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


- Generic dictionary of shape components (from V1 to IT)
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
- Task-specific circuits (from IT to PFC)
  - Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
(with T. Masquelier & S. Thorpe, CNRS, France)

(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

- Simple cells learn correlations in space (at the same time)
- Complex cells learn correlations in time

[Movie]

Presenter Presentation Notes
~19 hours of video, ~10^6 frames; a developmental-like learning stage.
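A toy version of this temporal-continuity idea (the update form follows Foldiak's 1991 trace rule; the learning-rate and decay values are illustrative assumptions): weights are updated not by the instantaneous response but by a decaying temporal trace of it, so inputs that follow each other in time get wired to the same unit.

```python
import numpy as np

def trace_rule_step(w, x, trace, eta=0.05, decay=0.8):
    """One update of a trace rule: Hebbian learning gated by a
    low-pass-filtered ("trace") activity, linking temporally
    adjacent inputs to one invariance-learning unit."""
    y = float(np.dot(w, x))                     # response to current frame
    trace = decay * trace + (1 - decay) * y     # decaying activity trace
    w = w + eta * trace * np.asarray(x, float)  # trace-gated Hebbian update
    return w / np.linalg.norm(w), trace
```

Presenting two views of the same object in consecutive frames transfers weight from the first view's features to the second's, which is how a complex-like unit can learn its pooling range from video.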


The problem: training videos and testing videos (each video ~4 s, 50-100 frames).

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005).

Previous work: recognizing biological motion using a model of the dorsal stream. Adapted from (Giese & Poggio 2003).

Multi-class recognition accuracy (%):

Dataset      Baseline  Our system
KTH Human    81.3      91.6
UCSD Mice    75.6      79.0
Weiz. Human  86.7      96.3
Average      81.2      89.6

(chance: 10-20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter Presentation Notes
Our method performs 8% better on average.

What is next: beyond feedforward models (limitations)

- Psychophysics: Serre, Oliva & Poggio 2007
- V4: Reynolds, Chelazzi & Desimone 1999
- IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

- Recognition in clutter is increasingly difficult
- Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
- Notice: this is a "novel" justification for the need of attention

Conditions: 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition. (Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond ~50 ms SOA the model is not good enough.

Presenter Presentation Notes
False-alarm rates were ~16% for humans (except ~14% in the no-mask condition) and ~18% for the model.

Ongoing work... attention and cortical feedbacks:

- Model implementation of Wolfe's guided search (1994)
- Parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

(Chikkerur, Serre, Walther, Koch & Poggio, in prep.)

Face detection: scanning vs. attention. [Plot: detection accuracy vs. number of items]

Example results

What is next?
- Image inference: attentional or Bayesian models
- Why hierarchies? Beyond a model, towards a theory
- Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

- Normal recognition by humans (for long times) is much better
- Normal vision is much more than categorization or identification: image understanding / inference / parsing
- A feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
- Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of the routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines.

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005.
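As a minimal illustration of the inference idea (a toy sketch, not any specific model in the list above): the belief in each top-down hypothesis is its prior reweighted by the likelihood of the bottom-up input.

```python
import numpy as np

def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses:
    P(h | input) is proportional to P(input | h) * P(h)."""
    p = np.asarray(prior, float) * np.asarray(likelihood, float)
    return p / p.sum()
```

Chaining this across levels, with each area's posterior serving as the next area's prior, gives the flavor of hierarchical inference in models like Lee and Mumford's.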

What is next?
- Image inference: attentional or Bayesian models
- Why hierarchies? Beyond a model, towards a theory
- Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, ...)

"How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples. [...] A comparison with real brains offers another related challenge to learning theory. The 'learning algorithms' we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory? [...] Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. [...] There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex."

Poggio and Smale, "The Mathematics of Learning: Dealing with Data," Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie, "Derived Distance: towards a mathematical theory of visual cortex," CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: mathematical results on unsupervised learning of invariances and of a dictionary of shapes from image sequences... (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark: there is still much to do before we understand vision... and the brain.

Collaborators

- Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
- Comparison with humans: A. Oliva
- Action recognition: H. Jhuang
- Attention: S. Chikkerur, C. Koch, D. Walther
- Computer vision: S. Bileschi, L. Wolf
- Learning invariances: T. Masquelier, S. Thorpe


Some personal history. First step in developing a model: learning to recognize 3D objects in IT cortex (Poggio & Edelman 1990). [Examples of visual stimuli]

An idea for a module for view-invariant identification:
- Architecture that accounts for invariances to 3D effects (>1 view needed to learn)
- Regularization Network (GRBF) with Gaussian kernels
- View-tuned units, arranged along the view angle, feed a view-invariant, object-specific unit
- Prediction: neurons become view-tuned through learning

(Poggio & Edelman 1990)
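A sketch of this module, assuming views are encoded as feature vectors (the function name and kernel width are illustrative): each view-tuned unit is a Gaussian RBF centered on a stored view, and the view-invariant, object-specific unit sums them.

```python
import numpy as np

def view_invariant_unit(stored_views, view, sigma=1.0):
    """GRBF module in the spirit of Poggio & Edelman (1990):
    Gaussian view-tuned units centered on a few stored views,
    summed into one object-specific output that stays high for
    views interpolated between the stored ones."""
    d2 = np.sum((np.asarray(stored_views, float)
                 - np.asarray(view, float)) ** 2, axis=1)
    view_tuned = np.exp(-d2 / (2 * sigma ** 2))  # one unit per stored view
    return float(view_tuned.sum())               # view-invariant output
```

With more than one stored view, the output generalizes to novel in-between views of the same object while staying low for distant (other-object) inputs, which is the behavior the physiology below tests.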

Learning to recognize 3D objects in IT cortex (Logothetis, Pauls & Poggio 1995). [Examples of visual stimuli]

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, ...), which supports models based on view-tuned units... physiology.

Recording sites in anterior IT (LUN, LAT, IOS, STS, AMTS; Ho = 0) (Logothetis, Pauls & Poggio 1995). ...neurons tuned to faces are intermingled nearby...

Neurons tuned to object views, as predicted by the model (Logothetis, Pauls & Poggio 1995).

A "view-tuned" IT cell: [Figure: tuning curve over target views from -168° to +168° in 12° steps, peaking at ~60 spikes/s (800 ms presentation), with low responses to 120 distractor objects] (Logothetis, Pauls & Poggio 1995)

Presenter Presentation Notes
A view-tuned cell found in the experiment; the view tuning is specific to the preferred stimulus.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings) (Logothetis, Pauls & Poggio 1995).

Scale-invariant responses of an IT neuron: [Figure: peristimulus-time histograms (0-3000 ms, up to 76 spikes/s) for the preferred stimulus at sizes 1.0° (x0.4), 1.75° (x0.7), 2.5° (x1.0), 3.25° (x1.3), 4.0° (x1.6), 4.75° (x1.9), 5.5° (x2.2) and 6.25° (x2.5)]

View-tuned cells show scale invariance (from one training view only): this motivates the present model. (Logothetis, Pauls & Poggio 1995)

From "HMAX" to the model now... (Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva & Poggio 2007)

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits (source: modified from Jody Culham's web slides)

Neuron basics:
- INPUT = pulses (spikes) or graded potentials
- COMPUTATION = analog
- OUTPUT = chemical

Some numbers:
- Human brain: 10^11-10^12 neurons (a fly has ~1 million); 10^14-10^15 synapses
- Neuron:
  - Fundamental space dimensions: fine dendrites ~0.1 μm in diameter; lipid-bilayer membrane ~5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  - Fundamental time length: ~1 ms

The cerebral cortex:

                        Human                     Macaque
Thickness               3-4 mm                    1-2 mm
Total surface area      ~1600 cm^2 (both sides,   ~160 cm^2
                        ~50 cm diam.)             (~15 cm diam.)
Neurons per mm^2        ~10^5                     ~10^5
Total cortical neurons  ~2 x 10^10                ~2 x 10^9
Visual cortex           300-500 cm^2              80+ cm^2
Visual neurons          ~4 x 10^9                 ~10^9

Gross brain anatomy: a large percentage of the cortex is devoted to vision.

Presenter Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too...): vision, audition, somatosensory, motor... Learning and adaptation are everywhere. Q: what are the computational mechanisms? Why can we hope that there are common computational mechanisms? Because of common hardware: cortex.

The visual system [Van Essen & Anderson 1990]

Presenter Presentation Notes
In our studies of learning and information processing in the brain we focus on the visual system, the most extensively researched modality: its areas and how they are connected.

V1: hierarchy of simple and complex cells (Hubel & Wiesel 1959): LGN-type cells, simple cells, complex cells.

Presenter Presentation Notes
LGN-type cells: a dot of light, a bright dot on a dark background, and vice-versa. From dots of light to "oriented bars": simple cells have slightly larger RFs, prefer either a white bar on a black background or vice-versa, and are sensitive to the location of the bar within the RF. Complex cells: bar detectors insensitive to the polarity of contrast (white on black or black on white), with larger RFs, insensitive to the location of the bar within the RF. How does that work?

V1: orientation selectivity (Hubel & Wiesel movie)

V1: retinotopy (Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994] and [Rolls 2004].

Beyond V1: a gradual increase in RF size (reproduced from Kobatake & Tanaka 1994).

Beyond V1: a gradual increase in the complexity of the preferred stimulus.

AIT: face cells (reproduced from Desimone et al. 1984).

AIT: immediate recognition (identification and categorization) (Hung, Kreiman, Poggio & DiCarlo 2005). See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001.

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w/ V4

Pasupathy & Connor 2001

Tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (stimuli from Pasupathy & Connor 1999). No parameter fitting.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed), probe (varying):
Δ = response(probe) – response(reference)
Δ = resp(pair) – resp(reference)

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin and should have a coefficient > 0.

Agreement w/ IT read-out data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in the study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier – like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
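The read-out idea — a simple linear classifier trained on a large dictionary of selective, invariant features — can be sketched in a few lines. This is an illustrative toy (random clouds stand in for IT-like population responses; the actual experiments used recorded IT populations and model C2 features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an IT-like feature population: two categories, each a noisy
# cloud around its own mean response pattern across ~256 features.
n_per_class, n_features = 100, 256
mu_a = rng.normal(size=n_features)
mu_b = rng.normal(size=n_features)
X = np.vstack([mu_a + 0.5 * rng.normal(size=(n_per_class, n_features)),
               mu_b + 0.5 * rng.normal(size=(n_per_class, n_features))])
y = np.array([1] * n_per_class + [-1] * n_per_class)

# Linear read-out: regularized least squares, a one-layer learner.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
accuracy = float(np.mean(np.sign(X @ w) == y))
```

The point of the remark above is exactly this: once the features are selective and invariant, even a one-layer linear read-out suffices to separate categories.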

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

20 ms image, 30 ms interval (image–mask ISI), 80 ms mask (1/f noise)

(Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005)

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%
Model: 82%

(Serre, Oliva & Poggio 2007)

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
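d' is the standard criterion-free sensitivity measure from signal detection theory, computed from hit and false-alarm rates. A minimal sketch (the symmetric-observer example is an illustration, not the paper's exact rates):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# A symmetric observer at 80% correct (hits = 0.8, false alarms = 0.2)
# comes out around d' = 1.68.
```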

Further comparisons

• Image-by-image correlation
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream, where ideally you would like to input an image such as the street scene on the left and have the system automatically parse it into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples  5,799  1,449       209      5,067     4,932  3,400  2,562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
w/ T. Masquelier & S. Thorpe (CNRS, France)

(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; a developmental-like learning stage.
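The space/time asymmetry above can be sketched with a Foldiak-style trace rule, a standard formalization of this idea (a toy version, not the exact spike-timing scheme used with Masquelier & Thorpe): the complex unit's weights follow a temporally low-pass-filtered version of its own activity, so inputs that follow each other in time end up driving the same unit.

```python
import numpy as np

def trace_rule_update(w, x_t, trace, eta=0.05, delta=0.8):
    """One step of a Foldiak (1991)-style trace rule.

    w      -- afferent weights of the complex unit
    x_t    -- input (e.g. simple-cell responses) at time t
    trace  -- running low-pass filter of the unit's activity
    """
    y = float(w @ x_t)                         # instantaneous response
    trace = delta * trace + (1.0 - delta) * y  # temporal trace of activity
    w = w + eta * trace * (x_t - w)            # pull weights toward recent inputs
    return w, trace

# Frames of a smoothly transforming stimulus (here simply the same pattern
# repeated) gradually pull the weights onto that pattern.
w, trace = np.full(3, 0.2), 0.0
for _ in range(500):
    w, trace = trace_rule_update(w, np.array([1.0, 0.0, 0.0]), trace)
```

Run over frames of an object undergoing a smooth transformation, the same mechanism makes one unit respond to all the transformed views — invariance learned from temporal continuity.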

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos and testing videos (each video ~4 s, 50–100 frames)

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy:

Dataset       Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10–20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva & Poggio 2007
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

(Plot conditions: model; 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition)

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, the model is not good enough

Presenter
Presentation Notes
False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based, top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs attention (plot: detection accuracy vs. number of items)

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: mathematical results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe


An idea for a module for view-invariant identification

Architecture that accounts for invariances to 3D effects (>1 view needed to learn)

Regularization Network (GRBF) with Gaussian kernels

View angle → VIEW-INVARIANT, OBJECT-SPECIFIC UNIT

Prediction: neurons become view-tuned through learning

Poggio & Edelman 1990
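The Poggio & Edelman module is a regularization network with Gaussian kernels: each hidden unit is Gaussian-tuned to one stored training view, and the output unit combines them into a view-invariant, object-specific response. A minimal sketch (views here are abstract feature vectors; the centers and coefficients are toy values, not fitted ones):

```python
import numpy as np

def grbf(x, centers, coeffs, sigma=1.0):
    """Regularization network (GRBF):
    f(x) = sum_i c_i * exp(-||x - t_i||^2 / (2 sigma^2)),
    where each center t_i is a stored training view of the object."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    view_tuned = np.exp(-d2 / (2.0 * sigma ** 2))  # view-tuned hidden units
    return float(coeffs @ view_tuned)              # view-invariant output unit

# Two stored views of an object; an intermediate (unseen) view still drives
# the output unit strongly, while a very different input does not.
centers = np.array([[0.0, 0.0], [1.0, 0.0]])
coeffs = np.ones(2)
```

The prediction on the slide follows directly from this architecture: the hidden units of such a network are view-tuned, which is what the IT recordings described next looked for.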

Learning to Recognize 3D Objects in IT Cortex
(Logothetis, Pauls & Poggio 1995)

Examples of visual stimuli

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units… physiology

Recording Sites in Anterior IT

(Figure: recording sites shown relative to the LUN, LAT, IOS, STS and AMTS sulci; Ho = 0)

Logothetis, Pauls & Poggio 1995

…neurons tuned to faces are intermingled nearby…

Neurons tuned to object views, as predicted by model

Logothetis, Pauls & Poggio 1995

A "View-Tuned" IT Cell

(Figure: responses, up to 60 spikes/sec over 800 msec, to target views from -168° to +168° in 12° steps, and to distractors)

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
View-tuned cell found in the experiment; <show> preferred stimulus, specific viewpoint tuning.

But also view-invariant, object-specific neurons (5 of them over 1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale-Invariant Responses of an IT Neuron

(Figure: responses, 0–76 spikes/sec over 0–3000 msec, for stimulus sizes 1.0°, 1.75°, 2.5°, 3.25°, 4.0°, 4.75°, 5.5° and 6.25°, i.e. ×0.4 to ×2.5 of the 2.5° training size)

View-tuned cells: scale invariance (one training view only) motivates present model

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
pretty amazing

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits
Source: modified from Jody Culham's web slides

Neuron basics

spikes

INPUT = pulses or graded potentials
COMPUTATION = analog
OUTPUT = chemical

Some numbers

• Human brain
  – 10^11–10^12 neurons (1 million: flies!)
  – 10^14–10^15 synapses
• Neuron
  – Fundamental space dimensions: fine dendrites ~0.1 μm diameter; lipid bilayer membrane 5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – Fundamental time length: 1 msec

The cerebral cortex

                                Human                     Macaque
Thickness                       3–4 mm                    1–2 mm
Total surface area (both sides) ~1600 cm² (~50 cm diam)   ~160 cm² (~15 cm diam)
Neurons/mm²                     ~10⁵                      ~10⁵
Total cortical neurons          ~2 × 10^10                ~2 × 10^9
Visual cortex                   300–500 cm²               80+ cm²
Visual neurons                  ~4 × 10^9                 ~10^9

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation everywhere. --> Q: what are the computational mechanisms? Why can we hope that there are common computational mechanisms? --> common hardware: CORTEX.

The Visual System

[Van Essen & Anderson 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system --- the most extensively researched modality. Areas; connections.

V1: hierarchy of simple and complex cells

LGN-type cells → Simple cells → Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: dot of light, a bright dot on a dark background and vice-versa. From dots of light to "oriented bars": slightly larger RFs, either a white bar on a black background or vice-versa, sensitive to the location of the bar within the RF. Then a bar detector insensitive to the polarity of contrast (white on black or black on white), with a larger RF, insensitive to the location of the bar within the RF. How does that work?

V1: orientation selectivity

Hubel & Wiesel movie

V1: retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]; reproduced from [Rolls 2004]

Beyond V1: a gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: a gradual increase in the complexity of the preferred stimulus

AIT: face cells
Reproduced from (Desimone et al. 1984)

AIT: immediate recognition
Hung, Kreiman, Poggio & DiCarlo 2005

identification; categorization

See also: Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream
(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                       Operation
Simple      Selectivity (template matching)   Gaussian tuning (and-like)
Complex     Invariance (pooling)              Soft-max (or-like)

Simple units: Gaussian-like tuning operation (and-like)
Complex units: max-like operation (or-like)
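The two operations can be written down in a few lines (a sketch; σ and the softmax exponent q are free parameters here, not the model's fitted values):

```python
import numpy as np

def gaussian_tuning(x, w, sigma=1.0):
    """Simple-unit ('and-like') operation: Gaussian template matching,
    maximal when the input x equals the stored template w."""
    d2 = np.sum((np.asarray(x, dtype=float) - np.asarray(w, dtype=float)) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def softmax_pool(responses, q=4.0):
    """Complex-unit ('or-like') operation: soft-max over afferents.
    As q grows, the output approaches the hard max of the responses."""
    r = np.asarray(responses, dtype=float)
    return float(np.sum(r ** (q + 1)) / (np.sum(r ** q) + 1e-12))
```

gaussian_tuning supplies selectivity (the unit fires for one specific pattern); softmax_pool supplies invariance (the unit fires whenever any of its afferents carries the preferred feature, regardless of which one).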

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophysical implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007).

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002).

Adelson and Bergen (see also Hassenstein and Reichardt 1956).
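The canonical-circuit claim can be made concrete with the divisive-normalization form studied by Kouh & Poggio, y = (Σ_j w_j x_j^p) / (k + (Σ_j x_j^q)^r): the numerator is the excitatory drive and the denominator the shunting (divisive) inhibition pooled over the afferents. Different exponents turn the same circuit into either key operation (the parameter values below are illustrative):

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-9):
    """Divisive normalization: weighted excitation divided by a pooled,
    shunting-inhibition-like suppressive term."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(w, x ** p) / (k + np.sum(x ** q) ** r))

# p = q + 1, r = 1, uniform weights -> softmax, i.e. a max-like operation:
max_like = canonical_circuit([0.1, 0.9], np.ones(2), p=3, q=2, r=1)

# p = 1, q = 2, r = 1/2 -> normalized dot product, a tuning-like operation
# that peaks when the input points along the weight vector:
tuned = canonical_circuit([1.0, 1.0], np.array([1.0, 1.0]) / np.sqrt(2),
                          p=1, q=2, r=0.5)
```

One circuit, two regimes: the same wiring yields the complex-unit max or the simple-unit tuning depending only on the effective nonlinearities.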

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

(Figure legend: stronger facilitation / stronger suppression)

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007, doi:10.1038/nn1975.

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.
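The S2/C2 division of labor can be sketched directly: an S2 unit applies Gaussian tuning to one local patch of afferents, and a C2 unit takes the max of that same selectivity over positions. This is a toy version (a single 2D map instead of the model's multi-orientation C1 stack, and no pooling over scale bands):

```python
import numpy as np

def s2_response(patch, template, sigma=1.0):
    """S2 unit: Gaussian template matching on one local patch of afferents."""
    return float(np.exp(-np.sum((patch - template) ** 2) / (2.0 * sigma ** 2)))

def c2_response(c1_map, template):
    """C2 unit: max of the same S2 selectivity over all positions,
    giving position tolerance without losing selectivity."""
    n = template.shape[0]
    h, w = c1_map.shape[0] - n + 1, c1_map.shape[1] - n + 1
    return max(s2_response(c1_map[i:i + n, j:j + n], template)
               for i in range(h) for j in range(w))
```

Shifting the preferred pattern inside the map leaves the C2 response unchanged: same selectivity, more invariance — which is exactly why the two unit types are so hard to tell apart experimentally.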

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version… Later I am going to show that this is important.

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse it into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, and classify each window as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and roads.

The StreetScenes Database

Object             car    pedestrian  bicycle  building  tree   road   sky
Labeled examples   5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide
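The presenter notes describe classifying model features over small windows at all positions and scales. A minimal single-scale sketch of that scanning loop, with a stub standing in for the trained feature-based classifier (the window size, step, and stub rule are made up for illustration):

```python
import numpy as np

CLASSES = ["car", "pedestrian", "bicycle", "building", "tree", "road", "sky"]

def classify_window(patch: np.ndarray) -> str:
    # Stub for (model features -> trained classifier); here it just maps
    # mean intensity to a class index so the loop is runnable end to end.
    return CLASSES[int(patch.mean() * len(CLASSES)) % len(CLASSES)]

def scan(image: np.ndarray, window: int = 32, step: int = 16) -> list:
    """Slide a window over all positions (one scale) and label each crop."""
    h, w = image.shape
    detections = []
    for top in range(0, h - window + 1, step):
        for left in range(0, w - window + 1, step):
            patch = image[top:top + window, left:left + window]
            detections.append((top, left, classify_window(patch)))
    return detections

image = np.random.default_rng(1).random((64, 64))
detections = scan(image)
```

The real system repeats this loop over an image pyramid so that each object class is detected at multiple scales.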

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al 2004)
• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  - Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage
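A classical way to "learn correlation in time" is a Foldiak-style trace rule, where the Hebbian update uses a temporal trace of the postsynaptic activity instead of the instantaneous response. A toy sketch (illustrative parameters; not the Masquelier & Thorpe implementation):

```python
import numpy as np

def trace_rule(inputs, w, eta=0.1, delta=0.8):
    """Foldiak-style trace rule sketch: inputs that occur close together
    in time (e.g. a translating stimulus) wire onto the same unit, because
    the weight update uses a decaying trace of the unit's recent activity."""
    trace = 0.0
    for x in inputs:                      # x: presynaptic vector at time t
        y = float(w @ x)                  # postsynaptic response
        trace = delta * trace + (1 - delta) * y
        w = w + eta * trace * x           # Hebb, with trace in place of y
        w = w / np.linalg.norm(w)         # keep weights bounded
    return w

rng = np.random.default_rng(0)
w0 = rng.random(4)
# A "movie": the same feature active at shifting positions on consecutive frames.
frames = [np.eye(4)[i] for i in (0, 1, 2, 3)] * 5
w = trace_rule(frames, w0 / np.linalg.norm(w0))
```

Because the trace bridges consecutive frames, the unit ends up responding to all positions the stimulus visited, which is exactly the complex-cell style of invariance.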

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50~100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al 2005); see also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human     81.3%      91.6%
UCSD Mice     75.6%      79.0%
Weiz. Human   86.7%      96.3%
Average       81.2%      89.6%

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva, Poggio 2007
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

(plot: human vs. model performance under four conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), no mask)

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms SOA the model is not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work… attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) processing to suppress clutter (Tsotsos et al)

Chikkerur, Serre, Walther, Koch & Poggio, in prep
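The two-stage scheme (a parallel, feature-based reweighting into a priority map, then serial spatial selection) can be sketched as follows; the feature maps and target weights are made-up stand-ins, not the model's actual front end:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bottom-up feature maps: n_features x height x width.
feature_maps = rng.random((4, 8, 8))
target_feature_weights = np.array([0.1, 0.1, 0.1, 3.0])  # task: find feature 3

# Parallel stage: feature-based top-down attention reweights the maps
# into a single priority map (cf. Wolfe's guided search).
priority = np.tensordot(target_feature_weights, feature_maps, axes=1)

# Serial stage: spatial attention visits locations in order of priority.
flat_order = np.argsort(priority, axis=None)[::-1]        # highest first
scanpath = [tuple(np.unravel_index(i, priority.shape)) for i in flat_order]
first_fixation = scanpath[0]
```

Locations rich in the target feature are visited first, which is how the parallel stage cuts down the serial search through clutter.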

Face detection: scanning vs. attention (plot: detection accuracy vs. number of items)

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al 2006, …)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

"The Mathematics of Learning: Dealing with Data", Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences

…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe


Learning to Recognize 3D Objects in IT Cortex

Logothetis, Pauls & Poggio 1995

Examples of Visual Stimuli

After human psychophysics (Buelthoff, Edelman, Tarr, Sinha, …), which supports models based on view-tuned units, … physiology

Recording Sites in Anterior IT

(recording site labels: LUN, LAT, IOS, STS, AMTS; Ho = 0)

Logothetis, Pauls & Poggio 1995

…neurons tuned to faces are intermingled nearby…

Neurons tuned to object views as predicted by model

Logothetis, Pauls & Poggio 1995

A "View-Tuned" IT Cell

(tuning curves: responses to target views from -168° to +168° in 12° steps, and to distractors; ordinate: spikes/sec, max 60; 800 msec trials)

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
VT cell found in the experiment. [show] preferred stimulus; specific viewpoint tuning.

But also view-invariant object-specific neurons (5 of them over 1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale Invariant Responses of an IT Neuron

(PSTHs: spikes/sec (0-76) over 0-3000 msec, for stimulus sizes 1.0° (x0.4), 1.75° (x0.7), 2.5° (x1.0), 3.25° (x1.3), 4.0° (x1.6), 4.75° (x1.9), 5.5° (x2.2), 6.25° (x2.5))

View-tuned cells scale invariance (one training view only) motivates present model

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
pretty amazing

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

spikes

INPUT = pulses or graded potentials

COMPUTATION = analog

OUTPUT = chemical

Some numbers

• Human brain
  - 10^11 - 10^12 neurons (~1 million in flies)
  - 10^14 - 10^15 synapses
• Neuron
  - Fundamental space dimension: fine dendrites, 0.1 µm diameter; lipid bilayer membrane, 5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  - Fundamental time length: 1 msec

The cerebral cortex

                            Human                     Macaque
Thickness                   3 - 4 mm                  1 - 2 mm
Total surface area          ~1600 cm² (~50 cm diam)   ~160 cm² (~15 cm diam)
(both sides)
Neurons / mm²               ~10⁵                      ~10⁵
Total cortical neurons      ~2 x 10^10                ~2 x 10^9
Visual cortex               300 - 500 cm²             80+ cm²
Visual neurons              ~4 x 10^9                 ~10^9

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation everywhere. --> Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? --> common hardware: CORTEX

The Visual System

[Van Essen & Anderson 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system, the most extensively researched modality. Areas connected.

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: a dot of light, a bright dot on a dark background and vice versa. From dots of light to "oriented bars": slightly larger RFs, either a white bar on a black background or vice versa, sensitive to the location of the bar within the RF. Complex cells: bar detectors insensitive to the polarity of contrast (white on black or black on white), larger RF, insensitive to the location of the bar within the RF. How does that work?

V1 Orientation selectivity

Hubel & Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]; reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification

categorization

See also Oram & Perrett 1992; Tovee et al 1993; Celebrini et al 1993; Ringach et al 1997; Rolls et al 1999; Keysers et al 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Pooling computation               Operation
Simple      selectivity (template matching)   Gaussian-like tuning (and-like)
Complex     invariance                        soft-max (or-like)

(diagram: complex units perform the max-like operation (or-like); simple units perform the Gaussian-like tuning operation (and-like))
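The two operations in the table can be written in a few lines; the template, input vectors, and parameters below are illustrative, not values from the model:

```python
import numpy as np

def simple_unit(x, w, sigma=1.0):
    """Gaussian-like tuning (and-like): maximal response when the input
    pattern x matches the stored template w."""
    return float(np.exp(-np.sum((x - w) ** 2) / (2 * sigma ** 2)))

def complex_unit(responses, beta=8.0):
    """Soft-max pooling (or-like): approaches the max of the afferent
    responses as beta grows, giving invariance to which afferent fired."""
    r = np.asarray(responses, dtype=float)
    return float(np.sum(r ** (beta + 1)) / np.sum(r ** beta))

template = np.array([1.0, 0.0, 1.0])
r_match = simple_unit(np.array([1.0, 0.0, 1.0]), template)   # exact match
r_off   = simple_unit(np.array([0.0, 1.0, 0.0]), template)   # poor match
pooled  = complex_unit([r_match, r_off])                     # ~ the max
```

Stacking these two unit types in alternation is what builds up selectivity and invariance together through the hierarchy.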

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as a model of MT (Rust et al, Nature Neuroscience, 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)
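A sketch of such a canonical divisive-normalization circuit, after the Kouh & Poggio formulation (the exponents, constants, and input values here are illustrative): a single weighted-sum-over-pooled-norm form yields either Gaussian-like tuning or max-like pooling depending on the exponents.

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-9):
    """Weighted sum divided by a pooled norm of the inputs, as could be
    implemented by shunting (divisive) inhibition on a neuron's inputs."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(w * x ** p) / (k + np.sum(x ** q) ** r))

x = np.array([0.2, 0.9, 0.1])

# Tuning-like regime (p=1, q=2, r=1/2): a normalized dot product,
# maximal when the input matches the stored template w.
w_template = x / np.linalg.norm(x)
tuned = canonical_circuit(x, w_template, p=1, q=2, r=0.5)

# Max-like regime (uniform weights, p=q+1, r=1): the largest input dominates.
maxish = canonical_circuit(x, np.ones_like(x), p=3, q=2, r=1)
```

The appeal of this form is that the same biophysical circuit, with different effective exponents, can serve as either the "and-like" or the "or-like" operation.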

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  - Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  - Supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

(figure labels: stronger facilitation / stronger suppression)

Nature Neuroscience 10, 1313-1321 (2007), published online 16 September 2007, doi:10.1038/nn1975
"Neurons in monkey visual area V2 encode combinations of orientations", Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data

• V1:
  - Simple and complex cells tuning (Schiller et al 1976; Hubel & Wiesel 1965; DeValois et al 1982)
  - MAX operation in a subset of complex cells (Lampl et al 2004)
• V4:
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al 2002)
  - Two-spot interaction (Freiwald et al 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)
• IT:
  - Tuning and invariance properties (Logothetis et al 1995)
  - Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human:
  - Rapid categorization (Serre, Oliva, Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data used to constrain the model: V1 data used to build a population of V1 simple and complex units which mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red color indicates data consistent with the new model: in some cases data which already existed and were provided by the authors of the corresponding studies, and in other cases data (e.g. Miller, DiCarlo) actually collected in collaboration with experimentalists following a prediction from the model. I am not going to guide you through all of these data, but I will show a few key examples.

Comparison w| V4

Pasupathy & Connor 2001

Tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations (ρ = 0.78) vs. most similar model C2 unit

Pasupathy & Connor 1999

No parameter fitting

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J Neurophysiol 98: 1733-1750, 2007; first published June 27, 2007

"A Model of V4 Shape Selectivity and Invariance", Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from the receptive field) (Reynolds et al 1999) vs. C2 units

Reference (fixed), probe (varying)

(axes: abscissa = response(probe) - response(reference); ordinate = resp(pair) - resp(reference))

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that, with attention away, the line should pass through the origin and should get a > 0 coefficient.

Agreement w| IT Readout data

Hung, Kreiman, Poggio & DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded in this study), object category and identity can already be read out from the very first spikes.

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next? Beyond feedforward models: limitations

[Figure panels: psychophysics (Serre, Oliva & Poggio 2007); V4 (Reynolds, Chelazzi & Desimone 1999); IT (Zoccolan, Kouh, Poggio & DiCarlo 2007)]

What is next? Beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Figure: model vs. human performance for 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and the no-mask condition (Serre, Oliva and Poggio, PNAS 2007)]

Limitations: beyond 50 ms SOA, the model is not good enough.

Presenter Notes: False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… attention and cortical feedbacks

• Model implementation of Wolfe's guided search (1994)
• Parallel (feature-based, top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Figure: face detection, scanning vs. attention: detection accuracy vs. number of items]

Example Results
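A toy caricature of this parallel + serial scheme (our sketch, not the Chikkerur et al. implementation): top-down feature weights combine feature maps into a single priority map (the parallel, feature-based stage), and fixations are then generated serially by repeatedly attending the most active location and suppressing it (the serial, spatial stage with inhibition of return). The map shapes and weights below are made up for illustration.

```python
import numpy as np

def guided_search(feature_maps, top_down_weights, n_fixations=3):
    """Toy guided search: parallel feature-based weighting followed by
    serial spatial selection with inhibition of return."""
    # Parallel stage: top-down weighted sum of feature maps -> priority map
    priority = sum(w * f for w, f in zip(top_down_weights, feature_maps))
    fixations = []
    for _ in range(n_fixations):
        # Serial stage: attend the currently most active location
        idx = np.unravel_index(np.argmax(priority), priority.shape)
        fixations.append(idx)
        priority[idx] = -np.inf   # inhibition of return
    return fixations

# Two 4x4 feature maps; the task-relevant feature (map 0) peaks at (1, 2)
f0 = np.zeros((4, 4)); f0[1, 2] = 1.0
f1 = np.zeros((4, 4)); f1[3, 0] = 1.0
fixations = guided_search([f0, f1], top_down_weights=[1.0, 0.2])
print(fixations)
```

With the task-relevant feature weighted up, the first fixation lands on the target location without scanning every item, which is the point of guided search versus exhaustive scanning.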

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next? Image inference: backprojections and attentional mechanisms

• Normal recognition by humans (for long presentation times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• A feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of the routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)
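To make "neurons represent conditional probabilities" concrete, here is a deliberately tiny sketch (our illustration, with made-up numbers, not taken from any of the cited papers): a top-down hypothesis supplies a prior over image causes, and a bottom-up feature observation updates it by Bayes' rule.

```python
# Toy analysis-by-synthesis step: combine a top-down prior over hypotheses
# with a bottom-up likelihood of the observed feature, via Bayes' rule.
def posterior(prior, likelihood):
    joint = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(joint.values())          # normalizing constant
    return {h: p / z for h, p in joint.items()}

# Prior: top-down context makes "animal" somewhat likely (assumed numbers)
prior = {"animal": 0.3, "non-animal": 0.7}
# Likelihood of the observed bottom-up feature under each hypothesis
likelihood = {"animal": 0.8, "non-animal": 0.2}

post = posterior(prior, likelihood)
print(post)
```

In the Bayesian-model picture, an update of this kind would be iterated up and down the hierarchy until the areas settle on globally consistent values.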

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

"How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples."

"A comparison with real brains offers another, related challenge to learning theory. The 'learning algorithms' we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?"

"Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex."

Tomaso Poggio and Steve Smale, "The Mathematics of Learning: Dealing with Data." Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003.
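The shared-dictionary argument can be illustrated with a minimal numerical sketch (ours, not from the paper): one fixed "low-level" feature layer is reused, unchanged, by linear classifiers for two different tasks. The data, tasks, and the trivial shared dictionary below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 10-D; only the first two dims are informative.
X = rng.normal(size=(200, 10))
y_task1 = (X[:, 0] > 0).astype(float)   # task 1 depends on feature 0
y_task2 = (X[:, 1] > 0).astype(float)   # task 2 depends on feature 1

# Shared "low-level" dictionary: here trivially the two informative
# dimensions, standing in for an unsupervised-learned feature layer.
W_shared = np.eye(10)[:, :2]
Z = X @ W_shared                         # the same features feed both tasks

def train_linear(Z, y):
    """Least-squares linear classifier on the shared features."""
    A = np.c_[Z, np.ones(len(Z))]        # add a bias column
    w, *_ = np.linalg.lstsq(A, 2 * y - 1, rcond=None)
    return w

def accuracy(Z, y, w):
    pred = (np.c_[Z, np.ones(len(Z))] @ w > 0).astype(float)
    return float((pred == y).mean())

w1 = train_linear(Z, y_task1)
w2 = train_linear(Z, y_task2)
acc1, acc2 = accuracy(Z, y_task1, w1), accuracy(Z, y_task2, w2)
print(acc1, acc2)
```

Only the small task-specific classifiers are trained per task; the feature layer is learned (or here, fixed) once, which is one way the sample complexity of each new task can stay low.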

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: mathematical results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation.

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10-year-old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: orientation selectivity
  • V1: retinotopy
  • Slide Number 34
  • Beyond V1: a gradual increase in RF size
  • Beyond V1: a gradual increase in the complexity of the preferred stimulus
  • AIT: face cells
  • AIT: immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models: limitations
  • What is next: beyond feedforward models: limitations
  • Limitations: beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 18: NIPS2007: visual recognition in primates and machines

Recording Sites in Anterior IT

LUNLAT

IOS

STS

AMTSLATSTS

AMTS

Ho=0

Logothetis Pauls

amp Poggio 1995

hellipneurons tuned to faces are intermingled

nearbyhellip

Neurons tuned to object views as predicted by model

Logothetis

Pauls

amp Poggio 1995

A ldquoView-Tunedrdquo IT Cell

12 7224 8448 10860 12036 96

12 24 36 48 60 72 84 96 108 120 132 168o o o o o o o o o o o o

-108 -96 -84 -72 -60 -48 -36 -24 -12 0-168 -120

Distractors

Target Views60

spi

kes

sec

800 msec

-108 -96 -84 -72 -60 -48 -36 -24 -12 0-168 -120 oo o o o o o o o oo o

Logothetis

Pauls

amp Poggio 1995

Presenter
Presentation Notes
VT cell found in the experiment13ltshowgtpref StimspecificVP tuning

But also view-invariant object-specific neurons (5 of them over 1000 recordings)

Logothetis

Pauls

amp Poggio 1995

Scale Invariant Responses of an IT Neuron

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Sik

0

76

0 2000 3000Time (msec)

1000

Sik

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

ik

0

76

40 deg(x 16)

475 deg(x 19)

55 deg(x 22)

625 deg(x 25)

25 deg(x 10)

10 deg(x 04)

175 deg(x 07)

325 deg(x 13)

View-tuned cells scale invariance (one training view only) motivates present model

Logothetis

Pauls

amp Poggio 1995

Presenter
Presentation Notes
pretty amazing

From ldquoHMAXrdquo

to the model nowhellip

Riesenhuber amp Poggio 1999 2000 Serre Kouh

Cadieu

Knoblich

Kreiman amp Poggio 2005 Serre Oliva Poggio 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Neural Circuits

Source Modified from Jody Culhamrsquos

web slides

Neuron basics

spikes

INPUT= pulses or graded potentials

COMPUTATION = Analog

OUTPUT = Chemical

Some numbers

bull

Human Brainndash

1011-1012

neurons (1 million flies )

ndash

1014- 1015

synapses

bull

Neuronndash

Fundamental space dimension

bull

fine dendrites 01 micro

diameter lipid bilayer

membrane 5 nm thick specific proteins pumps channels receptors enzymes

ndash

Fundamental time length 1 msec

The cerebral cortex

Human

MacaqueThickness

3 ndash 4 mm

1 ndash 2 mmTotal surface area

~1600 cm2 ~160 cm2

(both sides)

(~50cm diam)

(~15cm diam)

Neurons mmsup2

~10⁵ mm2 ~ 10⁵ mm2 Total cortical neurons

~2 x 1010

~2 x 109

Visual cortex

300 ndash

500 cm2

80+cm2Visual Neurons

~4 x 109 ~109 neurons

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In brain different regions are involved in different tasks13(and the brain areas are conveniently colored to reflect this toohellip)13vision audition somsens motorhellip Learning amp adaptation everywhere13---gt Q What are the compmech13Why can we hope that there are common computational mech13---gt common hardware CORTEX

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning amp informations processing in the brain we focus on the visual system --- most extensively researched modality1313areas connected

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel amp Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells dot of light bright dot on a dark bgd and vice-versa1313From dots of light to ldquooriented barsrdquo slightly larger RFs either white bar on black bgd or vice-versa sensitive to the location of the bar within RF1313Bar detector insensitive to polarity of contrast (white on black or black on white)13Larger RF13Insensitive to the location of the bar within RF1313How does that work

V1 Orientation selectivity

Hubel amp Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake amp Tanaka 1994] Reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake

amp Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung Kreiman Poggio amp DiCarlo 2005

identification

categorization

See also Oram

amp Perrett

1992 Tovee

et al 1993 Celebrini

et al 1993 Ringach

et al 1997 Rolls et al 1999 Keysers

et al 2001

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

T. Serre

Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. “is there an animal in the image?”)
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A “feedforward” version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A “View-Tuned” IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From “HMAX” to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for “immediate recognition”
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models, limitations
  • What is next: beyond feedforward models, limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 19: NIPS2007: visual recognition in primates and machines

Neurons tuned to object views, as predicted by model

Logothetis, Pauls & Poggio 1995

A “View-Tuned” IT Cell

[Figure: responses of one IT cell (spikes/s, scale bar 60 spikes/s, 800 msec trials) to target views spanning -168° to +168° in 12° steps, and to distractors.]

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
View-tuned cell found in the experiment. <show> preferred stimulus; specific viewpoint tuning.

But also view-invariant, object-specific neurons (5 of them over 1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale Invariant Responses of an IT Neuron

[Figure: response histograms of one IT neuron (spikes/s, 0-76; time axis 0-3000 msec) to the preferred object shown at sizes 1.0° (×0.4), 1.75° (×0.7), 2.5° (×1.0), 3.25° (×1.3), 4.0° (×1.6), 4.75° (×1.9), 5.5° (×2.2) and 6.25° (×2.5) of the training size.]

View-tuned cells: scale invariance (one training view only) motivates present model

Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
pretty amazing

From “HMAX” to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

spikes

INPUT = pulses or graded potentials
COMPUTATION = analog
OUTPUT = chemical

Some numbers

• Human brain
– 10¹¹-10¹² neurons (1 million in flies!)
– 10¹⁴-10¹⁵ synapses

• Neuron
– Fundamental space dimension: fine dendrites 0.1 µm diameter; lipid bilayer membrane 5 nm thick; specific proteins: pumps, channels, receptors, enzymes
– Fundamental time length: 1 msec

The cerebral cortex

                                  Human                     Macaque
Thickness                         3-4 mm                    1-2 mm
Total surface area (both sides)   ~1600 cm² (~50 cm diam.)  ~160 cm² (~15 cm diam.)
Neurons / mm²                     ~10⁵                      ~10⁵
Total cortical neurons            ~2 x 10¹⁰                 ~2 x 10⁹
Visual cortex                     300-500 cm²               80+ cm²
Visual neurons                    ~4 x 10⁹                  ~10⁹

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation everywhere. → Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? → common hardware: CORTEX.

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system --- the most extensively researched modality. Areas, connected.

V1: hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: dot of light, bright dot on a dark background and vice-versa. From dots of light to “oriented bars”: slightly larger RFs, either a white bar on black background or vice-versa, sensitive to the location of the bar within the RF. Then a bar detector insensitive to polarity of contrast (white on black or black on white), with a larger RF, insensitive to the location of the bar within the RF. How does that work?

V1: Orientation selectivity

Hubel & Wiesel movie

V1: Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]; reproduced from [Rolls 2004]

Beyond V1: A gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: A gradual increase in the complexity of the preferred stimulus

AIT: Face cells

Reproduced from (Desimone et al 1984)

AIT: Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification
categorization

See also Oram & Perrett 1992; Tovee et al 1993; Celebrini et al 1993; Ringach et al 1997; Rolls et al 1999; Keysers et al 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for “immediate recognition”

• It is in the family of “Hubel-Wiesel” models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)

• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Pooling computation               Operation
Simple      selectivity / template matching   Gaussian-like tuning (and-like)
Complex     invariance                        soft-max (or-like)

Complex units: max-like operation (or-like)
Simple units: Gaussian-like tuning operation (and-like)
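The two operations above can be sketched in a few lines (a minimal NumPy illustration, not the model's actual code; the function names, Gaussian width and soft-max exponent are my choices):

```python
import numpy as np

def simple_unit(x, w, sigma=1.0):
    """Gaussian-like tuning (and-like): template matching.
    Response is maximal (1.0) when the afferent vector x equals
    the stored template w, and falls off with distance."""
    d2 = np.sum((np.asarray(x, float) - np.asarray(w, float)) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def complex_unit(responses, beta=8.0):
    """Soft-max pooling (or-like): approaches the maximum of the
    afferent responses as beta grows, so the unit fires whenever
    *any* afferent fires, giving invariance."""
    r = np.asarray(responses, dtype=float)
    return float(np.sum(r ** (beta + 1)) / np.sum(r ** beta))
```

With beta = 8, `complex_unit([0.1, 0.9, 0.2])` is already within a few percent of the hard max 0.9.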

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis, Pauls & Poggio 1995

Gaussian tuning in V1 for orientation

Hubel & Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Prieber & Ferster 2007

Gawne & Martin 2002

Max-like behavior in V4

(Knoblich, Koch, Poggio in prep; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys. implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al., Nature Neuroscience 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides unique representation
– Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
– Supervised learning ~ Gaussian RBF
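The supervised, task-specific stage can be pictured as Gaussian RBF regression on top of the learned features. Below is a generic regularized RBF readout, a textbook sketch under my own parameter choices rather than the tutorial's code, trained on a toy labeling that no linear readout of the raw inputs could fit:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between two sets of feature vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rbf_train(X, y, sigma=1.0, lam=1e-3):
    """Fit f(x) = sum_i c_i exp(-|x - x_i|^2 / 2 sigma^2) by
    regularized least squares, with centers at the training examples."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def rbf_predict(X_train, c, X, sigma=1.0):
    """Evaluate the learned RBF expansion at new points X."""
    return rbf_kernel(X, X_train, sigma) @ c

# toy demo: an XOR-like labeling of four feature vectors
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
c = rbf_train(X, y)
pred = rbf_predict(X, c, X)
```

Regularized least squares with a Gaussian kernel is the standard form of the RBF networks invoked here; the remark later in the deck that this stage "is known (from learning theory) to generalize well" refers to exactly this class of learners.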

S2 units

• Features of moderate complexity (n ~1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

[Figure labels: stronger facilitation / stronger suppression]

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen. Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.
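A toy rendering of the S2/C2 relationship (array shapes, names and parameters here are mine; the real model uses templates learned from natural images over C1 afferents at many scales): S2 is Gaussian template matching replicated across positions, and C2 takes a max over positions and scales of the S2 maps sharing one template.

```python
import numpy as np

def s2_map(c1, template, sigma=0.5):
    """S2: Gaussian template matching of a small patch of C1 units
    (orientations x height x width), replicated at every position."""
    n_ori, H, W = c1.shape
    _, h, w = template.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            diff = c1[:, i:i + h, j:j + w] - template
            out[i, j] = np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2))
    return out

def c2_response(s2_maps):
    """C2: max over all positions and scales of the S2 maps that share
    one template -> same selectivity, larger tolerance."""
    return max(float(m.max()) for m in s2_maps)

# demo: a template cut out of the C1 stack is detected at its location,
# and the C2 response is unchanged when pooled with a weaker copy
rng = np.random.default_rng(0)
c1 = rng.random((4, 8, 8))          # 4 orientations, 8x8 positions
template = c1[:, 2:5, 3:6].copy()   # 3x3 patch across orientations
m = s2_map(c1, template)
```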

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)
• “Replication” of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

• V1
  • Simple and complex cells tuning (Schiller et al 1976; Hubel & Wiesel 1965; DeValois et al 1982)
  • MAX operation in subset of complex cells (Lampl et al 2004)

• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al 2002)
  • Two-spot interaction (Freiwald et al 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)

• IT
  • Tuning and invariance properties (Logothetis et al 1995)
  • Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)

• Human
  • Rapid categorization (Serre, Oliva, Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Comparison w/ neural data

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w/ V4

Pasupathy & Connor 2001

Tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations

ρ = 0.78

Most similar model C2 unit

Pasupathy & Connor 1999

No parameter fitting

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4; explain figures etc… Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio. J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007.

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field)

(Reynolds et al 1999)

C2 units

Reference (fixed); Probe (varying)

[Plot axes: x = response(probe) - response(reference); y = resp(pair) - resp(reference)]

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away the line should pass through the origin, with a > 0 coefficient.

Agreement w/ IT Readout data

Hung, Kreiman, Poggio, DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in this study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

• In the theory, the stage between IT and “PFC” is a linear classifier, like the one used in the read-out experiments

• The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

[Paradigm: image shown for 20 ms; 30 ms image-mask interval (ISI); mask (1/f noise) shown for 80 ms]

Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

human observers (n = 24): 80%
Model: 82%

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Human: 80%

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job.
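d′ combines hit and false-alarm rates into a bias-free sensitivity measure: d′ = z(hit rate) − z(false-alarm rate), with z the inverse normal CDF. A minimal stdlib version (the 0.84/0.16 rates below are illustrative numbers of my own, not data from the study):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: z(hits) - z(false alarms).
    Higher d' means better discrimination, independent of response
    bias, which is why it is preferred over raw percent correct."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# e.g. 84% hits with 16% false alarms gives d' just under 2
sensitivity = d_prime(0.84, 0.16)
```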

Further comparisons

• Image-by-image correlation
– Heads: ρ = 0.71
– Close-body: ρ = 0.84
– Medium-body: ρ = 0.71
– Far-body: ρ = 0.60

• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007
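The image-by-image ρ values above are ordinary correlation coefficients between per-image human and model accuracies. For reference, a self-contained Pearson r (my own helper, not the paper's analysis code):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists,
    e.g. per-image human accuracy vs per-image model accuracy."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```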

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse it into its main objects. What they did is use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples  5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter
Presentation Notes
TODO: Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al 2004)
• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
– Unsupervised learning during a developmental-like stage: learning dictionaries of “templates” at different S levels

• Task-specific circuits (from IT to PFC)
– Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage
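The temporal-continuity idea (Foldiak 1991 and the later work cited above) can be caricatured with a trace rule: the unit keeps a slowly decaying trace of its own activity, so afferents active in successive frames, e.g. the same oriented feature at neighboring positions as a stimulus moves, get wired onto the same complex unit. A toy version (parameters and setup are mine, not the Masquelier & Thorpe implementation):

```python
import numpy as np

def trace_rule(frames, w, eta=0.05, delta=0.8):
    """Foldiak-style trace rule: dw ~ eta * ybar * (x - w), where ybar
    is a leaky temporal average of the unit's activity y = w.x."""
    ybar = 0.0
    for x in frames:                     # sequence of afferent vectors
        y = float(w @ x)
        ybar = delta * ybar + (1.0 - delta) * y
        w = w + eta * ybar * (x - w)     # pull weights toward recent inputs
        w = w / np.linalg.norm(w)        # keep the weight vector normalized
    return w

# demo: afferents 0 and 1 alternate in time (a "moving" feature),
# afferent 2 never appears, so the unit unlearns it
e0, e1 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
w = trace_rule([e0, e1] * 50, np.ones(3) / np.sqrt(3.0))
```

After training, the unit responds to the feature at either position (w[0] and w[1] both large) but not to the temporally uncorrelated afferent, which is the sense in which complex cells "learn correlation in time".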

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem

Training videos / testing videos; each video ~4 s, 50~100 frames

bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al 2005)

See also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz Human    86.7       96.3
Average       81.2       89.6

chance: 10~20%

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models, limitations

Serre, Oliva, Poggio 2007
Zoccolan, Kouh, Poggio, DiCarlo 2007
Reynolds, Chelazzi & Desimone 1999

Psychophysics; V4; IT

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a “novel” justification for the need of attention

model

[Conditions: 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, model not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and the model ~18%.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al)

Chikkerur, Serre, Walther, Koch & Poggio, in prep

Face detection: scanning vs attention

[Plot: detection accuracy vs number of items]

Example Results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
-- Which biophysical mechanisms for routing/gating?
-- Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 20: NIPS2007: visual recognition in primates and machines

A "View-Tuned" IT Cell

[Figure: responses (spikes/sec) of a view-tuned IT neuron to target views and distractors, for rotations from −168° to +168° in 12° steps; 800 msec presentations.]

Logothetis, Pauls & Poggio 1995

Presenter notes: View-tuned cell found in the experiment; shows preferred-stimulus-specific viewpoint tuning.

But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)

Logothetis, Pauls & Poggio 1995

Scale Invariant Responses of an IT Neuron

[Figure: peristimulus histograms (spikes/sec, 0–3000 msec) for the preferred object shown at eight sizes from 1.0° to 6.25°, i.e. ×0.4 to ×2.5 of the training size.]

View-tuned cells: scale invariance (one training view only) motivates present model

Logothetis, Pauls & Poggio 1995

Presenter notes: pretty amazing.

From "HMAX" to the model now…

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

INPUT = pulses (spikes) or graded potentials
COMPUTATION = analog
OUTPUT = chemical

Some numbers

• Human brain
  – 10^11–10^12 neurons (flies: ~1 million)
  – 10^14–10^15 synapses
• Neuron
  – Fundamental space dimensions: fine dendrites ~0.1 μm diameter; lipid-bilayer membrane ~5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – Fundamental time scale: ~1 msec

The cerebral cortex

                        Human                       Macaque
Thickness               3–4 mm                      1–2 mm
Total surface area      ~1600 cm² (both sides;      ~160 cm²
                        ~50 cm diam.)               (~15 cm diam.)
Neurons per mm²         ~10⁵                        ~10⁵
Total cortical neurons  ~2 × 10¹⁰                   ~2 × 10⁹
Visual cortex           300–500 cm²                 80+ cm²
Visual neurons          ~4 × 10⁹                    ~10⁹

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision.

Presenter notes: In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too): vision, audition, somatosensory, motor… Learning & adaptation everywhere. Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? Common hardware: CORTEX.

The Visual System

[Van Essen & Anderson 1990]

Presenter notes: In our studies of learning & information processing in the brain we focus on the visual system, the most extensively researched modality; areas, connected.

V1: hierarchy of simple and complex cells

LGN-type cells → Simple cells → Complex cells

(Hubel & Wiesel 1959)

Presenter notes: LGN-type cells: a dot of light, a bright dot on a dark background or vice versa. From dots of light to "oriented bars": simple cells have slightly larger RFs, respond to a white bar on a black background or vice versa, and are sensitive to the location of the bar within the RF. Complex cells are bar detectors insensitive to the polarity of contrast (white on black or black on white), with a larger RF, insensitive to the location of the bar within the RF. How does that work?

V1: Orientation selectivity

Hubel & Wiesel movie

V1: Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Beyond V1: A gradual increase in RF size

Reproduced from [Kobatake & Tanaka 1994] and [Rolls 2004]

Beyond V1: A gradual increase in the complexity of the preferred stimulus

Reproduced from (Kobatake & Tanaka 1994)

AIT: Face cells

Reproduced from (Desimone et al. 1984)

AIT: Immediate recognition (identification and categorization)

Hung, Kreiman, Poggio & DiCarlo 2005

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Pooling       Computation         Operation
Simple      selectivity   template matching   Gaussian-like tuning (AND-like)
Complex     invariance    soft-max            max-like (OR-like)

Simple units: Gaussian-like tuning operation (AND-like).
Complex units: max-like operation (OR-like).
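The two operations can be sketched in a few lines of Python. This is a toy illustration of the two unit types, not the actual model code; the function names and parameter values are ours:

```python
import math

def simple_unit(inputs, template, sigma=1.0):
    """Gaussian-like tuning (AND-like): maximal response when the
    input pattern matches the stored template."""
    d2 = sum((x - w) ** 2 for x, w in zip(inputs, template))
    return math.exp(-d2 / (2 * sigma ** 2))

def complex_unit(afferents):
    """Max-like pooling (OR-like): respond like the strongest
    afferent, gaining invariance over whatever the pool spans."""
    return max(afferents)

template = [0.9, 0.1, 0.4]
# A simple unit fires maximally for its preferred pattern...
assert simple_unit(template, template) == 1.0
# ...and a complex unit pooling simple units tuned to shifted copies
# keeps the selectivity while tolerating the shift.
shifted = [[0.9, 0.1, 0.4], [0.4, 0.9, 0.1]]
print(complex_unit([simple_unit(s, template) for s in shifted]))  # 1.0
```

Stacking these two steps in alternation (S1, C1, S2, C2, …) is the model's basic recipe for building selectivity and invariance together.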

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
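The "same canonical circuit" claim can be illustrated with the static nonlinearity analyzed by Kouh & Poggio (2007): a weighted sum divided by a normalization pool, the steady-state effect of shunting inhibition. A sketch, with exponents chosen by us for illustration:

```python
def canonical_circuit(inputs, weights, p, q, r, k=1e-6):
    """R = sum(w_j x_j^p) / (k + (sum_j x_j^q)^r): divisive
    normalization, as produced by shunting inhibition."""
    num = sum(w * x ** p for w, x in zip(weights, inputs))
    den = k + sum(x ** q for x in inputs) ** r
    return num / den

x = [0.2, 0.9, 0.3]
ones = [1.0, 1.0, 1.0]
# Low exponents (p=1, q=2, r=0.5): a normalized dot product,
# i.e. Gaussian-like tuning to the weight vector.
tuning = canonical_circuit(x, ones, p=1, q=2, r=0.5)
# High exponents (p = q + 1): a soft-max that approaches max(x).
softmax = canonical_circuit(x, ones, p=5, q=4, r=1)
print(round(softmax, 2))  # close to max(x) = 0.9
```

One circuit, two regimes: sliding the exponents moves the output continuously between tuning and max.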

• Generic, overcomplete dictionary of reusable shape components (from V1 to IT) provides unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5–10 subunits chosen at random from all possible afferents (~100–1000)

[Figure: stronger facilitation vs. stronger suppression]
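The unsupervised learning of S2 weights can be caricatured as "imprinting": store randomly sampled subsets of C1 responses to natural images as template weights. A toy sketch; the sizes, names, and sampling scheme here are ours, for illustration only:

```python
import random

def imprint_templates(c1_maps, n_templates=5, n_subunits=3, seed=0):
    """Unsupervised S2 learning: pick a C1 response map, sample a few
    afferents at random (5-10 in the model), and store their
    activities as the template weights."""
    rng = random.Random(seed)
    templates = []
    for _ in range(n_templates):
        c1 = rng.choice(c1_maps)                      # one "natural image"
        idx = rng.sample(range(len(c1)), n_subunits)  # random afferents
        templates.append({i: c1[i] for i in idx})
    return templates

# Fake C1 responses to two "images", 8 afferents each
rng = random.Random(42)
c1_maps = [[rng.random() for _ in range(8)] for _ in range(2)]
templates = imprint_templates(c1_maps)
print(len(templates), len(templates[0]))  # 5 3
```

Each stored dictionary then serves as the preferred pattern of one S2 unit; selectivity comes for free from whatever statistics the natural images contained.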

Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter notes: Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units that project onto a C2 unit all share the same selectivity, which is also the selectivity of that C2 unit; so the only difference between S2 and C2 units is the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter notes: In an even more extreme version… Later I am going to show that this is important.

Comparison w/ neural data

• V1
  – Simple and complex cells: tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  – MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – Two-spot interaction (Freiwald et al. 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  – Tuning and invariance properties (Logothetis et al. 1995)
  – Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva, Poggio 2007)
  – Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Three colors mark three types of data. Yellow: data used to constrain the model (V1 data used to build a population of simple and complex units that mimics the tuning properties of cortical cells). Green: data already predicted by the original model and still consistent with the new model. Red: data consistent with the new model; in some cases these already existed and were provided by the authors of the corresponding studies, in others (e.g., with Miller, DiCarlo) they were collected in collaboration with experimentalists following a prediction from the model. I will show a few key examples rather than walk through all of these data.

Comparison w/ V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations. V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78, with no parameter fitting. (Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: The learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J Neurophysiol 98: 1733–1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter notes: More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned with a greedy algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference stimulus (fixed), probe (varying).
Plotted: resp(pair) − resp(reference) against resp(probe) − resp(reference).

Prediction: the response to the pair is predicted to fall between the responses elicited by the two stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: The biased-competition model predicts that with attention away the line should pass through the origin with a coefficient > 0.

Agreement w/ IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes: Here is an example: the read-out data shown earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: from the very first spikes after the visual signal reaches IT (where they recorded), one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
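Since the IT-to-"PFC" stage is taken to be a linear classifier, like the one used in the read-out experiments, it can be sketched as a perceptron on population vectors. The data here are made up for illustration:

```python
def train_perceptron(data, labels, epochs=20, lr=0.1):
    """Linear read-out: learn w, b so that sign(w.x + b) = label."""
    w, b = [0.0] * len(data[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):  # y in {-1, +1}
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy "IT population vectors" for two object categories
data = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
labels = [1, 1, -1, -1]
w, b = train_perceptron(data, labels)
print([predict(w, b, x) for x in data])  # [1, 1, -1, -1]
```

The point of the remark is exactly this simplicity: once the dictionary feeding IT is selective and invariant enough, a plain linear read-out suffices for the task.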

Rapid categorization

[Animal / non-animal movie]

Database collected by Oliva & Torralba

Presenter notes: Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image (20 ms) → image-mask interval (ISI 30 ms) → mask (1/f noise, 80 ms)

Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Macé et al. 2005

Presenter notes: We used a 30 ms ISI with a 20 ms presentation; this is equivalent to an SOA of 50 ms, a fairly long SOA. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter notes: We tried simpler computer-vision systems and they could not do the job.
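d′ pairs the hit rate with the false-alarm rate through the inverse normal CDF, giving a criterion-free sensitivity measure: d′ = z(H) − z(FA). A quick sketch with Python's stdlib; the rates below are illustrative, roughly the regime of the animal/non-animal task:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index d' = z(H) - z(FA); higher means better
    discrimination, independent of the observer's response bias."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Illustrative: ~80% hits with ~16% false alarms
print(round(d_prime(0.80, 0.16), 2))  # 1.84
```

Because it removes the response-bias component, d′ lets human observers and the model be compared on the same footing even if their criteria differ.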

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project

Source: Bileschi & Wolf

Presenter notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf (Lior was a postdoc in the lab, and this was part of Stan's PhD thesis). This is the old AI dream: input an image such as the street scene on the left, and have the system automatically parse it into its main objects. They took the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples  5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Examples

[Example scene-parsing results]

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

Földiák 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhäuser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlations in space (at the same time); complex cells learn correlations in time.

[Movie]

Presenter notes: ~19 hours of video, ~10^6 frames; a developmental-like learning stage.
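A standard formalization of "complex cells learn correlations in time" is Földiák's (1991) trace rule: Hebbian learning gated by a temporally smoothed (trace) activity, so inputs that follow one another in time, like shifted views of one object, wire onto the same unit. A minimal sketch; the parameters and the normalization scheme are ours:

```python
def trace_rule(frames, eta=0.5, decay=0.5):
    """w += eta * trace * x, where trace is a leaky average of the
    unit's past responses y; weights renormalized to stay bounded."""
    n = len(frames[0])
    w = [1.0 / n] * n
    trace = 0.0
    for x in frames:
        y = sum(wi * xi for wi, xi in zip(w, x))  # response to this frame
        trace = decay * trace + (1 - decay) * y   # temporal smoothing
        w = [wi + eta * trace * xi for wi, xi in zip(w, x)]
        s = sum(w)
        w = [wi / s for wi in w]
    return w

# A bar sweeping across three positions over consecutive frames:
frames = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 5
w = trace_rule(frames)
print([round(wi, 2) for wi in w])  # all positions keep nonzero weight
```

Because the trace is still high when the bar arrives at the next position, the unit binds the successive positions together, which is exactly the position invariance a complex cell needs.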


The problem

Training videos and testing videos (each video ~4 s, 50–100 frames): bend, jack, jump, pjump, run, walk, side, wave1, wave2.

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset       Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10–20%)  H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter notes: Our method has 8% better performance on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva, Poggio 2007. IT: Zoccolan, Kouh, Poggio, DiCarlo 2007. V4: Reynolds, Chelazzi & Desimone 1999.

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and a no-mask condition. (Serre, Oliva and Poggio, PNAS 2007)

Presenter notes: False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Example Results

Face detection: scanning vs. attention [plot: detection accuracy vs. number of items]

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• A feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – What is the nature of the routines in higher areas (e.g., PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models, e.g., probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003.
The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 21: NIPS2007: visual recognition in primates and machines

But also view-invariant object-specific neurons (5 of them over 1000 recordings)

Logothetis

Pauls

amp Poggio 1995

Scale Invariant Responses of an IT Neuron

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Sik

0

76

0 2000 3000Time (msec)

1000

Sik

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

Spi

kes

sec

0

76

0 2000 3000Time (msec)

1000

ik

0

76

40 deg(x 16)

475 deg(x 19)

55 deg(x 22)

625 deg(x 25)

25 deg(x 10)

10 deg(x 04)

175 deg(x 07)

325 deg(x 13)

View-tuned cells scale invariance (one training view only) motivates present model

Logothetis

Pauls

amp Poggio 1995

Presenter
Presentation Notes
pretty amazing

From ldquoHMAXrdquo

to the model nowhellip

Riesenhuber amp Poggio 1999 2000 Serre Kouh

Cadieu

Knoblich

Kreiman amp Poggio 2005 Serre Oliva Poggio 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Neural Circuits

Source Modified from Jody Culhamrsquos

web slides

Neuron basics

spikes

INPUT= pulses or graded potentials

COMPUTATION = Analog

OUTPUT = Chemical

Some numbers

bull

Human Brainndash

1011-1012

neurons (1 million flies )

ndash

1014- 1015

synapses

bull

Neuronndash

Fundamental space dimension

bull

fine dendrites 01 micro

diameter lipid bilayer

membrane 5 nm thick specific proteins pumps channels receptors enzymes

ndash

Fundamental time length 1 msec

The cerebral cortex

Human

MacaqueThickness

3 ndash 4 mm

1 ndash 2 mmTotal surface area

~1600 cm2 ~160 cm2

(both sides)

(~50cm diam)

(~15cm diam)

Neurons mmsup2

~10⁵ mm2 ~ 10⁵ mm2 Total cortical neurons

~2 x 1010

~2 x 109

Visual cortex

300 ndash

500 cm2

80+cm2Visual Neurons

~4 x 109 ~109 neurons

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In brain different regions are involved in different tasks13(and the brain areas are conveniently colored to reflect this toohellip)13vision audition somsens motorhellip Learning amp adaptation everywhere13---gt Q What are the compmech13Why can we hope that there are common computational mech13---gt common hardware CORTEX

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning amp informations processing in the brain we focus on the visual system --- most extensively researched modality1313areas connected

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel amp Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells dot of light bright dot on a dark bgd and vice-versa1313From dots of light to ldquooriented barsrdquo slightly larger RFs either white bar on black bgd or vice-versa sensitive to the location of the bar within RF1313Bar detector insensitive to polarity of contrast (white on black or black on white)13Larger RF13Insensitive to the location of the bar within RF1313How does that work

V1 Orientation selectivity

Hubel amp Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake amp Tanaka 1994] Reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake

amp Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung Kreiman Poggio amp DiCarlo 2005

identification

categorization

See also Oram

amp Perrett

1992 Tovee

et al 1993 Celebrini

et al 1993 Ringach

et al 1997 Rolls et al 1999 Keysers

et al 2001

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit type | Computation | Pooling operation
Simple | Selectivity / template matching | Gaussian-like tuning (AND-like)
Complex | Invariance | Soft-max (OR-like)

[Figure: complex units perform a max-like operation (OR-like) over simple units, which perform a Gaussian-like tuning operation (AND-like)]
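The two key computations above can be sketched in a few lines of numpy. This is an illustrative sketch, not the actual CBCL implementation; the tuning width sigma and soft-max exponent q are arbitrary illustrative values.

```python
import numpy as np

def simple_unit(x, template, sigma=1.0):
    """Selectivity: Gaussian-like tuning (AND-like) -- the response
    peaks when the afferent pattern x matches the stored template."""
    return np.exp(-np.sum((x - template) ** 2) / (2 * sigma ** 2))

def complex_unit(x, q=8.0):
    """Invariance: soft-max pooling (OR-like) -- approaches max(x)
    as q grows, so the unit responds whenever any afferent does."""
    x = np.asarray(x, dtype=float)
    return np.sum(x ** (q + 1)) / np.sum(x ** q)

template = np.array([1.0, 0.0, 1.0])
print(simple_unit(template, template))   # 1.0: perfect template match
print(complex_unit([0.2, 0.9, 0.4]))     # ~0.9: tracks the strongest afferent
```

The soft-max reduces to a sum for q = 0 and to a hard max as q grows, which is one way to interpolate between the "average" and "max" pooling regimes discussed later.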

Gaussian tuning

Gaussian tuning in IT around 3D views: Logothetis, Pauls & Poggio 1995
Gaussian tuning in V1 for orientation: Hubel & Wiesel 1959

Max-like operation

Max-like behavior in V1: Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007
Max-like behavior in V4: Gawne & Martin 2002
(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007).
Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al 2000; Miller and Troyer 2002).
Adelson and Bergen (see also Hassenstein and Reichardt 1956).
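The canonical-circuit idea can be sketched as a single divisive-normalization form, following Kouh & Poggio: y = sum(w·x^p) / (k + sum(x^q)^r), where different exponents approximate either operation. The exponents and inputs below are illustrative, not fitted values.

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-9):
    """Divisive-normalization circuit (shunting-inhibition style):
    y = sum(w * x**p) / (k + sum(x**q)**r).
    Different exponents approximate the two key operations."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(w, x ** p) / (k + np.sum(x ** q) ** r))

x = np.array([0.2, 0.9, 0.4])

# max-like regime: p = q + 1, uniform weights -> output biased toward max(x)
print(canonical_circuit(x, np.ones(3), p=3, q=2, r=1))

# tuning-like regime: p = 1, q = 2, r = 0.5 with template weights ->
# a normalized dot product, largest when x is parallel to the template
w = x / np.linalg.norm(x)
print(canonical_circuit(x, w, p=1, q=2, r=0.5))  # ~1.0 at the template
```

The point of the slide is that one biophysically plausible circuit, with small parameter changes, yields both the AND-like tuning and the OR-like max.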

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1,000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5–10 subunits chosen at random from all possible afferents (~100–1,000)
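An S2 unit of this kind can be sketched as Gaussian tuning of a patch of C1 afferents to a template "imprinted" from an image during the developmental-like stage. The array sizes and sigma here are toy illustrative values, not the model's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def s2_response(c1_patch, template, sigma=1.0):
    """S2 unit: Gaussian-like tuning over a patch of C1 afferents
    (several orientations x positions) to a stored template."""
    d2 = np.sum((c1_patch - template) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

# toy C1 layer: 4 orientations x 8 x 8 positions
c1 = rng.random((4, 8, 8))

# "imprint" a template from one 3x3 patch of C1 responses
template = c1[:, 2:5, 2:5].copy()

print(s2_response(c1[:, 2:5, 2:5], template))  # 1.0 at the imprinted patch
print(s2_response(c1[:, 4:7, 4:7], template))  # weaker on a different patch
```

Repeating the imprinting over many natural-image patches is what builds the "generic dictionary" of the previous slide.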

[Figure: V2 interaction maps, ranging from stronger facilitation to stronger suppression]

Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007 | doi:10.1038/nn.1975
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units, but increased tolerance to the position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project; so the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version… Later I am going to show that this is important.

Comparison with neural data
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

• V1
  • Simple and complex cells: tuning (Schiller et al 1976; Hubel & Wiesel 1965; De Valois et al 1982)
  • MAX operation in a subset of complex cells (Lampl et al 2004)
• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al 2002)
  • Two-spot interaction (Freiwald et al 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)
• IT
  • Tuning and invariance properties (Logothetis et al 1995)
  • Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  • Rapid categorization (Serre, Oliva, Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data I used to actually constrain the model: here, V1 data used to build a population of V1 simple and complex units which mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red color indicates data consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and data which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison with V4
Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)
V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (no parameter fitting)
Pasupathy & Connor 1999
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally, I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures, etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J. Neurophysiol. 98: 1733–1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al 1999) vs. C2 units
Reference stimulus (fixed); probe stimulus (varying)
[Figure axes: x = response(probe) − response(reference); y = resp(pair) − resp(reference)]
Prediction: the response to the pair is predicted to fall between the responses elicited by each stimulus alone
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin and we should get a coefficient > 0.

Agreement with IT read-out data
Hung, Kreiman, Poggio & DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4–PIT)–AIT–PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
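The IT-to-"PFC" stage as a linear classifier can be sketched on synthetic data; the feature counts, labels and ridge parameter below are illustrative and are not the read-out experiment's actual data.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic "IT population": 200 presentations x 50 model features
n, d = 200, 50
X = rng.normal(size=(n, d))
y = np.sign(X[:, :5].sum(axis=1))   # category carried by a few features
y[y == 0] = 1.0

# ridge-regularized least-squares linear classifier (the read-out stage)
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

acc = float(np.mean(np.sign(X @ w) == y))
print(f"training accuracy: {acc:.2f}")
```

The point is that once the dictionary of selective, invariant features is rich enough, a simple linear read-out suffices for categorization and identification.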

Rapid categorization

[Show animal / non-animal movie]
Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?
Image: 20 ms → interval image–mask (ISI): 30 ms → mask (1/f noise): 80 ms
Thorpe et al 1996; VanRullen & Koch 2003; Bacon-Macé et al 2005

Presenter
Presentation Notes
We used a 30 ms ISI with a 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…
Human observers (n = 24): 80% correct; model: 82%
Serre, Oliva & Poggio 2007
• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
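d′ combines hit and false-alarm rates into a bias-free sensitivity measure; a stdlib sketch (the 80%/16% rates echo the hit and false-alarm figures mentioned in the notes, and are used here purely as an illustration):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index d' = z(H) - z(FA), where z is the inverse
    standard-normal CDF; higher d' means better discrimination,
    independent of response bias."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

print(round(d_prime(0.80, 0.16), 2))  # 1.84
```

Comparing d′ rather than raw percent correct removes differences in response bias between humans and the model.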

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts the level of performance on rotated images (90 deg and inversion)
Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf
The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is the old AI dream: input an image such as the street scene on the left, and the system automatically parses it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5,799  1,449       209      5,067     4,932  3,400  2,562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.
http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter
Presentation Notes
TODO: Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al 2004)
• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?
• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?
• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

with T. Masquelier & S. Thorpe (CNRS, France)
Földiák 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhäuser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005
Simple cells learn correlations in space (at the same time); complex cells learn correlations in time.
[Show movie]

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; a developmental-like learning stage.
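One classic way to learn invariance from temporal continuity is a Földiák-style trace rule, where a slowly decaying activity trace ties together inputs that occur in temporal succession. The sketch below uses a toy one-hot "movie" and illustrative learning-rate and decay values; it is not the Masquelier & Thorpe implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.random(16) * 0.1        # afferent weights of one "complex" unit
trace, decay, lr = 0.0, 0.8, 0.05

# a toy "movie": the same pattern sweeping across 16 positions
pattern = np.zeros(16)
pattern[0] = 1.0

for t in range(16):
    x_t = np.roll(pattern, t)                  # stimulus at time t
    y = float(w @ x_t)                         # unit response
    trace = decay * trace + (1 - decay) * y    # low-passed activity trace
    w = w + lr * trace * (x_t - w)             # trace rule: move weights
                                               # toward inputs seen while
                                               # the unit was recently active
print(w.round(3))
```

Because the trace outlives any single frame, the unit strengthens connections to all positions the pattern sweeps through, which is exactly the position invariance a complex cell needs.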

What is next?
• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos and testing videos; each video ~4 s, 50–100 frames
Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2
Dataset from (Blank et al 2005); see also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (%)

Dataset      Baseline  Our system
KTH Human    81.3      91.6
UCSD Mice    75.6      79.0
Weiz. Human  86.7      96.3
Average      81.2      89.6

(chance: 10–20%)
H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method performs ~8% better on average.

What is next? Beyond feedforward models: limitations
Psychophysics: Serre, Oliva, Poggio 2007
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next? Beyond feedforward models: limitations
• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Figure: model vs. human performance across masking conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and a no-mask condition]
(Serre, Oliva and Poggio, PNAS 2007)
Limitations: beyond 50 ms, the model is not good enough.

Presenter
Presentation Notes
FA ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work: attention and cortical feedbacks
Model implementation of Wolfe's guided search (1994)
Parallel (feature-based, top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al)
Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention
[Figure: detection accuracy vs. number of items]
Example results

What is next?
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms
• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines
Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values
Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003
The Mathematics of Learning: Dealing with Data
Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory
Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007
From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…
Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison with humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe


Scale Invariant Responses of an IT Neuron
[Figure: spike-rate histograms (spikes/sec, 0–76; time 0–3000 msec) for the preferred object at sizes 1.0, 1.75, 2.5, 3.25, 4.0, 4.75, 5.5 and 6.25 deg (×0.4 to ×2.5 of the training size)]

View-tuned cells: scale invariance (one training view only) motivates the present model
Logothetis, Pauls & Poggio 1995

Presenter
Presentation Notes
pretty amazing

From "HMAX" to the model now…
Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

spikes

INPUT= pulses or graded potentials

COMPUTATION = Analog

OUTPUT = Chemical

Some numbers

• Human brain
  – 10^11–10^12 neurons (~1 million in flies!)
  – 10^14–10^15 synapses
• Neuron
  – Fundamental space dimensions: fine dendrites ~0.1 μm diameter; lipid-bilayer membrane ~5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – Fundamental time length: 1 msec

The cerebral cortex

Human

MacaqueThickness

3 ndash 4 mm

1 ndash 2 mmTotal surface area

~1600 cm2 ~160 cm2

(both sides)

(~50cm diam)

(~15cm diam)

Neurons mmsup2

~10⁵ mm2 ~ 10⁵ mm2 Total cortical neurons

~2 x 1010

~2 x 109

Visual cortex

300 ndash

500 cm2

80+cm2Visual Neurons

~4 x 109 ~109 neurons

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision.

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning and adaptation everywhere. Q: what are the computational mechanisms? Why can we hope that there are common computational mechanisms? Common hardware: CORTEX.

The Visual System

[Van Essen & Anderson 1990]

Presenter
Presentation Notes
In our studies of learning and information processing in the brain we focus on the visual system, the most extensively researched modality. The areas are interconnected.

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells dot of light bright dot on a dark bgd and vice-versa1313From dots of light to ldquooriented barsrdquo slightly larger RFs either white bar on black bgd or vice-versa sensitive to the location of the bar within RF1313Bar detector insensitive to polarity of contrast (white on black or black on white)13Larger RF13Insensitive to the location of the bar within RF1313How does that work

V1 Orientation selectivity

Hubel amp Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake amp Tanaka 1994] Reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake

amp Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung Kreiman Poggio amp DiCarlo 2005

identification

categorization

See also Oram

amp Perrett

1992 Tovee

et al 1993 Celebrini

et al 1993 Ringach

et al 1997 Rolls et al 1999 Keysers

et al 2001

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison with V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

A V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (Pasupathy & Connor 1999). No parameter fitting.

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes: Finally, I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J. Neurophysiol. 98: 1733-1750, 2007; first published June 27, 2007.
"A Model of V4 Shape Selectivity and Invariance." Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio.

Presenter Notes: More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but are learned using a greedy algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. model C2 units

Stimuli: a fixed reference and a varying probe.
x-axis: response(probe) - response(reference)
y-axis: response(pair) - response(reference)

Prediction: the response to the pair falls between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: The biased-competition model predicts that with attention directed away, the line should pass through the origin with a coefficient > 0.

Agreement with IT read-out data

Hung, Kreiman, Poggio & DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes: Here is an example: the read-out data shown earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (where the recordings were made), object category and identity can already be read out from the very first spikes.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
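The "linear classifier on IT-like features" remark can be made concrete with a toy read-out: train a regularized linear classifier on a synthetic population of feature responses and check that the two categories separate. Everything here (population size, response statistics, the ridge parameter `lam`) is invented for illustration; it mirrors the style of read-out analyses, not the actual procedure of Hung et al.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "IT-like" population: 64 units, two object categories whose mean
# responses differ slightly (all numbers are illustrative, not from data)
X = np.vstack([rng.normal(0.0, 1.0, (100, 64)),
               rng.normal(0.5, 1.0, (100, 64))])
y = np.repeat([-1.0, 1.0], 100)

# ridge-regularized least-squares linear read-out
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(64), X.T @ y)

accuracy = float(np.mean(np.sign(X @ w) == y))
```

Even this crude linear read-out separates the two categories well, which is the point of the remark: once the feature dictionary is selective and invariant, the final classification stage can be very simple.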

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter Notes: Any task can be made arbitrarily easy or difficult.

Rapid categorization task (with mask to test the feedforward model)

Animal present or not?

Sequence: image (20 ms) → blank interval (ISI, 30 ms) → mask (1/f noise, 80 ms)

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Macé et al. 2005

Presenter Notes: We used a 30 ms ISI with a 20 ms presentation, equivalent to an SOA of 50 ms, which is fairly long. The model is feedforward (you can ask me about this during question time); the mask tries to block feedback in visual cortex.

…solves the problem (when the mask forces feedforward processing)…

Human observers (n = 24): 80%
Model: 82%

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter Notes: We tried simpler computer vision systems and they could not do the job.
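The d′ mentioned on the slide is the standard signal-detection sensitivity index: the z-transformed hit rate minus the z-transformed false-alarm rate. A minimal computation; the 80% / 16% figures below only echo the approximate human numbers quoted in the notes, not exact reported values.

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index: z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# roughly the human animal/non-animal numbers: ~80% hits, ~16% false alarms
print(round(d_prime(0.80, 0.16), 2))  # higher d' means better performance
```

Unlike raw percent correct, d′ separates sensitivity from response bias, which is why it is the preferred measure when comparing humans and the model across masking conditions.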

Further comparisons

• Image-by-image correlation between model and human responses:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• The model predicts the level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project
Source: Bileschi & Wolf

Presenter Notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is the old AI dream: input an image such as the street scene on the left and have the system automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object           | car  | pedestrian | bicycle | building | tree | road | sky
Labeled examples | 5799 | 1449       | 209     | 5067     | 4932 | 3400 | 2562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter Notes: TODO: Add 420 True Labels to this slide.

Examples (four slides of labeled street-scene images)

Benchmark systems compared:
• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
with T. Masquelier & S. Thorpe (CNRS, France)

Földiák 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhäuser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlations in space (at the same time); complex cells learn correlations in time.

[SHOW MOVIE]

Presenter Notes: ~19 hours of video, ~10^6 frames; a developmental-like learning stage.
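The "complex cells learn correlations in time" idea is often formalized as a trace rule in the spirit of Földiák (1991): the Hebbian update is driven by a temporally low-pass-filtered activity trace, so inputs that follow each other in time get wired to the same unit. A minimal sketch under that interpretation; `eta` and `delta` are illustrative learning and trace constants, not values from the cited work.

```python
import numpy as np

def trace_rule(frames, eta=0.05, delta=0.8):
    """Hebbian learning with an activity trace (Foldiak-1991 style):
    the weight update follows a running average of the unit's output,
    linking inputs that occur close together in time."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=frames.shape[1])
    w /= np.linalg.norm(w)
    trace = 0.0
    for x in frames:                       # one frame of the sequence at a time
        y = float(w @ x)                   # unit output for this frame
        trace = delta * trace + (1 - delta) * y
        w += eta * trace * x               # trace-based Hebbian update
        w /= np.linalg.norm(w)             # keep the weight vector bounded
    return w

# toy sequence: the same pattern drifting across 8 inputs over time,
# repeated for 20 "viewings"
frames = np.roll(np.eye(8), 1, axis=0)
w = trace_rule(np.tile(frames, (20, 1)))
```

Because the trace bridges successive frames, the unit ends up responding to the whole drifting pattern rather than a single position, which is exactly the invariance the slide attributes to temporal continuity.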


The problem: training videos vs. testing videos (each video ~4 s, 50-100 frames)

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy:

Dataset     | Baseline | Our system
KTH Human   | 81.3%    | 91.6%
UCSD Mice   | 75.6%    | 79.0%
Weiz. Human | 86.7%    | 96.3%
Average     | 81.2%    | 89.6%

(chance: 10-20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter Notes: Our method performs 8% better on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva & Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), no-mask condition; model shown for comparison.

(Serre, Oliva and Poggio, PNAS 2007)

Presenter Notes: False-alarm rates are ~16% for humans (except ~14% with no mask) and ~18% for the model.

Ongoing work: attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection, scanning vs. attention: detection accuracy as a function of the number of items.

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• A feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
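At its core, the probabilistic-inference idea is a Bayes update: bottom-up evidence (a likelihood per hypothesis) is combined with a top-down prior and normalized. This toy example is only meant to illustrate the combination these models build on; the three-hypothesis setup and all numbers are invented.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])       # top-down expectation over 3 hypotheses
likelihood = np.array([0.1, 0.6, 0.3])  # bottom-up evidence P(image | hypothesis)

posterior = prior * likelihood          # Bayes: combine top-down and bottom-up
posterior /= posterior.sum()            # normalize to a probability distribution
```

Note that strong bottom-up evidence can override a weaker prior (here the second hypothesis wins despite the first having the larger prior), which is the interplay these analysis-by-synthesis accounts exploit.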

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

"How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples. A comparison with real brains offers another related challenge to learning theory. The 'learning algorithms' we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory? Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex."

"The Mathematics of Learning: Dealing with Data," Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: mathematical results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

T. Serre

Comparison with humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber


bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

with T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
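The rule on the slide, "complex cells learn correlation in time", can be sketched with a toy trace rule in the spirit of Foldiak (1991): the unit's low-pass-filtered activity (the trace) gates a Hebbian update, so afferents that fire in temporally adjacent frames end up wired to the same unit. The stimulus sequence and constants below are illustrative, not the parameters of the Masquelier & Thorpe work.

```python
def trace_rule(frames, eta=0.1, delta=0.5, n_inputs=4):
    """Toy trace rule: weights move toward inputs active while the trace is high."""
    w = [0.25] * n_inputs   # weights from simple-like afferents
    trace = 0.0             # low-pass-filtered activity of the complex-like unit
    for x in frames:
        y = sum(wi * xi for wi, xi in zip(w, x))
        trace = (1 - delta) * trace + delta * y
        # Hebbian update gated by the trace, with implicit weight decay
        w = [wi + eta * trace * (xi - wi) for wi, xi in zip(w, x)]
    return w

# A bar sweeping repeatedly across 4 positions: temporally adjacent frames
# activate different afferents, and the trace ties them to one unit.
sweep = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]] * 25
w = trace_rule(sweep)
```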

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos vs. testing videos (each video ~4 s, 50~100 frames)

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

                 Baseline   Our system
KTH (human)        81.3        91.6
UCSD (mice)        75.6        79.0
Weiz. (human)      86.7        96.3
Average            81.2        89.6
(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has ~8% better performance on average.

What is next: beyond feedforward models (limitations)

[Figure panels: psychophysics (Serre, Oliva, Poggio 2007); V4 (Reynolds, Chelazzi & Desimone 1999); IT (Zoccolan, Kouh, Poggio, DiCarlo 2007)]

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice this is a "novel" justification for the need of attention

[Plot: performance of model vs. humans at 20 ms SOA (ISI=0 ms), 50 ms SOA (ISI=30 ms), 80 ms SOA (ISI=60 ms), and in the no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, the model is not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention

[Plot: detection accuracy vs. number of items]

Example Results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream; neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)
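The Bayesian reading described above can be reduced to a toy computation: a higher area supplies a prior over hypotheses, a lower area supplies the likelihood of the bottom-up evidence under each hypothesis, and the two combine into a posterior. The hypotheses and numbers below are invented for illustration, not taken from any of the cited models.

```python
def posterior(prior, likelihood):
    """Combine top-down priors with bottom-up likelihoods via Bayes' rule."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

prior = {"animal": 0.5, "non-animal": 0.5}        # top-down expectation
likelihood = {"animal": 0.8, "non-animal": 0.2}   # P(bottom-up input | hypothesis)
post = posterior(prior, likelihood)               # posterior favors "animal"
```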

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How, then, do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison with humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison with neural data
  • Comparison with V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement with IT read-out data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models (limitations)
  • What is next: beyond feedforward models (limitations)
  • Limitations: beyond 50 ms, the model is not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Neural Circuits

Source: modified from Jody Culham's web slides

Neuron basics

spikes

INPUT = pulses or graded potentials

COMPUTATION = Analog

OUTPUT = Chemical

Some numbers

• Human brain
  – 10¹¹–10¹² neurons (1 million: flies!)
  – 10¹⁴–10¹⁵ synapses
• Neuron
  – Fundamental space dimension: fine dendrites 0.1 μm diameter; lipid bilayer membrane 5 nm thick; specific proteins: pumps, channels, receptors, enzymes
  – Fundamental time length: 1 msec

The cerebral cortex

                          Human                       Macaque
Thickness                 3–4 mm                      1–2 mm
Total surface area        ~1600 cm² (~50 cm diam)     ~160 cm² (~15 cm diam)
(both sides)
Neurons / mm²             ~10⁵                        ~10⁵
Total cortical neurons    ~2 × 10¹⁰                   ~2 × 10⁹
Visual cortex             300–500 cm²                 80+ cm²
Visual neurons            ~4 × 10⁹                    ~10⁹

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation everywhere. Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? Because of common hardware: CORTEX.

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system, the most extensively researched modality. Areas and connections.

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel amp Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: dot of light, bright dot on a dark background and vice-versa. From dots of light to "oriented bars": slightly larger RFs, either a white bar on a black background or vice-versa, sensitive to the location of the bar within the RF. Bar detector insensitive to polarity of contrast (white on black or black on white): larger RF, insensitive to the location of the bar within the RF. How does that work?

V1 Orientation selectivity

Hubel & Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994] and [Rolls 2004]

Beyond V1: A gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: A gradual increase in the complexity of the preferred stimulus

AIT: Face cells

Reproduced from (Desimone et al. 1984)

AIT: Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification / categorization

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006…)

• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Pooling        Computation         Operation
Simple      Selectivity    template matching   Gaussian tuning (and-like)
Complex     Invariance     soft-max            or-like

Simple units: Gaussian-like tuning operation (and-like)
Complex units: max-like operation (or-like)
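The two operations in the table can be written down directly. A minimal numerical sketch (the template, σ and the softmax exponent are illustrative choices, not the model's actual parameters):

```python
import math

def gaussian_tuning(x, w, sigma=1.0):
    """Simple-unit, and-like operation: Gaussian template matching.
    Maximal response when the input pattern x equals the stored template w."""
    d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return math.exp(-d2 / (2 * sigma ** 2))

def soft_max(inputs, p=8):
    """Complex-unit, or-like operation: softmax pooling that approaches the
    maximum of the afferents as the exponent p grows."""
    num = sum(x ** (p + 1) for x in inputs)
    den = sum(x ** p for x in inputs)
    return num / den if den else 0.0

template = [0.2, 0.9, 0.4]
match = gaussian_tuning(template, template)   # exact match gives 1.0
pooled = soft_max([0.1, 0.8, 0.3])            # close to the max afferent (0.8)
```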

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys implementation

• Max- and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
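A numerical sketch of the kind of canonical circuit referred to above, in a divisive-normalization form y = (Σ_j w_j x_j^p) / (k + (Σ_j x_j^q)^r) of the sort discussed by Kouh & Poggio: different exponent settings push one and the same circuit toward a max-like or a tuning-like regime. The exponent values below are illustrative only.

```python
def canonical(x, w, p, q, r, k=1e-9):
    """Divisive-normalization circuit: weighted power sum over a pooled norm."""
    num = sum(wj * xj ** p for wj, xj in zip(w, x))
    den = k + sum(xj ** q for xj in x) ** r
    return num / den

x = [0.1, 0.8, 0.3]
w = [1.0, 1.0, 1.0]

# Max-like regime: numerator exponent one above the denominator's, r = 1.
max_like = canonical(x, w, p=9, q=8, r=1)   # close to max(x) = 0.8

# Tuning-like regime: linear numerator normalized by the input energy,
# so the response depends on the match between x and the weights w.
tuned = canonical(x, w, p=1, q=2, r=0.5)
```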

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5–10 subunits chosen at random from all possible afferents (~100–1000)

[Figure labels: stronger facilitation / stronger suppression]

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen. Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison with neural data

• V1
  – Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  – MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – Two-spot interaction (Freiwald et al. 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  – Tuning and invariance properties (Logothetis et al. 1995)
  – Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva, Poggio 2007)
  – Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model; these are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison with V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (Pasupathy & Connor 1999)

No parameter fitting

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4: explain figures etc.… Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio. J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007.

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed), probe (varying)

[Plot axes: response(probe) – response(reference) vs. resp(pair) – resp(reference)]

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that, with attention away, the line should pass through the origin and should get a > 0 coefficient.

Agreement with IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT, which is where they recorded from in the study, one can already read out object category and identity from the very first spikes.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
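The remark that the stage between IT and "PFC" is a linear classifier can be illustrated with a toy read-out: a perceptron trained on synthetic "IT-like" feature vectors. The two prototype patterns, the noise level and the training schedule are invented for illustration; the actual read-out experiments (Hung et al. 2005) used recorded IT population responses.

```python
import random

rng = random.Random(1)

def make_sample():
    """Synthetic 'IT-like' features: noisy points around one of two prototypes."""
    label = rng.choice([0, 1])
    proto = [0.2, 0.8, 0.4] if label == 0 else [0.8, 0.2, 0.6]
    return [p + rng.gauss(0, 0.05) for p in proto], label

def train_perceptron(samples, epochs=20, lr=0.1):
    """Plain perceptron: a linear read-out on top of fixed features."""
    w, b = [0.0] * 3, 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

train = [make_sample() for _ in range(200)]
w, b = train_perceptron(train)
acc = sum((1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == y
          for x, y in train) / len(train)
```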

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms → image–mask interval (ISI): 30 ms → mask (1/f noise): 80 ms

Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Macé et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
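The d′ mentioned on the slide is the standard signal-detection statistic, z(hit rate) - z(false-alarm rate), computable with the standard-normal inverse CDF from the Python standard library. The 80% hit rate is from the slide; the ~16% false-alarm rate is taken from the presenter note and used only as an illustrative input, not as the paper's exact value.

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection d': z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Illustrative numbers in the ballpark of the slide (80% hits, ~16% false alarms).
dp = d_prime(0.80, 0.16)   # about 1.84
```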

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
T. Serre
Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models: limitations
  • What is next: beyond feedforward models: limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 26: NIPS2007: visual recognition in primates and machines

Neuron basics

spikes

INPUT= pulses or graded potentials

COMPUTATION = Analog

OUTPUT = Chemical

Some numbers

bull

Human Brainndash

1011-1012

neurons (1 million flies )

ndash

1014- 1015

synapses

bull

Neuronndash

Fundamental space dimension

bull

fine dendrites 01 micro

diameter lipid bilayer

membrane 5 nm thick specific proteins pumps channels receptors enzymes

ndash

Fundamental time length 1 msec

The cerebral cortex

Human

MacaqueThickness

3 ndash 4 mm

1 ndash 2 mmTotal surface area

~1600 cm2 ~160 cm2

(both sides)

(~50cm diam)

(~15cm diam)

Neurons mmsup2

~10⁵ mm2 ~ 10⁵ mm2 Total cortical neurons

~2 x 1010

~2 x 109

Visual cortex

300 ndash

500 cm2

80+cm2Visual Neurons

~4 x 109 ~109 neurons

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In brain different regions are involved in different tasks13(and the brain areas are conveniently colored to reflect this toohellip)13vision audition somsens motorhellip Learning amp adaptation everywhere13---gt Q What are the compmech13Why can we hope that there are common computational mech13---gt common hardware CORTEX

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning amp informations processing in the brain we focus on the visual system --- most extensively researched modality1313areas connected

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel amp Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells dot of light bright dot on a dark bgd and vice-versa1313From dots of light to ldquooriented barsrdquo slightly larger RFs either white bar on black bgd or vice-versa sensitive to the location of the bar within RF1313Bar detector insensitive to polarity of contrast (white on black or black on white)13Larger RF13Insensitive to the location of the bar within RF1313How does that work

V1 Orientation selectivity

Hubel & Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]. Reproduced from [Rolls 2004].

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification

categorization

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006…)

• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                        Pooling operation
Simple      Selectivity (template matching)    Gaussian-like tuning ("and"-like)
Complex     Invariance                         Soft-max ("or"-like)

Simple units: Gaussian-like tuning operation (and-like)
Complex units: max-like operation (or-like)

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1: Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007

Max-like behavior in V4: Gawne & Martin 2002

Biophysical implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE
Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
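The idea of one canonical circuit serving both roles can be sketched with the divisive-normalization form discussed by Kouh & Poggio (2007): a single ratio of weighted power sums behaves max-like for one choice of exponents and tuning-like (a normalized dot product) for another. The exponent values below are illustrative, not fitted:

```python
def canonical_circuit(x, w, p, q, r, k=1e-9):
    # y = sum_j w_j x_j^p / (k + (sum_j x_j^q)^r)
    # The denominator models divisive (shunting) inhibition from the pool.
    num = sum(wj * xj ** p for wj, xj in zip(w, x))
    den = k + sum(xj ** q for xj in x) ** r
    return num / den

x = [0.1, 0.9, 0.2]

# Max-like: uniform weights, p = q + 1, r = 1  ->  soft-max of the inputs
softmax_out = canonical_circuit(x, [1.0] * 3, p=3, q=2, r=1)

# Tuning-like: p = 1, r = 1/2  ->  normalized dot product with template w,
# maximal when x is proportional to w (here w is x rescaled to unit norm)
norm = sum(v * v for v in x) ** 0.5
w = [v / norm for v in x]
tuning_out = canonical_circuit(x, w, p=1, q=2, r=0.5)
```

With these settings `softmax_out` is close to max(x) and `tuning_out` is close to 1 for a matched template, illustrating how one shunting-inhibition circuit could implement both key computations.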

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
  – Supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation
stronger suppression

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data

• V1
  • Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  • MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al. 2002)
  • Two-spot interaction (Freiwald et al. 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  • Tuning and invariance properties (Logothetis et al. 1995)
  • Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  • Rapid categorization (Serre, Oliva, Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: here, V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and data which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (no parameter fitting)

Pasupathy & Connor 1999
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units

Reference stimulus (fixed), probe stimulus (varying).
x-axis: response(probe) - response(reference)
y-axis: response(pair) - response(reference)

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that, with attention away, the line should pass through the origin and one should get a coefficient > 0.

Agreement w| IT Readout data

Hung, Kreiman, Poggio, DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in this study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
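The read-out stage can be illustrated with a toy linear classifier over a small "population" of feature responses. This is only a sketch: the actual experiments of Hung et al. 2005 used regularized linear classifiers on recorded IT responses, and the data values below are invented:

```python
def train_perceptron(X, y, epochs=20, lr=0.1):
    # Plain perceptron: learns a hyperplane w.x + b = 0 separating two classes.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):                       # labels yi in {-1, +1}
            s = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * s <= 0:                            # misclassified: update
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1

# toy "population responses" for two object categories
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
```

The point of the remark is that if the feature dictionary is selective and invariant enough, even such a simple linear read-out suffices for categorization.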

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms
Image-mask interval (ISI): 30 ms
Mask (1/f noise): 80 ms

Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%
Model: 82%

Serre, Oliva & Poggio 2007

• d' ~ standardized error rate
• the higher the d', the better the performance

Human: 80%

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job.
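For reference, d' is computed from hit and false-alarm rates via the inverse normal CDF. The 16% false-alarm figure below is taken from the presenter notes later in the deck and the 80% from the slide, treated as a hit rate purely for illustration:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    # Signal-detection sensitivity: d' = z(hits) - z(false alarms),
    # where z is the inverse CDF of the standard normal distribution.
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

print(round(d_prime(0.80, 0.16), 2))  # ~1.84
```

Unlike raw percent correct, d' separates sensitivity from response bias, which is why it is the preferred summary for this kind of two-alternative task.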

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is the old AI dream: ideally you would input an image such as the street scene on the left and the system would automatically parse the image into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and roads.

The StreetScenes Database

Object             car    pedestrian  bicycle  building  tree   road   sky
Labeled examples   5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter
Presentation Notes
TODO: Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T. Masquelier & S. Thorpe (CNRS, France)

Földiák 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhäuser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage
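The "correlation in time" idea can be sketched with a Földiák-style trace rule: the Hebbian update follows a temporally low-passed output, so inputs that occur in sequence (e.g. a stimulus sweeping across positions) become wired to the same unit. A toy sketch only; the parameters and setup are invented, and the Masquelier & Thorpe work uses spiking neurons, not this rate-based rule:

```python
def trace_rule(frames, eta=0.1, delta=0.2):
    # One output unit with a "trace" (low-passed) activity. The weight
    # update uses the trace instead of the instantaneous output, linking
    # inputs active at nearby times; weights are kept at unit norm.
    w = [0.1] * len(frames[0])
    trace = 0.0
    for x in frames:
        y = sum(wj * xj for wj, xj in zip(w, x))
        trace = (1 - delta) * trace + delta * y      # temporal low-pass
        w = [wj + eta * trace * xj for wj, xj in zip(w, x)]
        norm = sum(v * v for v in w) ** 0.5          # renormalize weights
        w = [wj / norm for wj in w]
    return w

# a stimulus sweeping across 4 positions, repeated many times
frames = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]] * 10
w = trace_rule(frames)  # all four positions end up with positive weight
```

Because the trace bridges successive frames, a unit trained this way responds to the stimulus at any of the swept positions, which is exactly the position invariance attributed to complex cells.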

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos and testing videos (each video ~4 s, 50-100 frames)

bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10-20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models: limitations

Serre, Oliva, Poggio 2007 (psychophysics)
Zoccolan, Kouh, Poggio, DiCarlo 2007 (IT)
Reynolds, Chelazzi & Desimone 1999 (V4)

What is next: beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions (model vs. humans): 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), no-mask condition

(Serre, Oliva and Poggio, PNAS 2007)

Presenter
Presentation Notes
False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention (detection accuracy vs. number of items)

Example Results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

T. Serre

Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

Page 27: NIPS2007: visual recognition in primates and machines

Some numbers

bull

Human Brainndash

1011-1012

neurons (1 million flies )

ndash

1014- 1015

synapses

bull

Neuronndash

Fundamental space dimension

bull

fine dendrites 01 micro

diameter lipid bilayer

membrane 5 nm thick specific proteins pumps channels receptors enzymes

ndash

Fundamental time length 1 msec

The cerebral cortex

Human

MacaqueThickness

3 ndash 4 mm

1 ndash 2 mmTotal surface area

~1600 cm2 ~160 cm2

(both sides)

(~50cm diam)

(~15cm diam)

Neurons mmsup2

~10⁵ mm2 ~ 10⁵ mm2 Total cortical neurons

~2 x 1010

~2 x 109

Visual cortex

300 ndash

500 cm2

80+cm2Visual Neurons

~4 x 109 ~109 neurons

Gross Brain Anatomy

A large percentage of the cortex devoted to vision

Presenter
Presentation Notes
In brain different regions are involved in different tasks13(and the brain areas are conveniently colored to reflect this toohellip)13vision audition somsens motorhellip Learning amp adaptation everywhere13---gt Q What are the compmech13Why can we hope that there are common computational mech13---gt common hardware CORTEX

The Visual System

[Van Essen amp Anderson 1990]

Presenter
Presentation Notes
In our studies of learning amp informations processing in the brain we focus on the visual system --- most extensively researched modality1313areas connected

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel amp Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells dot of light bright dot on a dark bgd and vice-versa1313From dots of light to ldquooriented barsrdquo slightly larger RFs either white bar on black bgd or vice-versa sensitive to the location of the bar within RF1313Bar detector insensitive to polarity of contrast (white on black or black on white)13Larger RF13Insensitive to the location of the bar within RF1313How does that work

V1 Orientation selectivity

Hubel amp Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake amp Tanaka 1994] Reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake

amp Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung Kreiman Poggio amp DiCarlo 2005

identification

categorization

See also Oram

amp Perrett

1992 Tovee

et al 1993 Celebrini

et al 1993 Ringach

et al 1997 Rolls et al 1999 Keysers

et al 2001

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (stimuli as in Pasupathy & Connor 1999). No parameter fitting.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: Finally, I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter Notes: More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field) (Reynolds et al 1999) vs. C2 units

Reference stimulus (fixed), probe stimulus (varying). Axes: response(probe) − response(reference) vs. resp(pair) − resp(reference).

Prediction: the response to the pair falls between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: The biased competition model predicts that with attention away the line should pass through the origin and should get a > 0 coefficient.

Agreement w| IT read-out data

(Hung, Kreiman, Poggio & DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier (like the one used in the read-out experiments)
• The inputs to IT are a large dictionary of selective and invariant features
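As an illustration of such a linear read-out stage, here is a minimal sketch: a regularized least-squares linear classifier trained on synthetic "IT-like" population vectors. The data, dimensions, and regularizer are invented for the example; this is not the read-out code of the Hung et al. experiments.

```python
import numpy as np

# Hypothetical stand-in for a population of IT-like units: each stimulus
# is a feature vector, and category read-out is a linear classifier.
rng = np.random.default_rng(0)

def make_population(n, dim, offset):
    """Synthetic population responses for one category (illustration only)."""
    return rng.normal(loc=offset, scale=1.0, size=(n, dim))

dim = 256                                   # number of "IT units"
X = np.vstack([make_population(100, dim, -0.2),
               make_population(100, dim, +0.2)])
y = np.array([-1.0] * 100 + [+1.0] * 100)   # two categories

# Regularized least-squares read-out: w = (X'X + lam*I)^-1 X'y
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(dim), X.T @ y)

pred = np.sign(X @ w)
accuracy = (pred == y).mean()
print(f"training accuracy of linear read-out: {accuracy:.2f}")
```

The point of the sketch is only that a single linear stage on top of a good feature dictionary suffices for categorization.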

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter Notes: Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

"Animal present or not?"

Image (20 ms) → interval image-mask (30 ms ISI) → mask, 1/f noise (80 ms)

(Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005)

Presenter Notes: We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. The model is feedforward (you can ask me during question time); the mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%; model: 82% (Serre, Oliva & Poggio 2007)

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter Notes: We tried simpler computer vision systems and they could not do the job.
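d' here is the standard signal-detection sensitivity index, computable from hit and false-alarm rates. A small sketch (the rates below are illustrative, not the study's exact numbers):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """d' = z(hit rate) - z(false-alarm rate); higher d' = higher sensitivity."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(fa_rate)

# Illustrative numbers: ~80% hits with ~16% false alarms
print(round(d_prime(0.80, 0.16), 2))
```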

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project (source: Bileschi & Wolf)

Presenter Notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left, and you would like the system to automatically parse the image into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Examples

Baseline systems compared:
• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al 2004)
• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007
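The pipeline described in the notes (model features computed over windows at all positions and scales, each window classified into one of the object classes) can be sketched schematically. The patch classifier below is a throwaway placeholder, not the Bileschi & Wolf system:

```python
import numpy as np

CLASSES = ["car", "pedestrian", "bicycle", "building", "tree", "road", "sky"]

def windows(image, size, stride):
    """Yield (row, col, patch) for a sliding window over a 2-D image."""
    h, w = image.shape
    for r in range(0, h - size + 1, stride):
        for c in range(0, w - size + 1, stride):
            yield r, c, image[r:r + size, c:c + size]

def classify_patch(patch):
    """Placeholder classifier: a trivial mean-brightness heuristic standing
    in for model features + a trained classifier (illustration only)."""
    return CLASSES[int(patch.mean() * len(CLASSES)) % len(CLASSES)]

rng = np.random.default_rng(1)
image = rng.random((64, 64))                 # stand-in for a street scene
labels = [(r, c, classify_patch(p)) for r, c, p in windows(image, 16, 16)]
print(len(labels), "windows labeled")
```

In the real system the same scan would be repeated at several scales, and each window's model features fed to a classifier trained on the labeled database.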

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


Layers of cortical processing units

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Learning the invariance from temporal continuity
w| T Masquelier & S Thorpe (CNRS, France)

(Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005)

Simple cells learn correlation in space (at the same time); complex cells learn correlation in time.

SHOW MOVIE

Presenter Notes: ~19 hours of video, ~10^6 frames; a developmental-like learning stage.
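The "correlation in time" idea can be sketched with a Foldiak (1991)-style trace rule, where the Hebbian update is gated by a slowly decaying trace of past activity, so inputs occurring close together in time (e.g. the same feature sweeping across positions) become wired to the same unit. This toy version is an illustration under invented parameters, not the Masquelier & Thorpe model:

```python
import numpy as np

def trace_rule(inputs, eta=0.05, decay=0.8):
    """Foldiak-style trace rule: w += eta * trace * x, where the
    activity trace mixes past and present responses of the unit."""
    n = inputs.shape[1]
    w = np.full(n, 1.0 / n)
    trace = 0.0
    for x in inputs:                      # one frame at a time
        y = float(w @ x)                  # unit response to this frame
        trace = decay * trace + (1 - decay) * y
        w += eta * trace * x              # Hebbian step gated by the trace
        w /= np.linalg.norm(w)            # keep weights bounded
    return w

# Toy "video": the same feature sweeping across 4 positions, repeated
frames = np.tile(np.eye(4), (50, 1))
w = trace_rule(frames)
print(np.round(w, 2))  # weights spread over all positions the feature visited
```

Because the trace bridges consecutive frames, the unit ends up responding to the feature at any of the positions it swept through, i.e. it behaves like a position-invariant complex unit.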


The problem

Training videos / testing videos; each video ~4 s, 50~100 frames.
Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al 2005); see also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (chance: 10~20%):

             Baseline   Our system
KTH Human    81.3%      91.6%
UCSD Mice    75.6%      79.0%
Weiz Human   86.7%      96.3%
Average      81.2%      89.6%

H Jhuang, T Serre, L Wolf and T Poggio, ICCV 2007

Presenter Notes: Our method has 8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva & Poggio 2007
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and no-mask condition; model shown for comparison.

(Serre, Oliva and Poggio, PNAS 2007)

Presenter Notes: False-alarm rate is ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al)

Chikkerur, Serre, Walther, Koch & Poggio, in prep

Face detection, scanning vs. attention (plot of detection accuracy vs. number of items)

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• A feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values.

(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related, challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences … (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

T Serre
Model: A Oliva, C Cadieu, U Knoblich, M Kouh, G Kreiman, M Riesenhuber
Comparison w| humans: A Oliva
Action recognition: H Jhuang
Attention: S Chikkerur, C Koch, D Walther
Computer vision: S Bileschi, L Wolf
Learning invariances: T Masquelier, S Thorpe


The cerebral cortex

                         Human                      Macaque
Thickness                3-4 mm                     1-2 mm
Total surface area       ~1600 cm² (~50 cm diam)    ~160 cm² (~15 cm diam)   (both sides)
Neurons / mm²            ~10⁵                       ~10⁵
Total cortical neurons   ~2 x 10¹⁰                  ~2 x 10⁹
Visual cortex            300-500 cm²                80+ cm²
Visual neurons           ~4 x 10⁹                   ~10⁹

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision.

Presenter Notes: In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning and adaptation are everywhere. → Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? → Common hardware: CORTEX.

The Visual System

[Van Essen & Anderson 1990]

Presenter Notes: In our studies of learning and information processing in the brain we focus on the visual system, the most extensively researched modality. Areas are connected.

V1: hierarchy of simple and complex cells

LGN-type cells → simple cells → complex cells (Hubel & Wiesel 1959)

Presenter Notes: LGN-type cells: dot of light, a bright dot on a dark background or vice-versa. Simple cells: from dots of light to "oriented bars"; slightly larger RFs, either a white bar on a black background or vice-versa, sensitive to the location of the bar within the RF. Complex cells: bar detectors insensitive to polarity of contrast (white on black or black on white), with larger RFs, insensitive to the location of the bar within the RF. How does that work?

V1: Orientation selectivity (Hubel & Wiesel movie)

V1: Retinotopy (Thorpe and Fabre-Thorpe 2001)

Beyond V1: A gradual increase in RF size
Reproduced from (Kobatake & Tanaka 1994) and (Rolls 2004)

Beyond V1: A gradual increase in the complexity of the preferred stimulus
Reproduced from (Kobatake & Tanaka 1994)

AIT: Face cells
Reproduced from (Desimone et al 1984)

AIT: Immediate recognition (identification and categorization)
(Hung, Kreiman, Poggio & DiCarlo 2005)

See also Oram & Perrett 1992; Tovee et al 1993; Celebrini et al 1993; Ringach et al 1997; Rolls et al 1999; Keysers et al 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

The ventral stream
(Source: Lennie, Maunsell, Movshon; Thorpe and Fabre-Thorpe 2001)

We consider the feedforward architecture only.

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Pooling       Computation         Operation
Simple      selectivity   template matching   Gaussian tuning (and-like)
Complex     invariance    soft-max            max-like (or-like)

Complex units: max-like operation (or-like)
Simple units: Gaussian-like tuning operation (and-like)
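The two operations can be written down directly. A minimal numpy sketch (the tuning width σ and the vector encoding are illustrative choices, not the tutorial's actual parameters):

```python
import numpy as np

def simple_unit(x, w, sigma=1.0):
    """Selectivity: Gaussian (and-like) tuning of a simple unit
    around its preferred template w."""
    return np.exp(-np.sum((x - w) ** 2) / (2 * sigma ** 2))

def complex_unit(afferents):
    """Invariance: max (or-like) pooling of a complex unit over
    simple units with the same tuning at nearby positions/scales."""
    return np.max(afferents)

w = np.array([1.0, 0.0, 1.0])                      # preferred template
print(simple_unit(np.array([1.0, 0.0, 1.0]), w))   # perfect match -> 1.0
print(complex_unit(np.array([0.2, 0.9, 0.4])))     # best afferent -> 0.9
```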

Gaussian tuning

• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)
• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Max-like operation

• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)

Biophysical implementation
(Knoblich, Koch & Poggio, in prep; Kouh & Poggio 2007; Knoblich, Bouvrie & Poggio 2007)

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition
• Of the same form as the model of MT (Rust et al, Nature Neuroscience 2007)
• Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al 2000; Miller and Troyer 2002)
• Adelson and Bergen (see also Hassenstein and Reichardt 1956)
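A sketch of such a canonical circuit in the spirit of Kouh & Poggio (2007): a weighted sum of afferents raised to a power, divisively normalized by the pooled activity. The exponents and constant below are illustrative assumptions; with p = q + 1 and uniform weights the ratio behaves like a soft-max, while other weight and exponent choices give tuning-like behavior.

```python
import numpy as np

def canonical_circuit(x, w=None, p=3, q=2, k=1e-6):
    """y = sum_j w_j x_j^p / (k + sum_j x_j^q).
    Divisive normalization: the numerator is a weighted power sum,
    the denominator pools the afferents' activity."""
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return float(np.sum(w * x ** p) / (k + np.sum(x ** q)))

resp = canonical_circuit([0.2, 0.9, 0.1])
print(round(resp, 3))   # close to max(x) = 0.9
```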

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations (stronger facilitation / stronger suppression)
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C Van Essen

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter Notes: Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of that C2 unit, so the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.
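Putting the S and C stages together, the S2→C2 step can be sketched as Gaussian tuning over a small subset of C1 afferents, followed by max pooling of identically tuned S2 units over positions and scales. The sizes, template, and subunit mask below are made-up illustrations, not the model's learned values:

```python
import numpy as np

rng = np.random.default_rng(0)

def s2_response(c1_patch, template, sigma=1.0):
    """S2: Gaussian (and-like) tuning to a learned template of C1 activities."""
    return np.exp(-np.sum((c1_patch - template) ** 2) / (2 * sigma ** 2))

def c2_response(c1_maps, template, subunits):
    """C2: max (or-like) pooling of one S2 type over all positions and scales."""
    best = 0.0
    for c1 in c1_maps:                       # one C1 map per scale
        for pos in range(len(c1) - len(subunits) + 1):
            patch = c1[pos:pos + len(subunits)][subunits]
            best = max(best, s2_response(patch, template))
    return best

subunits = np.array([True, False, True, True, False])  # 3 of 5 afferents
template = np.array([0.9, 0.1, 0.8])                   # stand-in "learned" template
c1_maps = [rng.random(12), rng.random(8)]              # toy C1 maps at two scales
print(round(c2_response(c1_maps, template, subunits), 3))
```

The C2 output keeps the S2 selectivity (the same template everywhere) while gaining tolerance to where and at what scale the preferred pattern appears, which is exactly the invariance property described above.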

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter Notes: There is an even more extreme version; later I am going to show that this is important.

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
w| T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage
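The trace-rule idea behind this slide (in the spirit of Foldiak 1991) can be sketched in a few lines. This is an illustrative toy of our own, not the Masquelier & Thorpe implementation; the learning rate, decay constant, and toy input sequence are invented:

```python
import numpy as np

def trace_update(w, x_t, trace, lr=0.05, eta=0.8):
    """One step of a Foldiak-style trace rule: the Hebbian term is driven by a
    low-pass-filtered (temporal trace) response, so inputs that occur close
    together in time get wired onto the same unit -- i.e. invariance."""
    y_t = float(w @ x_t)                    # instantaneous response
    trace = eta * trace + (1 - eta) * y_t   # temporally smoothed response
    w = w + lr * trace * x_t                # Hebb on the trace, not on y_t
    return w / np.linalg.norm(w), trace     # normalization keeps w bounded

# Toy movie: the same feature alternates between two neighboring "positions"
# (input dimensions 0 and 1); dimensions 2 and 3 are never stimulated.
rng = np.random.default_rng(0)
w = rng.random(4)
w /= np.linalg.norm(w)
trace = 0.0
frames = [np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0])] * 100
for x_t in frames:
    w, trace = trace_update(w, x_t, trace)
# w now loads on both temporally contiguous inputs: position invariance.
```

Because the Hebbian term uses the smoothed trace rather than the instantaneous response, both "positions" of the feature end up driving the same unit.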

The problem: training videos / testing videos (each video ~4 s, 50-100 frames)

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (chance: 10-20%)

Dataset       Baseline   Our system
KTH Human     81.3%      91.6%
UCSD Mice     75.6%      79.0%
Weiz. Human   86.7%      96.3%
Average       81.2%      89.6%

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average

What is next? Beyond feedforward models: limitations

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Model vs. human performance across mask conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and no-mask condition
(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms SOA the model is not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no mask with 14%) and the model ~18%

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based, top-down attention) and serial (spatial attention) processing to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Plot: detection accuracy vs. number of items — face detection, scanning vs. attention]

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next? Image inference: backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
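As a toy illustration of this analysis-by-synthesis idea (our own two-level example with invented probability tables, not the model of any of the cited papers): a top-level object hypothesis generates an intermediate feature, the feature generates the observed evidence, and recognition is posterior inference over the hypothesis.

```python
import numpy as np

# Invented toy numbers: two object hypotheses, two intermediate features.
p_obj = np.array([0.5, 0.5])                 # top-down prior P(object)
p_feat_given_obj = np.array([[0.9, 0.1],     # object 0 mostly causes feature 0
                             [0.2, 0.8]])    # object 1 mostly causes feature 1
p_ev_given_feat = np.array([0.7, 0.05])      # bottom-up likelihood P(evidence | feature)

# Intermediate-level "neurons": P(evidence | object), marginalizing the feature.
p_ev_given_obj = p_feat_given_obj @ p_ev_given_feat
# Bottom-up evidence and top-down prior combine into P(object | evidence).
posterior = p_ev_given_obj * p_obj
posterior /= posterior.sum()
# The evidence favors feature 0, so the posterior favors object 0.
```

Each level only multiplies a conditional probability by the message from below, which is the basic operation the Bayesian accounts above ascribe to cortical neurons.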

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data
Tomaso Poggio and Steve Smale
Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences
…
Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

T. Serre

Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10-year-old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models: limitations
  • What is next: beyond feedforward models: limitations
  • Limitations: beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators

Gross Brain Anatomy

A large percentage of the cortex is devoted to vision

Presenter
Presentation Notes
In the brain, different regions are involved in different tasks (and the brain areas are conveniently colored to reflect this too…): vision, audition, somatosensory, motor… Learning & adaptation everywhere. --> Q: What are the computational mechanisms? Why can we hope that there are common computational mechanisms? --> common hardware: CORTEX

The Visual System

[Van Essen & Anderson 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system — the most extensively researched modality. Areas connected.

V1: hierarchy of simple and complex cells

LGN-type cells → Simple cells → Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: dot of light, bright dot on a dark background and vice-versa.
From dots of light to "oriented bars": slightly larger RFs, either white bar on black background or vice-versa, sensitive to the location of the bar within the RF.
Complex cells: bar detector insensitive to polarity of contrast (white on black or black on white); larger RF; insensitive to the location of the bar within the RF.
How does that work?

V1: Orientation selectivity

[Hubel & Wiesel movie]

V1: Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]; reproduced from [Rolls 2004]

Beyond V1: A gradual increase in RF size
Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: A gradual increase in the complexity of the preferred stimulus

AIT: Face cells
Reproduced from (Desimone et al. 1984)

AIT: Immediate recognition (identification, categorization)
Hung, Kreiman, Poggio & DiCarlo 2005

See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream
(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only.

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                       Pooling operation   Operation
Simple      Selectivity / template matching   Gaussian tuning     AND-like
Complex     Invariance                        Soft-max            OR-like

Simple units: Gaussian-like tuning operation (AND-like)
Complex units: max-like operation (OR-like)
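A minimal numerical sketch of these two operations (our illustrative code; the template and patch vectors are made up): the S-layer applies Gaussian tuning of each input patch to stored templates (AND-like selectivity), and the C-layer max-pools each template's responses over positions (OR-like invariance).

```python
import numpy as np

def s_layer(patches, templates, sigma=1.0):
    """AND-like selectivity: Gaussian tuning of every patch to every template."""
    d2 = ((patches[:, None, :] - templates[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))   # shape: (n_patches, n_templates)

def c_layer(s_responses):
    """OR-like invariance: max over positions, one output per template."""
    return s_responses.max(axis=0)            # shape: (n_templates,)

templates = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
# Template 0's preferred stimulus appears at the third position:
patches = np.array([[0.2, 0.2], [0.5, 0.5], [1.0, 0.0], [0.3, 0.1]])
c = c_layer(s_layer(patches, templates))
# c[0] is 1.0 regardless of WHERE [1, 0] appears among the patches.
```

Because the C-layer only keeps the maximum, the output is unchanged under any permutation of the patch positions — which is exactly the position tolerance the slide attributes to complex units.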

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys. implementation

• Max- and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as a model of MT (Rust et al., Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
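The claim that one canonical circuit covers both operations can be sketched with the divisive-normalization form studied by Kouh & Poggio (2007), y = (Σ_j w_j x_j^p) / (k + (Σ_j x_j^q)^r); the particular exponent settings and inputs below are illustrative choices of ours:

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-9):
    """Divisive normalization (cf. Kouh & Poggio 2007): changing the exponents
    moves the same circuit between Gaussian-like tuning and max-like pooling."""
    return float((w * x ** p).sum() / (k + (x ** q).sum() ** r))

x = np.array([1.0, 0.2, 0.1])
w = np.ones_like(x)

# Max-like regime (p=3, q=2, r=1): the output is dominated by the largest input.
max_like = canonical_circuit(x, w, p=3, q=2, r=1)

# Tuning-like regime (p=1, q=2, r=0.5): a normalized dot product, largest
# when the input pattern points along the weight vector.
matched    = canonical_circuit(w / np.linalg.norm(w), w, p=1, q=2, r=0.5)
mismatched = canonical_circuit(np.array([1.0, 0.0, 0.0]), w, p=1, q=2, r=0.5)
```

One set of exponents gives a softmax that tracks the largest afferent; the other gives a normalized dot product that peaks at the stored template — the two key computations from a single circuit.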

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF
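The supervised stage ("~ Gaussian RBF") amounts to a Gaussian radial-basis-function network trained on labeled examples. Here is a self-contained toy version (the data, σ, and regularization constant are invented by us), with a regularized least-squares readout:

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian RBF units: one tuned unit per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Two toy classes: 2-D points around (-2, -2) and (+2, +2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

centers = X                      # one RBF unit per training example
K = rbf_features(X, centers, sigma=1.0)
# Regularized least squares for the linear readout weights:
w = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)

predictions = np.sign(rbf_features(X, centers, sigma=1.0) @ w)
```

The structure mirrors the slide: a layer of Gaussian-tuned units (here centered on the training examples) followed by a linear, task-specific readout learned with supervision.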

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

[Figure legend: stronger facilitation / stronger suppression]

Neurons in monkey visual area V2 encode combinations of orientations
Akiyuki Anzai, Xinmiao Peng & David C. Van Essen
Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally.

One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale).

To a first approximation, this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version… Later I am going to show that this is important.

• V1
  • Simple and complex cells: tuning (Schiller et al. 1976; Hubel & Wiesel 1965; DeValois et al. 1982)
  • MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al. 2002)
  • Two-spot interaction (Freiwald et al. 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  • Tuning and invariance properties (Logothetis et al. 1995)
  • Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  • Rapid categorization (Serre, Oliva, Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas.

There are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model; here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model. These are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model.

I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (no parameter fitting)
(Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

A Model of V4 Shape Selectivity and Invariance
Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio
J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.

Presenter
Presentation Notes
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. model C2 units

Reference stimulus (fixed), probe stimulus (varying):
x-axis: response(probe) - response(reference)
y-axis: response(pair) - response(reference)

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin and should get a > 0 coefficient.

Agreement w| IT Readout data

Hung, Kreiman, Poggio, DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in this study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier – like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image (20 ms) → image-mask interval (ISI, 30 ms) → mask (1/f noise, 80 ms)

Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80% correct; model: 82% correct

• d' ~ standardized error rate
• the higher the d', the better the performance

Serre, Oliva & Poggio 2007

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job.
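The d' figure of merit used here is the standard signal-detection sensitivity, d' = z(hit rate) − z(false-alarm rate), which accounts for response bias rather than raw percent correct. A quick stdlib computation (the 0.80 hit and 0.16 false-alarm rates are the approximate figures quoted on these slides):

```python
from statistics import NormalDist

def dprime(hit_rate, fa_rate):
    """Signal-detection sensitivity: z(hits) - z(false alarms)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# ~80% hits with ~16% false alarms, as on the animal/non-animal task:
sensitivity = dprime(0.80, 0.16)
```

Comparing d' rather than accuracy is what lets the human and model performances be matched even if the two adopt slightly different response criteria.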

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project
Source: Bileschi & Wolf

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse it into its main objects. What they did was to use the features produced by the model, computed over small windows at all positions and scales in the image, and classify each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 30: NIPS2007: visual recognition in primates and machines

The Visual System

[Van Essen & Anderson 1990]

Presenter
Presentation Notes
In our studies of learning & information processing in the brain we focus on the visual system --- the most extensively researched modality. Areas connected.

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel & Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells: dot of light; bright dot on a dark bgd, and vice-versa.

From dots of light to "oriented bars": slightly larger RFs; either white bar on black bgd or vice-versa; sensitive to the location of the bar within the RF.

Bar detector insensitive to polarity of contrast (white on black or black on white). Larger RF. Insensitive to the location of the bar within the RF.

How does that work?

V1 Orientation selectivity

Hubel & Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]. Reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification

categorization

See also Oram & Perrett 1992; Tovee et al 1993; Celebrini et al 1993; Ringach et al 1997; Rolls et al 1999; Keysers et al 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al 1998; Amit & Mascaro 2003; Deco & Rolls 2006…)

• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Pooling computation                Operation
Simple      Selectivity / template matching    Gaussian tuning (and-like)
Complex     Invariance                         Soft-max (or-like)

Gaussian-like tuning operation (and-like): simple units
Max-like operation (or-like): complex units
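The two operations in the table above can be sketched numerically. A minimal illustration (not the model's actual implementation; the sigma and exponent values are illustrative) of Gaussian template matching for simple units and soft-max pooling for complex units:

```python
import math

def gaussian_tuning(inputs, template, sigma=1.0):
    """Simple-unit response: bell-shaped ("and-like") tuning around a
    stored template, maximal when the input pattern matches the template."""
    d2 = sum((x - w) ** 2 for x, w in zip(inputs, template))
    return math.exp(-d2 / (2 * sigma ** 2))

def soft_max_pooling(inputs, q=8):
    """Complex-unit response: soft-max ("or-like") pooling over afferents;
    as q grows, the output approaches the max of the inputs."""
    den = sum(x ** q for x in inputs)
    num = sum(x ** (q + 1) for x in inputs)
    return num / den if den else 0.0

# A simple unit fires most for its preferred pattern...
print(gaussian_tuning([1.0, 0.2], [1.0, 0.2]))
# ...while a complex unit roughly follows its strongest afferent.
print(round(soft_max_pooling([0.1, 0.9, 0.3]), 2))
```

With q = 8 the pooled response over [0.1, 0.9, 0.3] is already within about 1% of the hard max, which is why a soft-max is a plausible neural stand-in for a max operation.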

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch & Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie & Poggio 2007)

Biophys implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as a model of MT (Rust et al, Nature Neuroscience 2007).

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al 2000; Miller and Troyer 2002).

Adelson and Bergen (see also Hassenstein and Reichardt 1956)
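The canonical-circuit idea can be made concrete. A minimal sketch assuming the divisive-normalization form y = (sum_j w_j * x_j^p) / (k + (sum_j x_j^q)^r) discussed by Kouh & Poggio; the exponents and test values below are illustrative, not fitted to any data:

```python
def canonical_circuit(x, w, p=1, q=2, r=0.5, k=1e-9):
    """Divisive normalization: an excitatory pool divided by a shunting
    (inhibitory) pool. Different exponents yield different operations."""
    num = sum(wj * xj ** p for wj, xj in zip(w, x))
    den = k + sum(xj ** q for xj in x) ** r
    return num / den

x = [0.2, 0.9, 0.4]
w = [1.0, 1.0, 1.0]
# p=1, q=2, r=0.5: dot product over the input norm -> response depends on
# the angle between x and w (Gaussian-like tuning, magnitude-invariant).
tuning = canonical_circuit(x, w, p=1, q=2, r=0.5)
# p=3, q=2, r=1: high-order ratio -> dominated by the largest input (max-like).
max_like = canonical_circuit(x, w, p=3, q=2, r=1)
print(round(tuning, 2), round(max_like, 2))
```

The point of the slide is that one biophysically plausible circuit, with different parameter settings, can serve as both the "and-like" and the "or-like" operation.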

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)

• Combination of V1-like complex units at different orientations

• Synaptic weights w learned from natural images

• 5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

• Local pooling over S2 units with same selectivity but slightly different positions and scales

• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.
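The S2/C2 pair described above — template matching at S2, then a local max over nearby positions and scales at C2 — can be sketched with toy dimensions (the real model uses thousands of templates learned from natural images; names and sizes here are illustrative):

```python
import math

def s2_response(c1_patch, template, sigma=1.0):
    """S2: Gaussian tuning of a patch of C1 afferents to a stored template
    (a fragment memorized during a developmental-like stage)."""
    d2 = sum((a - t) ** 2 for a, t in zip(c1_patch, template))
    return math.exp(-d2 / (2 * sigma ** 2))

def c2_response(c1_patches, template):
    """C2: max over S2 units sharing the same template at slightly
    different positions/scales -> same selectivity, more invariance."""
    return max(s2_response(p, template) for p in c1_patches)

template = [0.9, 0.1, 0.4]
patches = [[0.0, 0.0, 0.0],   # template absent at this position...
           [0.9, 0.1, 0.4],   # ...present at this one
           [0.5, 0.5, 0.5]]
# The C2 unit fires strongly wherever its template appears in the pool.
print(c2_response(patches, template))
```

This is exactly the sense in which a C2 unit keeps the selectivity of its S2 afferents while gaining tolerance to where and at what scale the preferred stimulus occurs.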

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)

• "Replication" of simpler selectivities from lower to higher areas

• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data

• V1
  • Simple and complex cells tuning (Schiller et al 1976; Hubel & Wiesel 1965; Devalois et al 1982)
  • MAX operation in subset of complex cells (Lampl et al 2004)

• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al 2002)
  • Two-spot interaction (Freiwald et al 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)

• IT
  • Tuning and invariance properties (Logothetis et al 1995)
  • Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
  • Read out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)

• Human
  • Rapid categorization (Serre, Oliva, Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (no parameter fitting)
Pasupathy & Connor 1999

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4: explain figures etc… Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007.

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al 1999) vs. model C2 units

Reference (fixed); probe (varying)
x-axis: response(probe) - response(reference); y-axis: resp(pair) - resp(reference)

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away the line should pass through the origin and should get a > 0 coefficient.

Agreement w| IT Readout data

Hung, Kreiman, Poggio & DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

• In the theory, the stage between IT and "PFC" is a linear classifier – like the one used in the read-out experiments

• The inputs to IT are a large dictionary of selective and invariant features
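The linear-classifier remark can be illustrated end to end. A toy sketch with entirely synthetic "IT-like" feature vectors and a plain perceptron (this is not the classifier or the data of the read-out experiments, just the shape of the computation: supervised, and linear in the features):

```python
import random

def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Plain perceptron: the task-specific stage the model places between
    IT and PFC is a linear readout of the feature dictionary."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:  # mistake-driven update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

random.seed(0)
# Synthetic "IT features": class +1 clusters near 1.0, class -1 near 0.0.
feats = [[random.gauss(1.0, 0.2) for _ in range(8)] for _ in range(20)] + \
        [[random.gauss(0.0, 0.2) for _ in range(8)] for _ in range(20)]
labels = [1] * 20 + [-1] * 20

w, b = train_perceptron(feats, labels)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in feats]
acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(acc)  # on this easy, well-separated toy set the readout is near perfect
```

The interesting claim is not the classifier (which is trivial) but that the features feeding it are already selective and invariant enough for a linear boundary to suffice.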

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms
Interval image-mask (ISI): 30 ms
Mask (1/f noise): 80 ms

Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms. This is a fairly long SOA. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). Mask tries to block feedback in visual cortex.

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
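The d' mentioned on the slide is the standard signal-detection sensitivity index, d' = Z(hit rate) - Z(false-alarm rate). A short stdlib sketch (the rates below are illustrative, roughly the regime the presenter notes mention for human observers, not the paper's exact numbers):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index: separation of signal and noise distributions in
    z-units. Unlike raw accuracy, it is unaffected by response bias."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# e.g. ~80% hits on animal images with ~16% false alarms on distractors:
print(round(d_prime(0.80, 0.16), 2))
```

This is why the slide calls d' a "standardized error rate": two observers with the same d' discriminate equally well even if one says "animal" more liberally than the other.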

Further comparisons

• Image-by-image correlation
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60

• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse the image into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object             car    pedestrian  bicycle  building  tree   road   sky
Labeled examples   5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)

• Part-based system (Leibe et al 2004)

• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture

• Extensions to color and stereo

• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?

• Extension to time and videos

• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next

• A challenge for physiology: disprove basic aspects of the architecture

• Extensions to color and stereo

• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?

• Extension to time and videos

• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels

• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

with T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
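The temporal-continuity idea above (cf. Foldiak 1991) can be illustrated with a toy trace rule: a unit strengthens weights to inputs that are active while a slowly decaying trace of its own activity is high, so inputs occurring in close temporal succession become wired together. The update and constants here are illustrative, not the actual rule used in the work with T. Masquelier & S. Thorpe:

```python
def trace_rule(sequences, n_inputs, eta=0.5, decay=0.5):
    """Foldiak-style trace rule. Inputs that follow each other in time
    (e.g. the same edge sweeping across retinal positions) end up pooled
    by the same unit -> position invariance learned from video."""
    w = [0.0] * n_inputs
    for seq in sequences:
        trace = 0.0  # activity trace resets between unrelated sequences
        for frame in seq:  # frame: one-hot vector of the active input
            y = sum(wi * xi for wi, xi in zip(w, frame)) + sum(frame)
            trace = decay * trace + (1 - decay) * y
            w = [wi + eta * trace * xi for wi, xi in zip(w, frame)]
    return w

# An oriented bar sweeping across three positions (inputs 0, 1, 2) in time;
# input 3 never takes part in the sweep.
sweep = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
w = trace_rule([sweep] * 5, 4)
print([round(wi, 2) for wi in w])  # weights 0-2 grow together; 3 stays at 0
```

After training, the unit responds to the bar at any of the swept positions: exactly the complex-cell-like invariance the slide attributes to correlation in time.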

What is next

• A challenge for physiology: disprove basic aspects of the architecture

• Extensions to color and stereo

• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?

• Extension to time and videos

• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50~100 frames

bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al 2005)

See also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset      Baseline   Our system
KTH Human    81.3       91.6
UCSD Mice    75.6       79.0
Weiz Human   86.7       96.3
Average      81.2       89.6

Chance: 10~20%. H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next beyond feedforward models limitations

Psychophysics: Serre, Oliva, Poggio 2007
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult

• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)

• Notice: this is a "novel" justification for the need of attention

Model vs. human performance across masking conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), no mask condition

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, model not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and the model ~18%.

Ongoing work…

Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention (detection accuracy vs. number of items)

Example Results

What is next

• Image inference: attentional or Bayesian models

• Why hierarchies? Beyond a model: towards a theory

• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better

• Normal vision is much more than categorization or identification: image understanding/inference/parsing

• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance

• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
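In its simplest form, the Bayesian picture above is just Bayes' rule at each level: a top-down prior (expectation from higher areas) combined with a bottom-up likelihood (evidence from lower areas). A toy sketch with purely illustrative numbers, not any of the cited models:

```python
def posterior(prior, likelihoods, evidence):
    """P(hypothesis | evidence) ∝ P(evidence | hypothesis) * P(hypothesis):
    bottom-up sensory signal weighted by top-down expectation."""
    unnorm = {h: prior[h] * likelihoods[h][evidence] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# Top-down context makes "animal" a priori likely; a bottom-up V4-like
# feature ("curved contour") is also more probable under "animal".
prior = {"animal": 0.6, "non-animal": 0.4}
lik = {"animal":     {"curved": 0.7, "straight": 0.3},
       "non-animal": {"curved": 0.2, "straight": 0.8}}
print(posterior(prior, lik, "curved"))
```

The hierarchical models cited (Lee and Mumford etc.) iterate this kind of exchange between adjacent areas until the beliefs at all levels are globally consistent.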

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences … Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models: limitations
  • What is next: beyond feedforward models: limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 31: NIPS2007: visual recognition in primates and machines

V1 hierarchy of simple and complex cells

LGN-type cells

Simple cells

Complex cells

(Hubel amp Wiesel 1959)

Presenter
Presentation Notes
LGN-type cells dot of light bright dot on a dark bgd and vice-versa1313From dots of light to ldquooriented barsrdquo slightly larger RFs either white bar on black bgd or vice-versa sensitive to the location of the bar within RF1313Bar detector insensitive to polarity of contrast (white on black or black on white)13Larger RF13Insensitive to the location of the bar within RF1313How does that work

V1 Orientation selectivity

Hubel amp Wiesel movie

V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake amp Tanaka 1994] Reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake

amp Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung Kreiman Poggio amp DiCarlo 2005

identification

categorization

See also Oram

amp Perrett

1992 Tovee

et al 1993 Celebrini

et al 1993 Ringach

et al 1997 Rolls et al 1999 Keysers

et al 2001

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy
• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w/ neural data
• V1
  • Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  • MAX operation in subset of complex cells (Lampl et al. 2004)
• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al. 2002)
  • Two-spot interaction (Freiwald et al. 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  • Tuning and invariance properties (Logothetis et al. 1995)
  • Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  • Rapid categorization (Serre, Oliva & Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: here, those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model. These are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w/ V4
Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)
V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78
(Pasupathy & Connor 1999)
No parameter fitting
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units
Reference (fixed); Probe (varying)
x-axis: response(probe) - response(reference)
y-axis: response(pair) - response(reference)
Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away, the line should pass through the origin and should have a coefficient > 0.

Agreement w/ IT read-out data
Hung, Kreiman, Poggio & DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in the study), from the very first spikes one can already read out object category and identity.

Remarks
• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier – like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
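A sketch of the kind of linear read-out mentioned above: a simple perceptron trained on IT-like feature vectors. The toy data and names are hypothetical; the actual read-out experiments used linear classifiers (e.g. regularized least squares or linear SVMs) on recorded IT population responses:

```python
def train_linear_readout(samples, labels, epochs=20, lr=0.1):
    """Perceptron: learns weights so that sign(w.x + b) predicts the label.
    Stands in for the linear IT-to-'PFC' classifier stage of the theory."""
    n = len(samples[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if pred != y:  # mistake-driven update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```

On linearly separable toy "feature vectors" the rule converges to a perfect separator, which is the point of the remark: if the dictionary feeding IT is selective and invariant enough, a linear stage suffices.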

Rapid categorization
[SHOW ANIMAL / NON-ANIMAL MOVIE]
Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)
Animal present or not?
Image: 20 ms → Image-mask interval (ISI): 30 ms → Mask (1/f noise): 80 ms
Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with a 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…
Human observers (n = 24): 80%
Model: 82%
Serre, Oliva & Poggio 2007
• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
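The d' mentioned on the slide can be computed directly from hit and false-alarm rates with the standard signal-detection definition d' = z(H) - z(FA); a minimal sketch:

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate).
    Unlike raw accuracy, d' is unaffected by a shift in response criterion."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(false_alarm_rate)
```

For example, 80% hits with 20% false alarms gives d' ≈ 1.68, while chance performance (hit rate equal to false-alarm rate) gives d' = 0.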

Further comparisons
• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)
Serre, Oliva & Poggio, PNAS 2007
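The ρ values above are image-by-image correlations between model and human performance within each category. A generic correlation coefficient can be sketched as follows (Pearson form shown for illustration; the slide does not say whether the published analysis used Pearson or rank correlation):

```python
import math

def correlation(xs, ys):
    """Pearson correlation between two equal-length lists,
    e.g. per-image accuracy of the model vs. human observers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```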

Source: Bileschi & Wolf
The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream where ideally you would like to input an image such as the street scene on the left and have the system automatically parse the image into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object            car    pedestrian  bicycle  building  tree   road   sky
Labeled Examples  5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules
http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

What is next
• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning
Layers of cortical processing units

Learning the invariance from temporal continuity
w/ T. Masquelier & S. Thorpe (CNRS, France)
Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005
Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time
[SHOW MOVIE]

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; a developmental-like learning stage.
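The idea that complex cells learn correlations in time (Foldiak 1991) is often written as a trace rule: the learning signal is a slowly decaying trace of the unit's recent activity, so inputs that follow each other across consecutive frames get wired to the same unit. A minimal sketch, with illustrative parameter names and values:

```python
def trace_rule_step(w, x, y, trace, eta=0.05, delta=0.8):
    """One frame of Foldiak-style trace learning.
    trace: low-pass-filtered postsynaptic activity; weights move toward the
    current input in proportion to the trace, linking inputs across time."""
    trace = delta * trace + (1.0 - delta) * y
    w = [wi + eta * trace * (xi - wi) for wi, xi in zip(w, x)]
    return w, trace

# Present the same active afferent for many consecutive frames:
w, trace = [0.0, 0.0], 0.0
for _ in range(50):
    w, trace = trace_rule_step(w, [1.0, 0.0], 1.0, trace)
```

After the sequence, the weight onto the temporally persistent afferent has grown while the silent one stays at zero: the unit has learned to pool what co-occurred in time.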

What is next
• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: Training videos / Testing videos
each video ~4 s, 50-100 frames
bend, jack, jump, pjump, run, walk, side, wave1, wave2
Dataset from (Blank et al. 2005)
See also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

             Baseline   Our system
KTH Human    81.3       91.6
UCSD Mice    75.6       79.0
Weiz Human   86.7       96.3
Average      81.2       89.6

(chance: 10-20%)
H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models (limitations)
Psychophysics: Serre, Oliva & Poggio 2007
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models (limitations)
• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, model not good enough
[Figure legend: model; 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition]
(Serre, Oliva and Poggio, PNAS 2007)

Presenter
Presentation Notes
False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks
Model implementation of Wolfe's guided search (1994)
Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)
Chikkerur, Serre, Walther, Koch & Poggio, in prep.
Face detection: scanning vs. attention
[Figure: detection accuracy vs. number of items]
Example Results

What is next
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms
• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models
Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values
Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
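The probabilistic reading above (a neuron's activity as a conditional probability of the bottom-up input given a top-down hypothesis) reduces, in the simplest discrete case, to Bayes' rule. A toy sketch with hypothetical hypothesis names:

```python
def posterior(prior, likelihood):
    """P(h | input) is proportional to P(input | h) * P(h),
    normalized over the discrete set of top-down hypotheses h."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}
```

In a hierarchical scheme such as Lee and Mumford's, messages of this kind pass up and down between areas until the areas' beliefs become globally consistent.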

What is next
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003
The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory
Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007
From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences
…
Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark
There is still much to do before we understand vision… and the brain

Collaborators
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
T. Serre
Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber


Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber


V1 Retinotopy

(Thorpe and Fabre-Thorpe 2001)

Reproduced from (Kobatake & Tanaka 1994); reproduced from (Rolls 2004)

Beyond V1: a gradual increase in RF size

Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: a gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification / categorization

See also Oram & Perrett 1992; Tovee et al 1993; Celebrini et al 1993; Ringach et al 1997; Rolls et al 1999; Keysers et al 2001

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)

• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                       Pooling operation      Logic
Simple      selectivity (template matching)   Gaussian-like tuning   AND-like
Complex     invariance                        soft-max               OR-like

Max-like operation (OR-like): complex units
Gaussian-like tuning operation (AND-like): simple units
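As a toy sketch of the two operations above (our own illustration, not the model's actual code; function names and the choice of a hard max are ours):

```python
import math

def simple_unit(x, template, sigma=1.0):
    # Gaussian-like tuning (AND-like): peak response when the input
    # pattern x matches the stored template, falling off with distance.
    d2 = sum((xi - ti) ** 2 for xi, ti in zip(x, template))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def complex_unit(afferents):
    # Max-like pooling (OR-like): respond as strongly as the best-matching
    # afferent, buying tolerance to position/scale of the preferred stimulus.
    return max(afferents)
```

For example, a complex unit pooling over simple units tuned to the same bar at shifted positions responds whenever any one of them does.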

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys. implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter notes: (TODO: add cable and cite relevant people.) Of the same form as the model of MT (Rust et al, Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
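A minimal sketch of how one divisive-normalization circuit can yield both behaviors, assuming the ratio-of-powers form discussed by Kouh & Poggio (the exponent values here are illustrative, not fitted):

```python
def tuning_like(x, w, k=1e-9):
    # Normalized dot product: selectivity for the pattern stored in w,
    # via divisive (shunting-like) normalization by overall input strength.
    num = sum(wj * xj for wj, xj in zip(w, x))
    return num / (k + sum(xj * xj for xj in x) ** 0.5)

def max_like(x, q=20, k=1e-9):
    # Softmax: y = sum(x_j^(q+1)) / (k + sum(x_j^q)); as q grows,
    # the strongest afferent dominates and y approaches max(x).
    num = sum(xj ** (q + 1) for xj in x)
    return num / (k + sum(xj ** q for xj in x))
```

With low exponents the circuit behaves like Gaussian-like tuning; with high exponents, like a soft max; this is the sense in which one canonical circuit can approximate both operations.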

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1,000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1,000)

stronger facilitation / stronger suppression

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
"Neurons in monkey visual area V2 encode combinations of orientations", Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter notes: Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter notes: In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data

• V1
  • Simple and complex cells tuning (Schiller et al 1976; Hubel & Wiesel 1965; De Valois et al 1982)
  • MAX operation in a subset of complex cells (Lampl et al 2004)

• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al 2002)
  • Two-spot interaction (Freiwald et al 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)

• IT
  • Tuning and invariance properties (Logothetis et al 1995)
  • Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)

• Human
  • Rapid categorization (Serre, Oliva, Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are three colors for three types of data. Yellow is for data that I used to actually constrain the model; here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and data which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (Pasupathy & Connor 1999). No parameter fitting.

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes: Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007.
"A Model of V4 Shape Selectivity and Invariance", Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter notes: More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al 1999) vs. C2 units

Reference stimulus (fixed); probe (varying). Plot: resp(pair) - resp(reference) against response(probe) - response(reference).

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: The biased competition model predicts that with attention away the line should pass through the origin and should have a coefficient > 0.

Agreement w| IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes: Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in the study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
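For illustration only, a minimal linear read-out stage of the kind described above could be a perceptron over the IT-like feature vector (the slides only identify the classifier as linear; this particular training rule and the function name are our own):

```python
def train_linear_readout(features, labels, lr=0.1, epochs=20):
    # Perceptron: learn w, b such that sign(w.x + b) matches the label
    # (labels are +1 / -1; features stand in for IT population responses).
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(features, labels):
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * s <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b
```

The point of the remark stands independently of the training rule: once the feature dictionary is selective and invariant, a simple linear stage suffices for category and identity read-out.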

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter notes: Any task can be made arbitrarily easy or difficult.

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms; image-mask interval (ISI): 30 ms; mask (1/f noise): 80 ms

Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005

Presenter notes: We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%; model: 82%

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter notes: We tried simpler computer vision systems and they could not do the job.

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS, 2007

Source: Bileschi & Wolf

The street scene project

Presenter notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse it into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5,799  1,449       209      5,067     4,932  3,400  2,562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter notes: TODO: Add 420 True Labels to this slide.

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al 2004)
• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI, 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
w| T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

[SHOW MOVIE]

Presenter notes: ~19 hours of video, ~10^6 frames; developmental-like learning stage.
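The "correlation in time" idea can be sketched with a Foldiak-style trace rule: the Hebbian update is gated by a temporally low-pass-filtered response, so frames that follow each other in a video strengthen the same unit. This is our own minimal formulation, not the exact rule used in the work with Masquelier & Thorpe:

```python
def trace_step(w, x, y_bar, lr=0.05, eta=0.8):
    # One presentation of input frame x to a linear unit with weights w.
    y = sum(wi * xi for wi, xi in zip(w, x))   # instantaneous response
    y_bar = eta * y_bar + (1.0 - eta) * y      # temporal trace of activity
    # Hebbian update gated by the trace: recent frames share credit,
    # so temporally contiguous inputs wire onto the same (invariant) unit.
    w = [wi + lr * y_bar * xi for wi, xi in zip(w, x)]
    return w, y_bar
```

Run over a sequence of shifted views of the same object, the trace links them, and the unit becomes invariant to the transformation connecting them.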

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos, testing videos

Each video ~4 s, 50-100 frames

bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al 2005); see also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10-20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter notes: Our method has 8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions: 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition. Model vs. humans.

(Serre, Oliva and Poggio, PNAS 2007)

Presenter notes: FA ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al)

Chikkerur, Serre, Walther, Koch & Poggio, in prep

Face detection: scanning vs. attention (plot: detection accuracy vs. number of items)

Example Results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al 2006, …)

"How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples. A comparison with real brains offers another, related, challenge to learning theory. The 'learning algorithms' we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory? Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex."

"The Mathematics of Learning: Dealing with Data", Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex", CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models: limitations
  • What is next: beyond feedforward models: limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 34: NIPS2007: visual recognition in primates and machines

(Thorpe and Fabre-Thorpe 2001)

Reproduced from [Kobatake & Tanaka 1994]; reproduced from [Rolls 2004]

Beyond V1: A gradual increase in RF size
Reproduced from (Kobatake & Tanaka 1994)

Beyond V1: A gradual increase in the complexity of the preferred stimulus

AIT: Face cells
Reproduced from (Desimone et al 1984)

AIT: Immediate recognition
Hung, Kreiman, Poggio & DiCarlo 2005
identification / categorization
See also Oram & Perrett 1992; Tovee et al 1993; Celebrini et al 1993; Ringach et al 1997; Rolls et al 1999; Keysers et al 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

The ventral stream
Source: Lennie, Maunsell, Movshon

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al 1998; Amit & Mascaro 2003; Deco & Rolls 2006…)

• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                       Operation
Simple      Selectivity (template matching)   Gaussian-like tuning (AND-like)
Complex     Invariance                        Soft-max (OR-like)

Simple units: Gaussian-like tuning operation (AND-like)
Complex units: max-like operation (OR-like)
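The two operations in the table can be written out in a few lines. This is an illustrative sketch only, not the released model code: the Gaussian width sigma and the soft-max exponent q below are arbitrary assumptions.

```python
import numpy as np

def s_unit(x, w, sigma=1.0):
    """Simple unit: Gaussian tuning (AND-like), maximal when input x matches template w."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    return float(np.exp(-np.sum((x - w) ** 2) / (2 * sigma ** 2)))

def c_unit(afferents, q=8.0):
    """Complex unit: soft-max pooling (OR-like), approaches max(afferents) as q grows."""
    a = np.asarray(afferents, dtype=float)
    return float(np.sum(a ** (q + 1)) / np.sum(a ** q))
```

With these definitions, `s_unit` returns 1.0 when the input equals its template, and `c_unit([0.1, 0.9])` is close to 0.9, i.e. the complex unit inherits the response of its strongest afferent.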

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys. implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter notes: ADD CABLE AND CITE RELEVANT PEOPLE. Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
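The "same canonical circuit" claim can be illustrated numerically with a divisive-normalization formula in the spirit of Kouh & Poggio (2007), cited above. The exponents p, q, r, the test vector and the weights here are illustrative assumptions, not fitted values: one formula behaves tuning-like or max-like depending on the exponents.

```python
import numpy as np

def canonical(x, w, p, q, r, k=1e-9):
    """Divisive normalization: sum(w * x^p) / (k + (sum x^q)^r)."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(w, x ** p) / (k + np.sum(x ** q) ** r))

x = np.array([0.2, 0.9, 0.4])

# Tuning-like regime (p=1, q=2, r=1/2): a normalized dot product that
# peaks when the input matches the weight template.
tuned = canonical(x, w=x / np.linalg.norm(x), p=1, q=2, r=0.5)

# Max-like regime (p=q+1, r=1, uniform weights): a soft-max that
# approaches max(x) as the exponents grow.
maxish = canonical(x, w=np.ones_like(x), p=9, q=8, r=1.0)
```

Here `tuned` is essentially 1 (perfect template match) and `maxish` is within 1% of max(x) = 0.9, so the same circuit can serve both the simple-unit and complex-unit roles.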

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
  – Supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

[Figure: learned S2 weights, from stronger facilitation to stronger suppression]

Anzai, A., X. Peng & D. C. Van Essen. "Neurons in monkey visual area V2 encode combinations of orientations." Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter notes: Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.
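The S2/C2 distinction described above can be made concrete with a toy sketch. The array shapes, pooling range and sigma are assumptions for illustration, not the model's actual parameters: an S2 unit is Gaussian-tuned to a template over a patch of C1 afferents, and a C2 unit takes a max over S2 units sharing that template at nearby positions.

```python
import numpy as np

rng = np.random.default_rng(0)
template = rng.random((4, 3, 3))   # template over 4 orientations x 3x3 C1 positions

def s2(c1_patch, w=template, sigma=1.0):
    """S2 unit: Gaussian tuning to its template (AND-like)."""
    return float(np.exp(-np.sum((c1_patch - w) ** 2) / (2 * sigma ** 2)))

def c2(c1_map):
    """C2 unit: max over S2 responses at all positions of a larger C1 map (OR-like)."""
    o, H, W = c1_map.shape
    return max(s2(c1_map[:, i:i + 3, j:j + 3])
               for i in range(H - 2) for j in range(W - 2))

# Position tolerance: embed the template anywhere in a larger map and
# the C2 response stays at the S2 peak value.
big = np.zeros((4, 8, 8))
big[:, 2:5, 3:6] = template
```

Shifting the embedded template to a different location leaves `c2` at its peak: same selectivity as S2, larger range of invariance, exactly the distinction stated in the bullets above.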

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter notes: In an even more extreme version… Later I am going to show that this is important.

Comparison w/ neural data
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

• V1
  • Simple and complex cells tuning (Schiller et al 1976; Hubel & Wiesel 1965; Devalois et al 1982)
  • MAX operation in subset of complex cells (Lampl et al 2004)
• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al 2002)
  • Two-spot interaction (Freiwald et al 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)
• IT
  • Tuning and invariance properties (Logothetis et al 1995)
  • Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  • Rapid categorization (Serre, Oliva & Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

Presenter notes: For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: here, V1 data used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and data which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w/ V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78, no parameter fitting
(Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

Cadieu, C., M. Kouh, A. Pasupathy, C. E. Connor, M. Riesenhuber and T. Poggio. "A Model of V4 Shape Selectivity and Invariance." J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007.

Presenter notes: More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field) (Reynolds et al 1999) vs. C2 units

Reference stimulus (fixed), probe (varying); axes: response(probe) - response(reference) vs. response(pair) - response(reference)

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: The biased competition model predicts that with attention away the line should pass through the origin and one should get a > 0 coefficient.

Agreement w/ IT readout data

Hung, Kreiman, Poggio & DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes: Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
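The linear-classifier stage described in the remarks can be sketched as regularized least squares on top of invariant features. The feature dimensions, sample count and ridge parameter below are arbitrary illustrative choices, and the random features merely stand in for C2/IT-like responses:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))           # stand-in for C2/IT-like feature vectors
y = np.sign(X[:, 0] + 0.5 * X[:, 1])     # a linearly decodable category label

lam = 1e-2                               # ridge (regularization) parameter
# Regularized least squares: w = (X'X + lam*I)^-1 X'y
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
accuracy = float(np.mean(np.sign(X @ w) == y))
```

The point of the sketch is architectural: once the features are selective and invariant, the task-specific circuit can be this simple and still classify well.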

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter notes: Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Sequence: image (20 ms) → image-mask interval (ISI, 30 ms) → mask, 1/f noise (80 ms)

Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005

Presenter notes: We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%; Model: 82%
(Serre, Oliva & Poggio 2007)

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter notes: We tried simpler computer vision systems and they could not do the job.
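For reference, the d' sensitivity index quoted on the slide is computed from hit and false-alarm rates via the inverse normal CDF. The 80%/16% figures below are just the human numbers quoted on these slides, reused as a worked example:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# e.g. 80% hits with ~16% false alarms (the FA figure mentioned in the
# presenter notes later in this deck)
d_human = d_prime(0.80, 0.16)   # ~1.84
```

Unlike raw percent correct, d' separates sensitivity from response bias, which is why it is the standard measure for this kind of masked two-choice task.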

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project
Source: Bileschi & Wolf

Presenter notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter notes: TODO: Add 420 True Labels to this slide.

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al 2004)
• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
w/ T. Masquelier & S. Thorpe (CNRS, France)

(Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005)

Simple cells learn correlation in space (at the same time); complex cells learn correlation in time

SHOW MOVIE

Presenter notes: ~19 hours of video, ~10^6 frames; developmental-like learning stage.
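A minimal sketch of the temporal-continuity idea, in the spirit of the trace rule of Foldiak (1991, cited above). The decay constant, learning rate and two-pattern toy "video" are assumptions for illustration: inputs that follow each other in time get wired onto the same unit, yielding invariance.

```python
import numpy as np

def trace_learn(frames, w, eta=0.1, decay=0.8):
    """Hebbian updates gated by a temporally smoothed activity trace."""
    trace = 0.0
    for x in frames:
        y = float(w @ x)                   # response to the current frame
        trace = decay * trace + (1 - decay) * y
        w = w + eta * trace * x            # strengthen inputs active while trace is high
        w = w / np.linalg.norm(w)          # normalization keeps weights bounded
    return w

# Two "views" of the same object shown in temporal succession
view_a, view_b = np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0])
frames = [view_a, view_b] * 20
w = trace_learn(frames, w=view_a.copy())   # unit initially responds only to view_a
```

Because `view_b` always appears while the trace left by `view_a` is still high, the unit ends up responding to both views: invariance learned purely from temporal order, with no labels.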

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50~100 frames
Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al 2005); see also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset       Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter notes: Our method has 8% better performance on average.

What is next: beyond feedforward models: limitations

Psychophysics: Serre, Oliva & Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

What is next: beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice this is a "novel" justification for the need of attention

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Example results
[Plot: detection accuracy vs. number of items; face detection: scanning vs. attention]

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)
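A toy numeric illustration of that Bayesian picture (the two-hypothesis setup and all numbers are invented for the sketch, not taken from any of the cited models): a top-down prior multiplies a bottom-up likelihood to give the posterior the hierarchy should converge to.

```python
import numpy as np

prior = np.array([0.7, 0.3])        # top-down expectation over two hypotheses
likelihood = np.array([0.2, 0.9])   # P(bottom-up sensory input | hypothesis)

posterior = prior * likelihood      # Bayes rule, up to normalization
posterior = posterior / posterior.sum()
```

Even though the prior favors hypothesis 0, strong bottom-up evidence flips the posterior toward hypothesis 1: the kind of bottom-up/top-down reconciliation the cited models attribute to feedforward/feedback interactions in the ventral stream.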

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 35: NIPS2007: visual recognition in primates and machines

Reproduced from [Kobatake amp Tanaka 1994] Reproduced from [Rolls 2004]

Beyond V1 A gradual increase in RF size

Reproduced from (Kobatake

amp Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung Kreiman Poggio amp DiCarlo 2005

identification

categorization

See also Oram

amp Perrett

1992 Tovee

et al 1993 Celebrini

et al 1993 Ringach

et al 1997 Rolls et al 1999 Keysers

et al 2001

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

Comparison w| neural data
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

• V1
  • Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; Devalois et al. 1982)
  • MAX operation in subset of complex cells (Lampl et al. 2004)
• V4
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al. 2002)
  • Two-spot interaction (Freiwald et al. 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  • Tuning and invariance properties (Logothetis et al. 1995)
  • Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  • Rapid categorization (Serre, Oliva & Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model. These are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78, with no parameter fitting.
(Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

A Model of V4 Shape Selectivity and Invariance
Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio
J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.

Presenter
Presentation Notes
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units

Reference stimulus (fixed), probe stimulus (varying).
Axes: response(probe) - response(reference) vs. response(pair) - response(reference).

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin and have a coefficient > 0.
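The prediction above follows directly from max-like pooling: if a unit's response to two simultaneously presented stimuli is (approximately) a soft maximum of its responses to each alone, the pair response is bounded by the two single-stimulus responses. A minimal numerical illustration; the firing rates and the softmax exponent below are made-up values, not data from the experiment:

```python
def softmax_pool(responses, p=4.0):
    """Soft-max pooling: sum(r^(p+1)) / sum(r^p).
    As p grows this approaches max(responses); for p = 0 it is the mean."""
    num = sum(r ** (p + 1) for r in responses)
    den = sum(r ** p for r in responses)
    return num / den if den else 0.0

reference, probe = 30.0, 10.0   # toy firing rates to each stimulus alone
pair = softmax_pool([reference, probe])
# A max-like unit's pair response lies between the single-stimulus responses:
assert min(reference, probe) <= pair <= max(reference, probe)
```

With a large exponent the pooled value hugs the stronger response; with a small one it approaches the average, which is one way to parameterize where a given V4 cell sits between averaging and MAX behavior.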

Agreement w| IT readout data

Hung, Kreiman, Poggio & DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in the study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
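The division of labor in these remarks (Gaussian-RBF-like tuning producing an IT-like representation, then a linear read-out toward "PFC") can be sketched as follows. This is a toy stand-in: the random features, centers, labels and the least-squares read-out are assumptions for illustration, not the tutorial's actual data or classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_features(x, centers, sigma=2.0):
    """Gaussian RBF tuning of units around stored templates (toy 'IT')."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

centers = rng.normal(size=(20, 5))        # learned "templates"
X = rng.normal(size=(100, 5))             # toy stimuli
y = (X[:, 0] > 0).astype(float)           # toy binary task

Phi = rbf_features(X, centers)
# Linear read-out on top of the RBF representation (plus a bias column),
# in the spirit of the linear classifiers used in the read-out experiments.
Phi1 = np.hstack([Phi, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Phi1, y, rcond=None)
train_acc = ((Phi1 @ w > 0.5) == y).mean()
```

The point of the sketch is architectural: all the task-specific learning lives in the final linear weights, while the RBF dictionary underneath is task-independent.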

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image (20 ms) → interval image-mask (30 ms ISI) → mask, 1/f noise (80 ms)

(Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005)

Presenter
Presentation Notes
We used a 30 ms ISI with a 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80% vs. model: 82%

• d′ ~ standardized error rate
• the higher the d′, the better the performance

(Serre, Oliva & Poggio 2007)

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job.
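d′ is computed from hit and false-alarm rates via the inverse normal CDF, so it measures sensitivity independently of response bias. A small sketch using only the Python standard library; the example rates are placeholders, not the study's exact numbers:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate).
    z is the inverse CDF of the standard normal distribution."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# e.g. 80% hits with 16% false alarms:
sensitivity = d_prime(0.80, 0.16)
```

Two observers with the same percent correct can have different d′ if one trades hits for false alarms, which is why d′ rather than raw accuracy is the fairer model-vs-human comparison.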

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project

Source: Bileschi & Wolf

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse it into its main objects. What they did is use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO: Add 420 True Labels to this slide.

Examples


• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
w| T. Masquelier & S. Thorpe (CNRS, France)

(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

Simple cells learn correlation in space (at the same time).
Complex cells learn correlation in time.

[SHOW MOVIE]

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
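The idea that complex cells "learn correlation in time" is commonly modeled with a trace rule in the style of Foldiak (1991): the Hebbian update uses a temporally low-passed (traced) postsynaptic activity, so inputs active in consecutive frames, e.g. the same feature at neighboring positions as an object moves, get wired onto the same unit. A minimal sketch; the learning rate, trace constant, baseline drive and toy stimulus are all assumptions, not the parameters used with the video data:

```python
import numpy as np

def trace_rule(frames, eta=0.1, delta=0.8):
    """Foldiak-style trace learning for one 'complex' unit.

    frames: array of input vectors presented in temporal order.
    The postsynaptic trace ybar integrates activity over time, so
    inputs active in consecutive frames are potentiated together.
    """
    w = np.zeros(frames.shape[1])
    ybar = 0.0
    for x in frames:
        y = float(w @ x) + 0.1               # small baseline drive (toy)
        ybar = delta * ybar + (1 - delta) * y  # temporal trace of activity
        w += eta * ybar * x                  # Hebbian update with traced activity
        w /= max(np.linalg.norm(w), 1e-9)    # keep weights bounded
    return w

# A bar sweeping across 4 positions, one position active per frame:
frames = np.eye(4)
w = trace_rule(frames)
# After the sweep the unit has positive weight at every swept position,
# i.e. it responds to the feature regardless of position.
assert (w > 0).all()
```

Replacing the instantaneous activity with its trace is the whole trick: with delta = 0 the rule collapses to plain Hebbian learning and each position would compete instead of being bound together.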


The problem
Training videos / testing videos; each video ~4 s, 50~100 frames.

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset        Baseline   Our system
KTH Human      81.3%      91.6%
UCSD Mice      75.6%      79.0%
Weiz. Human    86.7%      96.3%
Average        81.2%      89.6%

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva & Poggio 2007
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions (model vs. human): 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), no-mask condition.

(Serre, Oliva and Poggio, PNAS 2007)

Presenter
Presentation Notes
False alarms: ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) processing to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention (detection accuracy vs. number of items)

Example results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream; neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values.

(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

"How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples. A comparison with real brains offers another related challenge to learning theory. The 'learning algorithms' we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory? Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex."

The Mathematics of Learning: Dealing with Data
Tomaso Poggio and Steve Smale
Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…
(Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

Page 36: NIPS2007: visual recognition in primates and machines

Reproduced from (Kobatake

amp Tanaka 1994)

Beyond V1 A gradual increase in the complexity of the preferred stimulus

AIT Face cells

Reproduced from (Desimone et al 1984)

AIT Immediate recognition

Hung Kreiman Poggio amp DiCarlo 2005

identification

categorization

See also Oram

amp Perrett

1992 Tovee

et al 1993 Celebrini

et al 1993 Ringach

et al 1997 Rolls et al 1999 Keysers

et al 2001

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time); complex cells learn correlation in time.

[Movie shown here]

Presenter notes: ~19 hours of video, ~10^6 frames; developmental-like learning stage.
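The temporal-continuity idea on this slide (a unit comes to pool inputs that occur close together in time, so a stimulus sweeping across positions wires onto one invariant unit) is often modeled with a trace rule. A toy sketch, not the Masquelier & Thorpe implementation; the learning rate `eta` and trace `decay` are arbitrary illustrative values:

```python
def trace_rule_update(w, inputs, eta=0.05, decay=0.8):
    """Toy trace rule: weights move toward a decaying running average
    (trace) of recent inputs, so patterns adjacent in time end up
    strengthening the same unit -> invariance from temporal continuity."""
    trace = [0.0] * len(w)
    for x in inputs:
        # leaky temporal average of the input stream
        trace = [decay * t + (1 - decay) * xi for t, xi in zip(trace, x)]
        # move weights toward the trace rather than the instantaneous input
        w = [wi + eta * (ti - wi) for wi, ti in zip(w, trace)]
    return w
```

Feeding two alternating "shifted" patterns (e.g. `[1, 0]` then `[0, 1]`) leaves both weight components positive: the unit has become selective to the pattern at either position.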


The problem: training videos and testing videos (each video ~4 s, 50–100 frames)

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset | Baseline | Our system
KTH Human | 81.3% | 91.6%
UCSD Mice | 75.6% | 79.0%
Weiz. Human | 86.7% | 96.3%
Average | 81.2% | 89.6%

(chance: 10–20%) H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter notes: our method has ~8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva, Poggio 2007 | V4: Reynolds, Chelazzi & Desimone 1999 | IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, the model is not good enough

Conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and no-mask condition (Serre, Oliva and Poggio, PNAS 2007)

Presenter notes: false-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention
[Figure: detection accuracy vs. number of items]

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

T. Serre
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

• Visual Recognition in Primates and Machines
• Motivation for studying vision: trying to understand how the brain works
• This tutorial: using a class of models to summarize/interpret experimental results
• Slide Number 4
• The problem: recognition in natural images (e.g. "is there an animal in the image?")
• How does visual cortex solve this problem? How can computers solve this problem?
• A "feedforward" version of the problem: rapid categorization
• A model of the ventral stream which is also an algorithm
• …solves the problem (if mask forces feedforward processing)…
• Slide Number 10
• Object recognition for computer vision: personal historical perspective
• Examples: Learning Object Detection: Finding Frontal Faces
• ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
• Object recognition in cortex: historical perspective
• Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
• An idea for a module for view-invariant identification
• Learning to Recognize 3D Objects in IT Cortex
• Recording Sites in Anterior IT
• Neurons tuned to object views, as predicted by model
• A "View-Tuned" IT Cell
• But also view-invariant object-specific neurons (5 of them over 1000 recordings)
• View-tuned cells: scale invariance (one training view only) motivates present model
• From "HMAX" to the model now…
• Slide Number 24
• Neural Circuits
• Neuron basics
• Some numbers
• The cerebral cortex
• Gross Brain Anatomy
• The Visual System
• V1: hierarchy of simple and complex cells
• V1: orientation selectivity
• V1: retinotopy
• Slide Number 34
• Beyond V1: a gradual increase in RF size
• Beyond V1: a gradual increase in the complexity of the preferred stimulus
• AIT: face cells
• AIT: immediate recognition
• Slide Number 39
• The ventral stream
• We consider feedforward architecture only
• Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
• Two key computations
• Slide Number 44
• Gaussian tuning
• Max-like operation
• Biophys. implementation
• Slide Number 48
• Slide Number 49
• S2 units
• Slide Number 51
• C2 units
• A loose hierarchy
• Comparison w/ neural data
• Comparison w/ V4
• Slide Number 56
• Slide Number 57
• Slide Number 58
• Agreement w/ IT readout data
• Remarks
• Rapid categorization
• Slide Number 62
• Rapid categorization task (with mask to test feedforward model)
• …solves the problem (when mask forces feedforward processing)…
• Further comparisons
• The street scene project
• The StreetScenes Database
• Examples
• Examples
• Examples
• Examples
• Slide Number 72
• Slide Number 73
• Slide Number 74
• What is next?
• What is next?
• Slide Number 77
• Learning the invariance from temporal continuity
• What is next?
• The problem
• Previous work: recognizing biological motion using a model of the dorsal stream
• Multi-class recognition accuracy
• What is next: beyond feedforward models (limitations)
• What is next: beyond feedforward models (limitations)
• Limitations: beyond 50 ms, model not good enough
• Slide Number 86
• Attention and cortical feedbacks
• Example Results
• What is next?
• What is next: image inference, backprojections and attentional mechanisms
• Attention-based models with high-level specialized routines
• Bayesian models
• What is next?
• Slide Number 94
• Formalizing the hierarchy: towards a theory
• Slide Number 96
• Slide Number 97
• (Obvious) caution remark
• Collaborators

AIT: Face cells
Reproduced from (Desimone et al. 1984)

AIT: Immediate recognition
Hung, Kreiman, Poggio & DiCarlo 2005 (identification; categorization)
See also Oram & Perrett 1992; Tovee et al. 1993; Celebrini et al. 1993; Ringach et al. 1997; Rolls et al. 1999; Keysers et al. 2001


The ventral stream
Source: Lennie, Maunsell, Movshon; (Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type | Computation | Pooling operation
Simple | Selectivity / template matching | Gaussian tuning (AND-like)
Complex | Invariance | Soft-max (OR-like)

Simple units: Gaussian-like tuning operation (AND-like). Complex units: max-like operation (OR-like).
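The two operations above can be sketched in a few lines of pure Python. This is a toy illustration, not the tutorial's implementation; the tuning width `sigma` and the softmax sharpness `p` are arbitrary illustrative values:

```python
import math

def gaussian_tuning(x, w, sigma=1.0):
    """Simple-unit (AND-like) response: bell-shaped tuning around a
    stored template w; peaks at 1 when the input matches the template."""
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return math.exp(-dist2 / (2.0 * sigma ** 2))

def soft_max_pool(responses, p=8.0):
    """Complex-unit (OR-like) response: softmax pooling over afferent
    simple units; approaches the plain max as p grows."""
    num = sum(r ** (p + 1) for r in responses)
    den = sum(r ** p for r in responses) + 1e-12
    return num / den
```

For example, `gaussian_tuning` returns 1.0 on a perfect template match, and `soft_max_pool([0.1, 0.9])` returns a value close to (but just under) 0.9, the strongest afferent.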

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995); Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007; Gawne & Martin 2002). Max-like behavior in V4 (Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys. implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  - Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  - Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5–10 subunits chosen at random from all possible afferents (~100–1000)

[Figure: stronger facilitation vs. stronger suppression]

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen. Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007; doi:10.1038/nn1975

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter notes: the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of that C2 unit, so the only difference between S2 and C2 units is the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.
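The S2/C2 relationship described above can be sketched directly: an S2 unit is Gaussian-tuned to a stored template of C1 afferents, and a C2 unit takes the max over S2 units that share that template at different positions/scales. A toy illustration, not the actual model code; a "patch" here is just a small vector of C1 responses, and `sigma` is arbitrary:

```python
import math

def s2_response(c1_patch, template, sigma=1.0):
    """S2 (simple) unit: Gaussian tuning of a local patch of C1 afferents
    (several orientations) to a stored template learned from images."""
    d2 = sum((a - b) ** 2 for a, b in zip(c1_patch, template))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def c2_response(c1_patches, template, sigma=1.0):
    """C2 (complex) unit: max over S2 units with the same template but
    different positions/scales -> same selectivity, more tolerance."""
    return max(s2_response(p, template, sigma) for p in c1_patches)
```

Because of the max, the C2 unit responds at full strength as long as the template appears in any of the pooled patches, which is exactly the "same selectivity, increased tolerance" point of the slide.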

A loose hierarchy

• Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter notes: in an even more extreme version… later I am going to show that this is important.

Comparison w/ neural data

• V1
  - Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; Devalois et al. 1982)
  - MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  - Rapid categorization (Serre, Oliva, Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: for the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Three colors mark three types of data: data used to constrain the model (V1 data used to build a population of model simple and complex units that mimics the tuning properties of cortical cells); data already predicted by the original model and still consistent with the new model; and data consistent with the new model, in some cases provided by the authors of the corresponding studies, in others collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model.

Comparison w/ V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations.

V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (Pasupathy & Connor 1999). No parameter fitting.

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes: the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level).

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio. J. Neurophysiol. 98: 1733–1750, 2007. First published June 27, 2007.

Presenter notes: more recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but fitted to single cells with a greedy learning algorithm).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units

Reference stimulus (fixed), probe (varying). Selectivity = response(probe) − response(reference); sensory interaction = response(pair) − response(reference).

Prediction: the response to the pair falls between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: the biased competition model predicts that with attention away the line should pass through the origin with a coefficient > 0.

Agreement w/ IT readout data

Hung, Kreiman, Poggio, DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes: the level of invariance to position and scale is actually predicted by the model. One key result of this study is that the visual system is extremely fast: from the very first spikes after the visual signal reaches IT (where they recorded), one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
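The structure described in these remarks (a Gaussian RBF feature stage followed by a linear readout) can be sketched as follows. A toy illustration only: in the model the centers come from learning and the readout weights from supervised training, whereas here they are hand-picked:

```python
import math

def rbf_features(x, centers, sigma=1.0):
    """Gaussian RBF layer: one unit per stored center, mimicking an
    IT-like population of selective, tuned units."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c))
                     / (2.0 * sigma ** 2))
            for c in centers]

def linear_readout(features, weights, bias=0.0):
    """Linear classifier on the population response (the IT -> 'PFC'
    stage; same form as the read-out experiments' classifier)."""
    return sum(f * w for f, w in zip(features, weights)) + bias
```

With two centers and weights `[+1, -1]`, inputs near the first center get a positive score and inputs near the second a negative one, i.e. a two-class readout over the tuned population.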

Rapid categorization

[Animal / non-animal movie shown here]

Database collected by Oliva & Torralba

Presenter notes: any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not? 20 ms image, 30 ms image-mask interval (ISI), 80 ms mask (1/f noise)

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter notes: we used a 30 ms ISI with 20 ms presentation, equivalent to an SOA of 50 ms, a fairly long SOA. The model is feedforward (you can ask me during question time); the mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%; model: 82% (Serre, Oliva & Poggio 2007)

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter notes: we tried simpler computer vision systems and they could not do the job.
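The d' measure used here is the standard signal-detection sensitivity index: the z-transformed hit rate minus the z-transformed false-alarm rate, which separates discriminability from response bias. A minimal sketch (the 80% hit rate and ~16% false-alarm rate in the usage check are taken from this slide's numbers, not an exact replication of the analysis):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate).
    Higher d' means better discrimination; equal rates give d' = 0."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(fa_rate)
```

For example, `d_prime(0.80, 0.16)` gives roughly 1.8, while a chance-level observer with equal hit and false-alarm rates gets d' = 0.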

Further comparisons

• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project
Source: Bileschi & Wolf

Presenter notes: another application of the model, by Stan Bileschi and Lior Wolf (Lior was a postdoc in the lab; this was part of Stan's PhD thesis). This is the old AI dream: input an image such as a street scene and have the system automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, classified as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object | Labeled examples
car | 5799
pedestrian | 1449
bicycle | 209
building | 5067
tree | 4932
road | 3400
sky | 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 38: NIPS2007: visual recognition in primates and machines

AIT Immediate recognition

Hung, Kreiman, Poggio & DiCarlo 2005

identification

categorization

See also Oram & Perrett 1992; Tovee et al 1993; Celebrini et al 1993; Ringach et al 1997; Rolls et al 1999; Keysers et al 2001

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

Source: Lennie, Maunsell, Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)

• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type | Computation                     | Pooling         | Operation
Simple    | Selectivity (template matching) | Gaussian tuning | and-like
Complex   | Invariance                      | Soft-max        | or-like

Simple units: Gaussian-like tuning operation (and-like)
Complex units: max-like operation (or-like)
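The two operations in the table can be written in a few lines. Below is a minimal NumPy sketch; the tuning width `sigma` and the soft-max exponent `q` are illustrative choices, not values from the tutorial:

```python
import numpy as np

def simple_unit(x, template, sigma=1.0):
    """Selectivity: Gaussian (and-like) tuning around a stored template."""
    x = np.asarray(x, dtype=float)
    d2 = float(np.sum((x - template) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def complex_unit(afferents, q=4.0):
    """Invariance: soft-max (or-like) pooling over afferent simple units.
    As q grows this approaches a hard max."""
    r = np.asarray(afferents, dtype=float)
    return float(np.sum(r ** (q + 1)) / (1e-12 + np.sum(r ** q)))
```

A simple unit responds maximally when the input equals its template; the complex unit inherits the strongest afferent response regardless of which afferent produced it, which is where the tolerance to position and scale comes from.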

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
Max-like behavior in V4 (Gawne & Martin 2002; Knoblich, Koch, Poggio, in prep; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
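The claim that one circuit can yield both operations can be illustrated with a divisive-normalization form y = (sum_j w_j x_j^p) / (k + (sum_j x_j^q)^r), where the denominator plays the role of the shunting inhibition. The sketch below is illustrative, in the spirit of Kouh & Poggio 2007; the exponents are assumptions, not fitted values:

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-4):
    """y = (sum_j w_j x_j**p) / (k + (sum_j x_j**q)**r).
    The denominator is the divisive (shunting-like) inhibition term."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(w, x ** p)) / (k + float(np.sum(x ** q)) ** r)

def max_like(x, q=4):
    """Uniform weights with p = q + 1, r = 1 give a soft-max."""
    return canonical_circuit(x, np.ones(len(x)), p=q + 1, q=q, r=1.0)

def tuning_like(x, w):
    """p = 1, q = 2, r = 1/2 gives a normalized dot product,
    i.e. bell-shaped tuning around the template direction w."""
    return canonical_circuit(x, w, p=1.0, q=2.0, r=0.5)
```

Changing only the exponents turns the same circuit from an or-like pooling stage into an and-like tuning stage, which is the point of the slide.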

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

(Figure: stronger facilitation vs. stronger suppression)

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975. Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen.

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of S2 vs. C2 units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. Then the S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.
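The S2/C2 pair described above can be sketched as template matching followed by local max pooling. This is a toy 2-D sketch: the real model pools over orientations, positions and scales, and `sigma` here is an arbitrary choice:

```python
import numpy as np

def s2_response(c1_patch, prototype, sigma=1.0):
    """S2: Gaussian (and-like) tuning of a patch of C1 afferents to a
    template learned from natural images."""
    d2 = float(np.sum((c1_patch - prototype) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def c2_response(c1_map, prototype, sigma=1.0):
    """C2: max (or-like) pooling over S2 units sharing the same template
    at every position -> same selectivity, more position tolerance
    (scale pooling omitted for brevity)."""
    ph, pw = prototype.shape
    h, w = c1_map.shape
    return max(
        s2_response(c1_map[i:i + ph, j:j + pw], prototype, sigma)
        for i in range(h - ph + 1)
        for j in range(w - pw + 1)
    )
```

Shifting the preferred pattern inside the map leaves the C2 response unchanged, which is exactly the increased tolerance the slide refers to.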

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w/ neural data

• V1
  – Simple and complex cells tuning (Schiller et al 1976; Hubel & Wiesel 1965; Devalois et al 1982)
  – MAX operation in subset of complex cells (Lampl et al 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al 2002)
  – Two-spot interaction (Freiwald et al 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)
• IT
  – Tuning and invariance properties (Logothetis et al 1995)
  – Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva, Poggio 2007)
  – Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas.

There are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model; here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model; these are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model.

I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w/ V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations.
V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78, no parameter fitting.
(Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.)… Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007. A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio.

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al 1999) vs. C2 units.
Reference (fixed), probe (varying); x-axis: response(probe) - response(reference); y-axis: resp(pair) - resp(reference).
Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone.
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin, and one should get a coefficient > 0.

Agreement w/ IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
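The second remark (a plain linear classifier between IT and "PFC", as in the read-out experiments) can be illustrated on synthetic data. The features and labels below are random stand-ins for C2/IT-like responses, not model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a population of selective, invariant features
X = rng.normal(size=(500, 50))
w_true = rng.normal(size=50)
y = np.where(X @ w_true > 0, 1.0, -1.0)      # two categories

# One regularized linear (least-squares) classifier on top of the features
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(50), X.T @ y)
accuracy = float(np.mean(np.sign(X @ w) == y))
```

If the feature dictionary is good, i.e. the classes are close to linearly separable in feature space, even this simplest possible readout suffices, which is the point of the remark.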

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not? Image 20 ms; ISI 30 ms (image-mask interval); mask (1/f noise) 80 ms.
(Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005)

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

hellipsolves the problem (when mask forces feedforward processing)hellip

Human observers (n = 24): 80%. Model: 82%.
(Serre, Oliva & Poggio 2007)

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
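The d' quoted on the slide is the signal-detection sensitivity index, d' = z(hit rate) - z(false-alarm rate), computable with the standard library alone. The 80%/16% rates below are only an example, combining the hit rate on the slide with the false-alarm figure mentioned later in the notes:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index: difference of the z-transformed hit and
    false-alarm rates; unlike raw accuracy it is bias-free."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# e.g. 80% hits, 16% false alarms
sensitivity = d_prime(0.80, 0.16)
```

A higher hit rate at the same false-alarm rate raises d', which is why the slide can compare model and humans on a single number.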

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse it into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object     | Labeled examples
car        | 5799
pedestrian | 1449
bicycle    | 209
building   | 5067
tree       | 4932
road       | 3400
sky        | 2562

3547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al 2004)
• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)
(Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005)

Simple cells learn correlation in space (at the same time); complex cells learn correlation in time.

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; a developmental-like learning stage.
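A classical way to learn invariance from temporal continuity is a trace rule (Foldiak 1991, cited on the slide): the Hebbian update is gated by a temporally low-pass-filtered trace of the unit's activity, so stimuli seen in close succession (e.g. a translating object) get wired to the same unit. In this toy sketch the learning rate, trace decay, and the constant drive keeping the unit active during the sweep are all illustrative assumptions:

```python
import numpy as np

def trace_rule(frames, eta=0.1, delta=0.8):
    """Learn one invariant unit from a temporal sequence of inputs.
    'trace' is a leaky average of the unit's activity; the weight update
    is Hebbian but gated by the trace rather than the instantaneous
    response, which binds temporally adjacent inputs together."""
    w = np.zeros(frames.shape[1])
    trace = 0.0
    for x in frames:
        y = float(w @ x) + 1.0            # +1: external drive, unit stays active
        trace = delta * trace + (1.0 - delta) * y
        w = w + eta * trace * x           # trace-gated Hebbian update
        w = w / max(float(np.linalg.norm(w)), 1e-12)
    return w

# A bar sweeping across four positions, one frame per position
w = trace_rule(np.eye(4))
```

After the sweep the unit carries positive weight on every position the stimulus traversed: a position-tolerant, complex-like cell learned with no labels, i.e. the "correlation in time" the slide refers to.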

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50-100 frames. Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.
Dataset from (Blank et al 2005). See also (Casile & Giese 2005; Sigala et al 2005).

Previous work: recognizing biological motion using a model of the dorsal stream. Adapted from (Giese & Poggio 2003).

Multi-class recognition accuracy

Dataset     | Baseline | Our system
KTH Human   | 81.3     | 91.6
UCSD Mice   | 75.6     | 79.0
Weiz. Human | 86.7     | 96.3
Average     | 81.2     | 89.6

(chance: 10-20%)
H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions: 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition.
(Serre, Oliva and Poggio, PNAS 2007)

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based, top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al).
Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection, scanning vs. attention: detection accuracy vs. number of items.

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream, where neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values.
(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

• Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
• Comparison w/ humans: A. Oliva
• Action recognition: H. Jhuang
• Attention: S. Chikkerur, C. Koch, D. Walther
• Computer vision: S. Bileschi, L. Wolf
• Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 39: NIPS2007: visual recognition in primates and machines

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

Source Lennie

Maunsell Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field)

(Reynolds et al 1999)

C2 units

Reference (fixed)

Probe (varying)

x-axis: response(probe) - response(reference)
y-axis: response(pair) - response(reference)

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away, the line should pass through the origin and should have a coefficient > 0.

Agreement w/ IT read-out data

Hung, Kreiman, Poggio & DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT (which is where they recorded in the study): from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

• In the theory, the stage between IT and "PFC" is a linear classifier - like the one used in the read-out experiments

• The inputs to IT are a large dictionary of selective and invariant features
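The RBF-plus-linear-readout scheme described in these remarks can be sketched in a few lines. This is a toy illustration, not the actual model: the random "IT-like" features, the choice of centers, and the least-squares readout below are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an IT-like feature population: two classes of 100-d patterns.
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1] += 1.0                      # shift class 1 so the task is learnable

centers = X[::4]                      # RBF centers = a subset of stored templates

def rbf_layer(X, centers, sigma=10.0):
    """Gaussian RBF stage: each unit is tuned to one stored template."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Linear readout on top of the RBF population, trained by least squares
# (any linear classifier on the population would do, as in the read-out
# experiments).
F = np.c_[rbf_layer(X, centers), np.ones(len(X))]
w, *_ = np.linalg.lstsq(F, 2.0 * y - 1.0, rcond=None)
accuracy = ((F @ w > 0) == (y == 1)).mean()
print(accuracy)
```

The point of the sketch is the architecture (Gaussian-tuned template units feeding a linear classifier), not the numbers, which depend entirely on the synthetic data.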

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms

Interval image-mask (ISI): 30 ms

Mask: 1/f noise, 80 ms

Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005

Presenter
Presentation Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%

Model: 82%

Serre, Oliva & Poggio 2007

• d' ~ standardized error rate
• the higher the d', the better the performance

Human: 80%
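For reference, d' is computed from hit and false-alarm rates via the inverse normal CDF. The exact rates below are only indicative: the talk's notes mention false-alarm rates around 16% for humans on this task, and ~80% overall accuracy.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate).
    Unlike raw accuracy, d' separates discriminability from response bias."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Indicative values only: ~80% hits with ~16% false alarms.
print(round(d_prime(0.80, 0.16), 2))
```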

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

• Image-by-image correlation:
- Heads: ρ = 0.71
- Close-body: ρ = 0.84
- Medium-body: ρ = 0.71
- Far-body: ρ = 0.60

• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse the image into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and roads.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled Examples:  5799   1449        209      5067      4932   3400   2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)

• Part-based system (Leibe et al 2004)

• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture

• Extensions to color and stereo

• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?

• Extension to time and videos

• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?

• A challenge for physiology: disprove basic aspects of the architecture

• Extensions to color and stereo

• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?

• Extension to time and videos

• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
- Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels

• Task-specific circuits (from IT to PFC)
- Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
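One classic formalization of this idea is the trace rule of Foldiak (1991), cited on the slide (the actual work with Masquelier & Thorpe uses spike-timing-dependent plasticity; the rule, parameters, and toy "movie" below are illustrative assumptions only):

```python
import numpy as np

def trace_rule(frames, eta=0.1, delta=0.8):
    """Foldiak (1991)-style trace rule: the Hebbian update follows a
    temporally smoothed (trace) activity, so inputs occurring in close
    temporal succession -- the same feature at shifted positions -- get
    wired to the same 'complex' unit."""
    n = frames.shape[1]
    w = np.zeros(n)
    w[0] = 1.0                          # unit starts tuned to position 0 only
    trace = 0.0
    for x in frames:                    # one frame of afferent activity
        y = float(w @ x)                # unit response
        trace = delta * trace + (1.0 - delta) * y
        w += eta * trace * (x - w)      # pull weights toward recent inputs
        w /= w.sum()                    # keep weights normalized
    return w

# A "movie" of the same bar sweeping across 4 positions, over and over:
frames = np.tile(np.eye(4), (50, 1))
w = trace_rule(frames)
print(np.round(w, 2))  # tuning has spread across all 4 positions
```

Because the trace carries activity over from one frame to the next, positions that follow each other in time end up sharing the unit's tuning, which is exactly the position invariance a complex cell needs.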

What is next?

• A challenge for physiology: disprove basic aspects of the architecture

• Extensions to color and stereo

• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?

• Extension to time and videos

• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50~100 frames

bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al 2005); see also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

              Baseline   Our system
KTH Human      81.3       91.6
UCSD Mice      75.6       79.0
Weiz. Human    86.7       96.3
Average        81.2       89.6

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has ~8% better performance on average.

What is next? Beyond feedforward models: limitations

Psychophysics: Serre, Oliva & Poggio 2007
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next? Beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult

• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)

• Notice: this is a "novel" justification for the need of attention

model

20 ms SOA (ISI = 0 ms) / 50 ms SOA (ISI = 30 ms) / 80 ms SOA (ISI = 60 ms) / no mask condition

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, the model is not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work…

Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al)

Chikkerur, Serre, Walther, Koch & Poggio, in prep

Face detection: scanning vs. attention (plot: detection accuracy vs. number of items)

Example Results

What is next?

• Image inference: attentional or Bayesian models

• Why hierarchies? Beyond a model, towards a theory

• Against hierarchies and the ventral stream: subcortical pathways

What is next? Image inference: backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better

• Normal vision is much more than categorization or identification: image understanding/inference/parsing

• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance

• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
-- Which biophysical mechanisms for routing/gating?
-- Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?

• Image inference: attentional or Bayesian models

• Why hierarchies? Beyond a model, towards a theory

• Against hierarchies and the ventral stream: subcortical pathways (Bar et al 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

The Mathematics of Learning: Dealing with Data
Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences

…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

Page 40: NIPS2007: visual recognition in primates and machines

Source: Lennie; Maunsell; Movshon

The ventral stream

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)

• As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type   Computation                       Pooling operation
Simple      Selectivity (template matching)   Gaussian tuning (AND-like)
Complex     Invariance                        Soft-max (OR-like)

Simple units: Gaussian-like tuning operation (AND-like)
Complex units: max-like operation (OR-like)
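The two operations can be written down directly. A minimal sketch (the dimensionality, template, and σ below are arbitrary illustrative choices, not the model's actual parameters):

```python
import numpy as np

def simple_unit(x, w, sigma=1.0):
    """Gaussian tuning (AND-like): response peaks when the afferent
    pattern x matches the stored template w."""
    return float(np.exp(-((x - w) ** 2).sum() / (2.0 * sigma ** 2)))

def complex_unit(afferents):
    """MAX pooling (OR-like): respond as strongly as the best afferent,
    gaining tolerance to where/at what scale the feature appears."""
    return max(afferents)

template = np.array([1.0, 0.0, 1.0])
# The same feature seen at two "positions": a perfect and a poor match.
s1 = simple_unit(np.array([1.0, 0.0, 1.0]), template)
s2 = simple_unit(np.array([0.1, 0.9, 0.2]), template)
print(complex_unit([s1, s2]))  # 1.0 -- driven by the best-matching afferent
```

The complex unit inherits its selectivity from its best-matching afferent while discarding the afferent's position/scale, which is the division of labor the table above describes.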

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch & Poggio, in prep; Kouh & Poggio 2007; Knoblich, Bouvrie & Poggio 2007)

Biophysical implementation

• Max- and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as a model of MT (Rust et al, Nature Neuroscience 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)
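One published form of such a canonical circuit (Kouh & Poggio, cited above) is a divisive normalization, y = (Σ_j w_j x_j^p) / (k + (Σ_j x_j^q)^r), where different exponents turn the same circuit into either operation. A sketch; the exponent choices and test vector are illustrative:

```python
import numpy as np

def canonical_circuit(x, w, p, q, r, k=1e-12):
    """y = (sum_j w_j x_j^p) / (k + (sum_j x_j^q)^r): one divisive-
    normalization circuit that exponents turn into tuning or max."""
    x = np.asarray(x, dtype=float)
    return float((w * x ** p).sum() / (k + (x ** q).sum() ** r))

x = np.array([0.1, 0.9, 0.3])

# Tuning regime (p=1, q=2, r=1/2): a normalized dot product, maximal
# when x is parallel to the stored template w.
w = x / np.linalg.norm(x)
print(canonical_circuit(x, w, p=1, q=2, r=0.5))         # ~1.0 (perfect match)

# Max regime (uniform w, p=q+1, r=1): a softmax approaching max(x)
# as q grows.
print(canonical_circuit(x, np.ones(3), p=9, q=8, r=1))  # ~0.9 = max(x)
```

The appeal of this form is exactly the slide's point: a single biophysically plausible operation (shunting/divisive inhibition) can yield both the AND-like and the OR-like computation.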

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
- Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
- Supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)

• Combination of V1-like complex units at different orientations

• Synaptic weights w learned from natural images

• 5-10 subunits chosen at random from all possible afferents (~100-1000)

(figure legend: stronger facilitation / stronger suppression)

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007; doi:10.1038/nn1975

Neurons in monkey visual area V2 encode combinations of orientations
Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

• Local pooling over S2 units with same selectivity but slightly different positions and scales

• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
- From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
- From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)

• "Replication" of simpler selectivities from lower to higher areas

• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

• V1
- Simple and complex cells tuning (Schiller et al 1976; Hubel & Wiesel 1965; De Valois et al 1982)
- MAX operation in subset of complex cells (Lampl et al 2004)

• V4
- Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
- MAX operation (Gawne et al 2002)
- Two-spot interaction (Freiwald et al 2005)
- Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
- Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)

• IT
- Tuning and invariance properties (Logothetis et al 1995)
- Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
- Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
- Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)

• Human
- Rapid categorization (Serre, Oliva & Poggio 2007)
- Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Comparison w/ neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next? Image inference: backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related, challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next? Beyond feedforward models: limitations
  • What is next? Beyond feedforward models: limitations
  • Limitations: beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next? Image inference: backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 41: NIPS2007: visual recognition in primates and machines

(Thorpe and Fabre-Thorpe 2001)

We consider feedforward architecture only

Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"

• It is in the family of "Hubel-Wiesel" models (Hubel & Wiesel 1959; Fukushima 1980; Oram & Perrett 1993; Wallis & Rolls 1997; Riesenhuber & Poggio 1999; Thorpe 2002; Ullman et al. 2002; Mel 1997; Wersing and Koerner 2003; LeCun et al. 1998; Amit & Mascaro 2003; Deco & Rolls 2006; …)
• As a biological model of object recognition in the ventral stream, it is perhaps the most quantitative and faithful to known biology (though many details/facts are unknown or still to be incorporated)

Two key computations

Unit type | Pooling computation             | Operation
Simple    | selectivity / template matching | Gaussian-like tuning (AND-like)
Complex   | invariance                      | soft-max (OR-like)

[Figure: simple units perform a Gaussian-like tuning operation (AND-like); complex units perform a max-like pooling operation (OR-like)]
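The two operations above can be written down in a few lines. A minimal numerical sketch (function names are mine; the Gaussian-tuning and soft-max forms follow those used in Serre et al. 2005):

```python
import numpy as np

def simple_unit(x, w, sigma=1.0):
    """AND-like selectivity: Gaussian (template-matching) tuning.

    Response peaks when the afferent vector x matches the template w.
    """
    return np.exp(-np.sum((np.asarray(x) - np.asarray(w)) ** 2)
                  / (2.0 * sigma ** 2))

def complex_unit(x, q=4.0, k=1e-6):
    """OR-like invariance: soft-max pooling over afferents.

    For large q this approaches max(x); for small q it approaches a
    normalized sum.
    """
    x = np.asarray(x, dtype=float)
    return np.sum(x ** (q + 1)) / (k + np.sum(x ** q))

# A simple unit responds maximally to its own template...
w = np.array([0.2, 0.9, 0.1])
print(simple_unit(w, w))               # exactly 1.0 at the template

# ...and a complex unit is driven by its strongest afferent, regardless
# of which position/scale it came from.
print(complex_unit([0.1, 0.9, 0.2]))   # close to max = 0.9
```

Pooling the same template response over positions and scales with `complex_unit` is what trades selectivity (simple units) for invariance (complex units).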

Gaussian tuning

• Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)
• Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

• Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)
• Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys. implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter notes: ADD CABLE AND CITE RELEVANT PEOPLE. Of the same form as model of MT (Rust et al., Nature Neuroscience 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002). Adelson and Bergen (see also Hassenstein and Reichardt 1956).
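The "same canonical circuit" claim can be illustrated with the divisive-normalization form used by Kouh & Poggio (2007): y = Σⱼ wⱼ xⱼᵖ / (k + (Σⱼ xⱼ^q)^r). Different exponent settings move the circuit between tuning-like and max-like regimes. A sketch (the exponents and constants below are illustrative choices, not fitted values):

```python
import numpy as np

def canonical_circuit(x, w, p=1.0, q=2.0, r=0.5, k=1e-6):
    """One divisive-normalization (shunting-inhibition-like) circuit:
    y = sum_j w_j x_j**p / (k + (sum_j x_j**q)**r)."""
    x = np.asarray(x, dtype=float)
    return np.dot(w, x ** p) / (k + np.sum(x ** q) ** r)

x = np.array([0.1, 0.8, 0.3])
w = x / np.linalg.norm(x)     # template aligned with the input

# Tuning-like regime (p=1, q=2, r=1/2): a normalized dot product that,
# like Gaussian tuning, peaks when x is parallel to the template w.
print(canonical_circuit(x, w))                      # ~1, the peak response

# Max-like regime (p=q+1, uniform weights, r=1): soft-max pooling.
print(canonical_circuit(x, np.ones_like(x), p=5, q=4, r=1.0))  # ~max(x)
```

The appeal of this form is that one biophysically plausible mechanism (divisive normalization) covers both of the model's two key computations.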

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  - Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  - Supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

[Figure: stronger facilitation vs. stronger suppression of subunits]

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen. Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007; doi:10.1038/nn1975.
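The unsupervised "developmental" stage described above can be caricatured as imprinting: each S2 template is a patch of C1 activity sampled from a natural image, with only a few randomly chosen subunits kept. A hypothetical sketch (random arrays stand in for real C1 responses, and all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def imprint_s2_templates(c1_maps, n_templates=10, patch=3, n_subunits=5):
    """Sample S2 templates from C1 responses to (stand-ins for) natural images.

    c1_maps: array (n_images, n_orientations, H, W) of C1 responses.
    Each template is a patch of multi-orientation C1 activity with only a few
    randomly chosen afferent subunits kept (the rest zeroed), mimicking the
    sparse 5-10 subunit connectivity on the slide.
    """
    n_img, n_ori, h, w = c1_maps.shape
    templates = []
    for _ in range(n_templates):
        i = rng.integers(n_img)
        y = rng.integers(h - patch + 1)
        x = rng.integers(w - patch + 1)
        t = c1_maps[i, :, y:y + patch, x:x + patch].copy()
        keep = np.zeros(t.size, dtype=bool)
        keep[rng.choice(t.size, size=n_subunits, replace=False)] = True
        t[~keep.reshape(t.shape)] = 0.0        # keep only a few subunits
        templates.append(t)
    return np.stack(templates)

# Stand-in C1 responses (real ones would come from Gabor filtering images).
fake_c1 = rng.random((4, 4, 16, 16))
T = imprint_s2_templates(fake_c1)
print(T.shape)  # (10, 4, 3, 3): 10 templates, 4 orientations, 3x3 patch
```

At run time each such template would serve as the center w of a Gaussian-tuned S2 unit.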

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter notes: Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter notes: In an even more extreme version… Later I am going to show that this is important.

Comparison w/ neural data

• V1
  - Simple and complex cells: tuning (Schiller et al. 1976; Hubel & Wiesel 1965; Devalois et al. 1982)
  - MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  - Rapid categorization (Serre, Oliva, Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model; here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model. These are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w/ V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

[Figure: V4 neuron tuned to boundary conformations (Pasupathy & Connor 1999) vs. most similar model C2 unit; ρ = 0.78; no parameter fitting]

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio. J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.

Presenter notes: More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units

Reference stimulus (fixed); probe stimulus (varying)

[Figure axes: x = response(probe) - response(reference); y = resp(pair) - resp(reference)]

Prediction: the response to the pair falls between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: The biased competition model predicts that with attention away, the line should pass through the origin and one should get a > 0 coefficient.

Agreement w/ IT Readout data

Hung, Kreiman, Poggio & DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes: Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
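The IT-to-"PFC" stage can be made concrete as a regularized linear readout trained on a population of IT-like feature responses, in the spirit of the Hung et al. readout experiments. A sketch on synthetic data (the features and labels below are made up; this is not the actual analysis code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for an IT/C2 population response: two categories whose
# mean activity patterns differ, plus trial-to-trial noise.
n, d = 200, 50
labels = rng.integers(0, 2, size=n)               # 0/1 category labels
prototypes = rng.normal(size=(2, d))              # one mean pattern per class
features = prototypes[labels] + 0.5 * rng.normal(size=(n, d))

# Ridge-regularized linear readout, solved via the normal equations.
X = np.hstack([features, np.ones((n, 1))])        # append a bias column
y = 2.0 * labels - 1.0                            # +-1 targets
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ y)

train_acc = np.mean((X @ w > 0) == (y > 0))
print(f"training accuracy: {train_acc:.2f}")      # well above chance (0.5)
```

The point of the remark is architectural: once the dictionary of selective, invariant features is good, a simple linear stage suffices for categorization.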

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter notes: Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

• Image: 20 ms
• Interval image-mask (ISI): 30 ms
• Mask (1/f noise): 80 ms

(Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005)

Presenter notes: We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

• Human observers (n = 24): 80%
• Model: 82%

(Serre, Oliva & Poggio 2007)

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter notes: We tried simpler computer vision systems and they could not do the job.
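The d' measure referred to above is the standard signal-detection sensitivity index, computed from hit and false-alarm rates via the inverse normal CDF. A sketch (the 80%/16% numbers below echo the accuracy and false-alarm figures mentioned on these slides, used here only as an example):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Roughly the regime of the animal/non-animal task:
# ~80% hits with ~16% false alarms.
print(round(d_prime(0.80, 0.16), 2))  # 1.84
```

Unlike raw percent correct, d' separates sensitivity from response bias, which is why it is the preferred comparison between model and observers.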

Further comparisons

• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

(Serre, Oliva & Poggio, PNAS 2007)

The street scene project

Source: Bileschi & Wolf

Presenter notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse it into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, and classify each window as being one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object      | Labeled examples
car         | 5799
pedestrian  | 1449
bicycle     | 209
building    | 5067
tree        | 4932
road        | 3400
sky         | 2562

3547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter notes: TODO: Add 420 True Labels to this slide.
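The windowed classification scheme described in the notes (features over windows at all positions and scales, each window labeled as one of the object classes) can be sketched as a sliding-window scan. Everything below is a hypothetical stand-in: the histogram "features" and untrained random weights only mark where the model's C2 features and a trained classifier would go.

```python
import numpy as np

CLASSES = ["car", "pedestrian", "bicycle", "building", "tree", "road", "sky"]

def window_features(patch):
    """Stand-in for the model's features on one window (here: a histogram)."""
    return np.histogram(patch, bins=8, range=(0.0, 1.0))[0].astype(float)

def classify(features, weight_matrix):
    """Linear multi-class scores -> predicted class label."""
    return CLASSES[int(np.argmax(weight_matrix @ features))]

def scan(image, weight_matrix, win=32, stride=16):
    """Slide a window over the image and label every position."""
    labels = []
    for y in range(0, image.shape[0] - win + 1, stride):
        for x in range(0, image.shape[1] - win + 1, stride):
            f = window_features(image[y:y + win, x:x + win])
            labels.append(((y, x), classify(f, weight_matrix)))
    return labels

rng = np.random.default_rng(0)
W = rng.normal(size=(len(CLASSES), 8))   # untrained placeholder weights
img = rng.random((64, 64))               # stand-in grayscale image
out = scan(img, W)
print(len(out))  # 3 positions per axis -> 9 windows
```

Scanning at several image scales (a resolution pyramid) would extend the same loop to size-invariant detection.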

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

(Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007)

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  - Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

• Simple cells learn correlation in space (at the same time)
• Complex cells learn correlation in time

[SHOW MOVIE]

Presenter notes: ~19 hours of video, ~10^6 frames; developmental-like learning stage.
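The "correlation in time" idea cited above (Foldiak 1991) is usually written as a trace rule: the Hebbian update is gated by a low-pass-filtered (trace) output, so inputs that follow one another in time get wired onto the same complex cell. A minimal sketch; the learning constants, and the simplifying assumption that the cell stays active throughout a sweep, are mine:

```python
import numpy as np

def trace_rule(inputs, eta=0.1, delta=0.2):
    """Foldiak (1991)-style trace rule for one complex cell.

    inputs: sequence of afferent activity vectors, e.g. the same bar
    drifting across neighboring positions in consecutive frames.
    """
    w = np.zeros(inputs.shape[1])
    trace = 0.0
    for x in inputs:
        y = float(w @ x) + 1.0                 # +1: cell active during the sweep
        trace = (1.0 - delta) * trace + delta * y
        w += eta * trace * x                   # Hebbian term gated by the trace
    return w / np.linalg.norm(w)               # normalize to bound the weights

# A "bar" sweeping across 5 positions: one-hot input per frame.
frames = np.eye(5)
w = trace_rule(frames)
print(np.round(w, 2))  # every position ends up connected to the cell
```

After the sweep the cell pools all five positions, i.e. it has learned position invariance purely from temporal continuity, with no labels.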

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem

Training videos / testing videos; each video ~4 s, 50~100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset     | Baseline | Our system
KTH Human   | 81.3     | 91.6
UCSD Mice   | 75.6     | 79.0
Weiz. Human | 86.7     | 96.3
Average     | 81.2     | 89.6

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter notes: Our method has 8% better performance on average.

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 42: NIPS2007: visual recognition in primates and machines

Our present model of the ventral stream feedforward accounting only for ldquoimmediate

recognitionrdquo

bull

It is in the family of ldquoHubel-Wieselrdquo

models (Hubel amp Wiesel 1959 Fukushima 1980 Oram

amp Perrett 1993 Wallis amp Rolls 1997 Riesenhuber amp Poggio 1999 Thorpe 2002 Ullman

et al 2002 Mel 1997 Wersing

and Koerner 2003 LeCun

et al 1998 Amit

amp Mascaro

2003 Deco amp Rolls 2006hellip)

bull

As a biological model of object recognition in the ventral stream it is perhaps the most quantitative and faithful to known biology (though many detailsfacts are unknown or still to be incorporated)

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Anzai, A., Peng, X. & Van Essen, D. C. Neurons in monkey visual area V2 encode combinations of orientations. Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

C2 units

- Same selectivity as S2 units but increased tolerance to position and size of the preferred stimulus
- Local pooling over S2 units with the same selectivity but slightly different positions and scales
- A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously, the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally.

One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale).

To a first approximation, this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

- Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
- "Replication" of simpler selectivities from lower to higher areas
- Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data

- V1
  - Simple and complex cells: tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  - MAX operation in a subset of complex cells (Lampl et al. 2004)
- V4
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
- IT
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
- Human
  - Rapid categorization (Serre, Oliva, Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas.

Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model: those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells.

Then there are the green data, which were already predicted by the original model and which are still consistent with the new model.

Then the red color indicates data which are consistent with the new model. These are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model.

I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78, with no parameter fitting.

(Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4: explain figures etc. … Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference stimulus (fixed); probe stimulus (varying).

Δprobe = response(probe) − response(reference)
Δpair = response(pair) − response(reference)

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away, the line should pass through the origin and the slope coefficient should be > 0.

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

- The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
- In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
- The inputs to IT are a large dictionary of selective and invariant features
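The IT-to-"PFC" stage can be sketched as a linear classifier over a dictionary of invariant features, in the spirit of the read-out experiments. The "C2/IT-like" feature vectors below are synthetic stand-ins (random prototypes plus noise), not model outputs; dimensions and the ridge constant are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for C2/IT-like feature vectors: two categories drawn
# around two random prototypes in a 256-dimensional feature space.
d, n = 256, 200
prototypes = rng.random((2, d))
labels = rng.integers(0, 2, size=n)
features = prototypes[labels] + 0.1 * rng.standard_normal((n, d))

# Linear readout trained by regularized least squares on +/-1 targets,
# playing the role of the task-specific IT-to-"PFC" classifier stage.
targets = 2.0 * labels - 1.0
w = np.linalg.solve(features.T @ features + 1e-3 * np.eye(d),
                    features.T @ targets)
predictions = (features @ w > 0).astype(int)
accuracy = float((predictions == labels).mean())
```

Because the feature stage has already absorbed position and scale variability, a plain linear readout suffices; this is the division of labor the slide's remarks describe.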

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms → image-mask interval (ISI): 30 ms → mask (1/f noise): 80 ms

(Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005)

Presenter
Presentation Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long.

MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%; model: 82% (Serre, Oliva & Poggio 2007)

- d′ ~ standardized error rate
- the higher the d′, the better the performance
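For reference, d′ for a yes/no task is the difference of the z-transformed hit and false-alarm rates. A sketch using only the standard library; the example rates are made up, not the experiment's.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """d' = z(H) - z(FA), with z the inverse of the standard normal CDF.
    Unlike raw percent correct, d' separates sensitivity from response bias."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# e.g. 80% hits with 20% false alarms gives d' of about 1.68
```

A d′ of 0 means performance at chance; larger d′ means better discrimination regardless of how liberal or conservative the observer's criterion is.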

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

- Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
- Model predicts level of performance on rotated images (90 deg and inversion)

(Serre, Oliva & Poggio, PNAS, 2007)

The street scene project (source: Bileschi & Wolf)

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse it into its main objects.

What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5,799  1,449       209      5,067     4,932  3,400  2,562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples


Baselines:

- HoG (Dalal & Triggs 2005)
- Part-based system (Leibe et al. 2004)
- Local patch correlation (Torralba et al. 2004)

(Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI, 2007)

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

What is next

- A challenge for physiology: disprove basic aspects of the architecture
- Extensions to color and stereo
- More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
- Extension to time and videos
- Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


- Generic dictionary of shape components (from V1 to IT)
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
- Task-specific circuits (from IT to PFC)
  - Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T. Masquelier & S. Thorpe (CNRS, France)

(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

Simple cells learn correlation in space (at the same time); complex cells learn correlation in time.

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; a developmental-like learning stage.
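A minimal sketch of such a trace rule (in the spirit of Földiák, 1991): a unit keeps a decaying memory trace of its recent activity, so inputs that follow each other in time get wired to the same (complex) unit. All constants, the one-hot "translating stimulus", and the function name are illustrative, not the actual learning rule used with the video data.

```python
import numpy as np

def trace_update(w, x, trace, y, eta=0.05, delta=0.8):
    """One step of a Foldiak-style trace rule. `trace` is a low-pass-filtered
    memory of postsynaptic activity; inputs arriving while the trace is still
    high are associated with the same unit."""
    trace = delta * trace + (1 - delta) * y
    w = w + eta * trace * (x - w)        # move weights toward recent inputs
    return w / np.linalg.norm(w), trace  # keep weights normalized

# A stimulus translating over time: frame t activates input t.
sequence = np.eye(5)
w = np.full(5, 1 / np.sqrt(5))   # start unselective
trace = 0.0
for x in sequence:
    y = float(w @ x)             # unit's response to the current frame
    w, trace = trace_update(w, x, trace, y)
```

Because the trace bridges consecutive frames, the unit's weights stay spread across the whole translating sequence instead of collapsing onto a single frame, which is the position invariance being learned.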


The problem: training videos, testing videos

Each video ~4 s, 50~100 frames. Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.

Dataset from (Blank et al. 2005). See also (Casile & Giese 2005; Sigala et al. 2005).

Previous work: recognizing biological motion using a model of the dorsal stream. Adapted from (Giese & Poggio 2003).

Multi-class recognition accuracy

Dataset       Baseline  Our system
KTH Human     81.3      91.6
UCSD Mice     75.6      79.0
Weiz. Human   86.7      96.3
Average       81.2      89.6

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average.

What is next: beyond feedforward models (limitations)

- Psychophysics: Serre, Oliva, Poggio 2007
- IT: Zoccolan, Kouh, Poggio, DiCarlo 2007
- V4: Reynolds, Chelazzi & Desimone 1999

- Recognition in clutter is increasingly difficult
- Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
- Notice this is a "novel" justification for the need of attention

(Figure: model vs. human performance at 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and the no-mask condition.)

(Serre Oliva and Poggio PNAS 2007)

Limitations: beyond 50 ms SOA, the model is not good enough.

Presenter
Presentation Notes
False-alarm rate ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work… attention and cortical feedbacks

- Model implementation of Wolfe's guided search (1994)
- Parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

(Chikkerur, Serre, Walther, Koch & Poggio, in prep.)

Face detection: scanning vs. attention

(Figure: detection accuracy vs. number of items.)

Example Results

What is next

- Image inference: attentional or Bayesian models
- Why hierarchies? Beyond a model: towards a theory
- Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

- Normal recognition by humans (for long times) is much better
- Normal vision is much more than categorization or identification: image understanding/inference/parsing
- Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
- Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next

- Image inference: attentional or Bayesian models
- Why hierarchies? Beyond a model: towards a theory
- Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

- Model: T. Serre, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber, A. Oliva
- Comparison w| humans: A. Oliva
- Action recognition: H. Jhuang
- Attention: S. Chikkerur, C. Koch, D. Walther
- Computer vision: S. Bileschi, L. Wolf
- Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 43: NIPS2007: visual recognition in primates and machines

Two key computations

Unit types Pooling Computation Operation

Simple Selectivity

template matching

Gaussian-

tuning and-like

Complex Invariance Soft-max or-like

Max-like operation (or-like)

Complex units

Gaussian-like tuning operation (and-like)

Simple units

Gaussian tuning

Gaussian tuning in IT around 3D views

Logothetis

Pauls amp Poggio 1995

Gaussian tuning in V1 for orientation

Hubel amp Wiesel 1958

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code


• Generic dictionary of shape components (from V1 to IT): unsupervised learning during a developmental-like stage, learning dictionaries of "templates" at different S levels

• Task-specific circuits (from IT to PFC): supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

with T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlations in space (at the same time)

Complex cells learn correlations in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage
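The temporal-continuity idea above can be sketched in a few lines. This is a toy illustration in the spirit of Foldiak's (1991) trace rule, not the actual Masquelier & Thorpe implementation; the function name, learning rates and the two-pixel "video" are all made up for the example.

```python
import numpy as np

def trace_rule_update(w, x, y_trace, y, alpha=0.1, eta=0.05):
    """One step of a Foldiak-style trace rule (toy sketch): the
    post-synaptic trace y_trace is a low-pass memory of recent activity,
    so inputs occurring close together in time strengthen the same
    weights -- the unit learns correlations in time."""
    y_trace = (1 - alpha) * y_trace + alpha * y   # leaky temporal trace
    w = w + eta * y_trace * (x - w)               # Hebbian step toward recent inputs
    return w, y_trace

# Toy "video": the same bar seen at two positions in consecutive frames.
frames = [np.array([1.0, 0.0]), np.array([0.0, 1.0])] * 50
w, y_trace = np.zeros(2), 0.0
for x in frames:
    y = float(w @ x) + 0.5        # crude response; the bias keeps learning going
    w, y_trace = trace_rule_update(w, x, y_trace, y)

print(w)  # both positions end up wired to the same unit
```

Because the trace bridges consecutive frames, both positions of the bar drive the same unit: the invariance is learned from temporal continuity rather than labels.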

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

each video ~4 s, 50-100 frames

bend, jack, jump

pjump, run, walk

side, wave1, wave2

Dataset from (Blank et al. 2005)

See also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (%):

Dataset        Baseline   Our system
KTH Human      81.3       91.6
UCSD Mice      75.6       79.0
Weiz. Human    86.7       96.3
Average        81.2       89.6
(chance: 10-20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Plot: model vs. human performance for 20 ms SOA (ISI=0 ms), 50 ms SOA (ISI=30 ms), 80 ms SOA (ISI=60 ms), and the no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, the model is not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and ~18% for the model

Ongoing work...

Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Plot: detection accuracy vs. number of items; face detection, scanning vs. attention]

Example Results

What is next

• Image inference: attentional or Bayesian models

• Why hierarchies? Beyond a model, towards a theory

• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better

• Normal vision is much more than categorization or identification: image understanding/inference/parsing

• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance

• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next

• Image inference: attentional or Bayesian models

• Why hierarchies? Beyond a model, towards a theory

• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, ...)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences...

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision... and the brain

Collaborators

Comparison with humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • ...solves the problem (if mask forces feedforward processing)...
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now...
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison with neural data
  • Comparison with V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement with IT read-out data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • ...solves the problem (when mask forces feedforward processing)...
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models, limitations
  • What is next: beyond feedforward models, limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 44: NIPS2007: visual recognition in primates and machines

Max-like operation (or-like): complex units

Gaussian-like tuning operation (and-like): simple units
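The two key computations can be sketched as follows. This is a minimal illustration of the two operations named above; the function names and parameter values are ours, not taken from the model code.

```python
import numpy as np

def gaussian_tuning(x, w, sigma=1.0):
    """Simple-unit ('and'-like) operation: Gaussian tuning of the
    input x around a stored template w; maximal when x == w."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(w)) ** 2)
                        / (2.0 * sigma ** 2)))

def max_pooling(responses):
    """Complex-unit ('or'-like) operation: select the strongest
    afferent in the pool, yielding tolerance to position/scale."""
    return max(responses)

template = np.array([0.2, 0.9, 0.1])
print(gaussian_tuning(template, template))     # 1.0: perfect match
print(gaussian_tuning(np.zeros(3), template))  # weaker for a mismatch
print(max_pooling([0.3, 0.95, 0.7]))           # 0.95
```

Tuning provides selectivity (the "and"), pooling provides invariance (the "or"); the model alternates the two up the hierarchy.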

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Priebe & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)

Biophys implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as the model of MT (Rust et al., Nature Neuroscience 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)
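One common way to write such a canonical circuit is a normalized polynomial (divisive normalization, which shunting inhibition can implement): for a high exponent it behaves like a max, while low exponents give graded, tuning-like pooling. A numerical sketch, with the function name and parameters chosen by us for illustration:

```python
import numpy as np

def canonical_circuit(x, q=2, k=1e-9):
    """Normalized polynomial with divisive normalization:
    sum(x^(q+1)) / (k + sum(x^q)). For large q this approaches
    max(x); for small q it stays in a graded pooling regime."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** (q + 1)) / (k + np.sum(x ** q)))

x = [0.2, 0.5, 1.0]
print(canonical_circuit(x, q=1))   # graded pooling, well below the max
print(canonical_circuit(x, q=20))  # close to max(x) = 1.0
```

The same circuit, with different exponents and weights, thus covers both operations of the previous slides, which is the point of the "single canonical circuit" claim.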

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation: unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC): supervised learning, ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1,000 types)

• Combination of V1-like complex units at different orientations

• Synaptic weights w learned from natural images

• 5-10 subunits chosen at random from all possible afferents (~100-1,000)

stronger facilitation

stronger suppression

Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen. Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus

• Local pooling over S2 units with the same selectivity but slightly different positions and scales

• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.

A loose hierarchy

• Bypass routes along with main routes:
  • From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  • From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)

• "Replication" of simpler selectivities from lower to higher areas

• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version... Later I am going to show that this is important.

• V1:
  • Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  • MAX operation in a subset of complex cells (Lampl et al. 2004)

• V4:
  • Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  • MAX operation (Gawne et al. 2002)
  • Two-spot interaction (Freiwald et al. 2005)
  • Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  • Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)

• IT:
  • Tuning and invariance properties (Logothetis et al. 1995)
  • Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  • Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  • Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)

• Human:
  • Rapid categorization (Serre, Oliva, Poggio 2007)
  • Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Comparison with neural data

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data used to actually constrain the model: V1 data used to build a population of V1 simple and complex units that mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red color indicates data consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and data which were collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison with V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (no parameter fitting)

Pasupathy & Connor 1999

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested...

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio. J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed), probe (varying)

x-axis: response(probe) - response(reference)
y-axis: response(pair) - response(reference)

Prediction: the response to the pair falls between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away, the line should pass through the origin and the coefficient should be > 0.

Agreement with IT read-out data

Hung, Kreiman, Poggio, DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in the study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well

• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments

• The inputs to IT are a large dictionary of selective and invariant features
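A linear read-out of the kind mentioned above can be sketched on synthetic data. Everything here (feature dimensions, noise level, the least-squares fit) is an assumption for illustration, not the actual experimental pipeline of the read-out studies.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for an IT population: two object categories,
# each a noisy cloud around its own mean feature vector.
n, d = 200, 50
mu_a, mu_b = rng.normal(size=d), rng.normal(size=d)
X = np.vstack([mu_a + 0.5 * rng.normal(size=(n, d)),
               mu_b + 0.5 * rng.normal(size=(n, d))])
y = np.hstack([np.ones(n), -np.ones(n)])

# Linear read-out: least-squares weights, decision thresholded at zero.
w = np.linalg.lstsq(X, y, rcond=None)[0]
acc = float(np.mean(np.sign(X @ w) == y))
print(acc)  # near-perfect on this easy synthetic set
```

The point of the slide is exactly this simplicity: once the features feeding the classifier are selective and invariant enough, a plain linear stage suffices.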

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult...

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms; image-mask interval (ISI): 30 ms; mask (1/f noise): 80 ms

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

...solves the problem (when mask forces feedforward processing)...

Human observers (n = 24): 80% correct; model: 82%

Serre, Oliva & Poggio 2007

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
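For reference, d' can be computed from hit and false-alarm rates with the standard signal-detection formula; the rates below are illustrative only, in the ballpark mentioned in the notes, not the exact published values.

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: z(hit rate) - z(false-alarm rate).
    Higher d' means better discrimination, independent of response bias."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Illustrative rates only: ~80% hits, ~16% false alarms.
print(round(d_prime(0.80, 0.16), 2))  # 1.84
```

Because d' standardizes away response bias, it allows the human and model error rates to be compared on a common scale.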

Further comparisons

• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60

• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models, limitations
  • What is next: beyond feedforward models, limitations
  • Limitations: beyond 50 ms the model is not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 45: NIPS2007: visual recognition in primates and machines

Gaussian tuning

Gaussian tuning in IT around 3D views (Logothetis, Pauls & Poggio 1995)

Gaussian tuning in V1 for orientation (Hubel & Wiesel 1958)

Max-like operation

Max-like behavior in V1 (Lampl, Ferster, Poggio & Riesenhuber 2004; see also Finn, Prieber & Ferster 2007)

Max-like behavior in V4 (Gawne & Martin 2002)

(Knoblich, Koch, Poggio, in prep.; Kouh & Poggio 2007; Knoblich, Bouvrie, Poggio 2007)
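The two key computations described above, Gaussian-like tuning (the S-unit operation) and max-like pooling (the C-unit operation), can be sketched in a few lines. This is a minimal illustration; the function names and the σ parameter are assumptions, not from the tutorial.

```python
import numpy as np

def s_unit(x, w, sigma=1.0):
    """Gaussian (bell-shaped) tuning: response peaks when the input
    pattern x matches the stored template w (cf. Logothetis et al. 1995)."""
    return np.exp(-np.sum((x - w) ** 2) / (2 * sigma ** 2))

def c_unit(afferents):
    """Max-like pooling over afferent S-unit responses: the strongest
    input dominates (cf. Lampl et al. 2004; Gawne & Martin 2002)."""
    return np.max(afferents)

template = np.array([1.0, 0.5, 0.0])
print(s_unit(template, template))     # exact match -> 1.0
print(s_unit(np.zeros(3), template))  # mismatch -> smaller response
print(c_unit([0.2, 0.9, 0.4]))        # -> 0.9
```

Tuning provides selectivity for a specific pattern; the max provides tolerance, since the best-matching afferent dominates regardless of where the preferred pattern falls.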

Biophysical implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter notes: Of the same form as the model of MT (Rust et al., Nature Neuroscience, 2007). Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike-threshold variability (Anderson et al. 2000; Miller and Troyer 2002). See Adelson and Bergen (see also Hassenstein and Reichardt 1956).
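A numerical illustration of the claim that one canonical divisive-normalization circuit can approximate both operations, loosely following the static nonlinearity of Kouh & Poggio (2007): one choice of exponents gives a soft-max, another gives a normalized dot product (Gaussian-like tuning). The exponents and constants below are illustrative assumptions.

```python
import numpy as np

def softmax_pool(x, q=8, k=1e-9):
    """Divisive normalization as a soft max:
    sum(x**(q+1)) / (k + sum(x**q)) approaches max(x) as q grows."""
    x = np.asarray(x, dtype=float)
    return np.sum(x ** (q + 1)) / (k + np.sum(x ** q))

def normalized_tuning(x, w, k=1e-9):
    """Divisive normalization as tuning: a normalized dot product,
    peaking when the input x is parallel to the template w."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    return np.dot(w, x) / (k + np.linalg.norm(x) * np.linalg.norm(w))

x = [0.1, 0.9, 0.3]
print(softmax_pool(x))            # close to max(x) = 0.9
w = np.array([1.0, 2.0, 3.0])
print(normalized_tuning(w, w))    # parallel input -> close to 1.0
```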

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1,000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1,000)

(Figure legend: stronger facilitation vs. stronger suppression)

"Neurons in monkey visual area V2 encode combinations of orientations", Akiyuki Anzai, Xinmiao Peng & David C. Van Essen, Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter notes: Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell's subunits. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.
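A toy 1-D sketch of the S2/C2 stage described above: S2 applies Gaussian template matching against C1 afferents at each position (selectivity), and C2 takes the max over positions (tolerance). The array sizes, random seed, and template-sampling scheme are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def s2_map(c1_map, template, sigma=1.0):
    """S2: slide a learned template over a C1 response map,
    applying Gaussian tuning at every position (selectivity)."""
    h = template.shape[0]
    out = np.empty(c1_map.shape[0] - h + 1)
    for i in range(out.shape[0]):
        patch = c1_map[i:i + h]
        out[i] = np.exp(-np.sum((patch - template) ** 2) / (2 * sigma ** 2))
    return out

def c2_response(s2_responses):
    """C2: max over positions, so the selectivity is that of the S2
    template, with added tolerance to where the stimulus appears."""
    return np.max(s2_responses)

c1 = rng.random(32)           # toy 1-D C1 activation profile
template = c1[10:14].copy()   # 'learned' from an image patch at position 10
resp = s2_map(c1, template)
print(np.argmax(resp))        # -> 10: S2 responds where the template was sampled
shifted = np.roll(c1, 5)      # same pattern, shifted position
print(np.isclose(c2_response(s2_map(shifted, template)), 1.0))  # True: C2 tolerates the shift
```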

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter notes: In an even more extreme version... Later I am going to show that this is important.

Comparison w/ neural data

• V1
  – Simple and complex cells: tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  – MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – Two-spot interaction (Freiwald et al. 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  – Tuning and invariance properties (Logothetis et al. 1995)
  – Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva, Poggio 2007)
  – Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in the corresponding cortical areas. There are three colors for three types of data. Yellow is for data used to actually constrain the model: V1 data used to build a population of V1 simple and complex units that mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red data are consistent with the new model; in some cases these already existed and were provided to us by the authors of the corresponding studies, and some were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I will show a few key examples.

Comparison w/ V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations.

V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78, with no parameter fitting (Pasupathy & Connor 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005).

Presenter notes: Finally, I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

"A Model of V4 Shape Selectivity and Invariance", Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio, J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.

Presenter notes: More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units

Reference stimulus (fixed), probe (varying); axes: response(probe) - response(reference) vs. resp(pair) - resp(reference).

Prediction: the response to the pair should fall between the responses elicited by the two stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: The biased-competition model predicts that with attention away the line should pass through the origin and the coefficient should be > 0.

Agreement w/ IT readout data

(Hung, Kreiman, Poggio, DiCarlo 2005; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes: Here is an example: the readout data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT, which is where they recorded in the study, one can already read out object category and identity from the very first spikes.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the readout experiments
• The inputs to IT are a large dictionary of selective and invariant features
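The linear read-out stage mentioned in the remarks can be sketched as a regularized least-squares classifier on toy "IT-like" feature vectors. The data here are synthetic and linearly separable by construction (an assumption for illustration, not the actual recordings); a linear SVM, as used in the read-out experiments, would behave similarly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 'IT population' responses: 100-d feature vectors for 200 stimuli,
# with category determined by a linear function of the features
n, d = 200, 100
X = rng.standard_normal((n, d))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)   # labels in {-1, +1}

# Linear read-out via regularized least squares
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

pred = np.sign(X @ w)
print((pred == y).mean())   # fraction correct on the training set
```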

Rapid categorization

[Show animal / non-animal movie]

Database collected by Oliva & Torralba

Presenter notes: Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Sequence: image (20 ms), blank interval between image and mask (ISI, 30 ms), 1/f-noise mask (80 ms).

(Thorpe et al. 1996; VanRullen & Koch 2003; Bacon-Mace et al. 2005)

Presenter notes: We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80% correct. Model: 82% correct. (Serre, Oliva & Poggio 2007)

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter notes: We tried simpler computer vision systems and they could not do the job.
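Under standard equal-variance signal-detection assumptions, the d' quoted on the slide is z(hit rate) - z(false-alarm rate), where z is the inverse normal CDF. A sketch with illustrative rates (roughly the 80% hits quoted above and the ~16% false alarms mentioned later in the notes; not the paper's exact figures):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: z(H) - z(FA).
    Higher d' means better discrimination, independent of response bias."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

print(round(d_prime(0.80, 0.16), 2))   # illustrative: ~80% hits, ~16% false alarms
```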

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts the level of performance on rotated images (90 deg and inversion)

(Serre, Oliva & Poggio, PNAS, 2007)

The street scene project

Source: Bileschi & Wolf

Presenter notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left and the system would automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and roads.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled examples:  5,799  1,449       209      5,067     4,932  3,400  2,562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter notes: TODO: add 420 true labels to this slide.

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

(Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI, 2007)

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
(w/ T. Masquelier & S. Thorpe, CNRS, France)

(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

Simple cells learn correlation in space (at the same time); complex cells learn correlation in time.

[Show movie]

Presenter notes: ~19 hours of video, ~10^6 frames; a developmental-like learning stage.
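The idea that complex cells "learn correlation in time" can be sketched with a Foldiak-style trace rule: the Hebbian update uses a temporally low-passed activity trace instead of the instantaneous response, so inputs occurring in close succession (the same object sweeping across positions) become wired to the same unit. All parameters and the one-hot toy inputs below are illustrative assumptions.

```python
import numpy as np

def trace_rule(inputs, w, eta=0.1, delta=0.8):
    """Foldiak (1991)-style trace learning: the trace integrates past
    activity, so temporally adjacent inputs reinforce the same weights."""
    trace = 0.0
    for x in inputs:                  # a temporal sequence of input vectors
        y = float(w @ x)              # instantaneous response
        trace = delta * trace + (1 - delta) * y
        w = w + eta * trace * x       # Hebbian update with the trace, not y
        w = w / np.linalg.norm(w)     # keep the weight vector bounded
    return w

# Toy sequence: one 'object' sweeping across 4 positions (one-hot inputs)
seq = [np.eye(4)[i] for i in (0, 1, 2, 3)]
w0 = np.array([1.0, 0.0, 0.0, 0.0])   # initially tuned to position 0 only
w = trace_rule(seq * 5, w0)           # several sweeps of the sequence
print(np.round(w, 2))                 # all positions now carry positive weight
```

After training, the unit responds (to some degree) at every position the object visited, which is the position tolerance that complex cells exhibit.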


The problem

Training videos and testing videos; each video is ~4 s, 50-100 frames.

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005).

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset      Baseline  Our system
KTH Human    81.3%     91.6%
UCSD Mice    75.6%     79.0%
Weiz. Human  86.7%     96.3%
Average      81.2%     89.6%

Chance: 10-20%. (H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007)

Presenter notes: Our method has 8% better performance on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva, Poggio 2007
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Note: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and a no-mask condition; model shown for comparison.

(Serre, Oliva and Poggio, PNAS, 2007)

Presenter notes: False-alarm rates are ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based, top-down attention) and serial (spatial attention) mechanisms to suppress clutter (cf. Tsotsos et al.).

(Chikkerur, Serre, Walther, Koch & Poggio, in prep.)

Face detection, scanning vs. attention (plot: detection accuracy vs. number of items).

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• A feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections may also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – What is the nature of the routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models: e.g., probabilistic inference in the ventral stream; neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values (Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005).

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity: our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

"The Mathematics of Learning: Dealing with Data", Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: mathematical results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

Page 46: NIPS2007: visual recognition in primates and machines

Max-like operation

Max-like behavior in V1

Lampl

Ferster Poggio amp Riesenhuber 2004 see also Finn Prieber amp Ferster 2007

Gawne

amp Martin 2002

Max-like behavior in V4

(Knoblich Koch Poggio in prep Kouh amp Poggio 2007 Knoblich Bouvrie Poggio 2007)

Biophys implementation

bull

Max and Gaussian-like tuning can be approximated with same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as model of MT (Rust et al Nature Neuroscience 2007

Can be implemented by shunting inhibition (Grossberg

1973 Reichardt

et al 1983 Carandini

and Heeger 1994) and spike threshold variability (Anderson et al 2000 Miller and Troyer 2002)

Adelson

and Bergen (see also Hassenstein

and Reichardt 1956)

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object:           car   pedestrian  bicycle  building  tree  road  sky
Labeled Examples: 5799  1449        209      5067      4932  3400  2562

3547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al 2004)
• Local patch correlation (Torralba et al 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

with T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al 1984; Wallis & Rolls 1997; Einhauser et al 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
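The space/time correlation idea above can be sketched with a Foldiak-style trace rule, in which a low-pass-filtered activity trace associates inputs that follow each other in time (learning-rate and trace constants below are illustrative, not from the tutorial):

```python
import numpy as np

def trace_rule(w, x_seq, lr=0.05, delta=0.2):
    # Foldiak (1991): the postsynaptic trace y_bar low-pass filters activity,
    # so inputs that follow each other in time (e.g. a translating object)
    # strengthen the same unit -> complex-cell-like invariance.
    y_bar = 0.0
    for x in x_seq:
        y = float(np.dot(w, x))
        y_bar = (1 - delta) * y_bar + delta * y
        w = w + lr * y_bar * (x - w)  # decay toward x keeps w bounded
    return w

# Two shifted views of the same "object" presented in succession
views = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])] * 20
w = trace_rule(np.full(3, 0.3), views)
```

After training, the unit responds to both views (weights grow on the two shifted inputs, decay elsewhere), i.e. it has learned a position-tolerant response from temporal continuity alone.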

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50~100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al 2005); see also (Casile & Giese 2005; Sigala et al 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human       81.3       91.6
UCSD Mice       75.6       79.0
Weiz. Human     86.7       96.3
Average         81.2       89.6

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva & Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Conditions: model; 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no mask

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, model not good enough

Presenter
Presentation Notes
FA ~16% for humans (except no-mask, with ~14%) and ~18% for the model.

Ongoing work…

Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) processing to suppress clutter (Tsotsos et al)

Chikkerur, Serre, Walther, Koch & Poggio, in prep

[Plot: detection accuracy vs. num items; face detection: scanning vs. attention]

Example Results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee & Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
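The analysis-by-synthesis idea can be caricatured in a few lines: each area's belief over hypotheses is the normalized product of bottom-up evidence and a top-down prior fed back from the higher area (a toy version of the Lee & Mumford-style picture, with made-up numbers):

```python
import numpy as np

def belief(bottom_up, top_down):
    # Posterior over hypotheses = normalized product of the bottom-up
    # likelihood and the top-down prior fed back from the higher area.
    b = np.asarray(bottom_up) * np.asarray(top_down)
    return b / b.sum()

# Ambiguous evidence ("animal" vs "non-animal") disambiguated by a scene prior
print(belief([0.5, 0.5], [0.8, 0.2]))  # prints [0.8 0.2]
```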

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

"The Mathematics of Learning: Dealing with Data", Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models, limitations
  • What is next: beyond feedforward models, limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 47: NIPS2007: visual recognition in primates and machines

(Knoblich, Koch & Poggio, in prep; Kouh & Poggio 2007; Knoblich, Bouvrie & Poggio 2007)

Biophys implementation

• Max and Gaussian-like tuning can be approximated with the same canonical circuit using shunting inhibition

Presenter
Presentation Notes
ADD CABLE AND CITE RELEVANT PEOPLE

Of the same form as a model of MT (Rust et al, Nature Neuroscience 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)
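The claim that one canonical circuit can approximate both key operations can be sketched with divisive normalization; the exponents below are chosen for illustration (see Kouh & Poggio for the actual formulation):

```python
import numpy as np

def max_like(x, q=8, k=1e-9):
    # Divisive normalization with high exponents approximates MAX pooling
    x = np.asarray(x, dtype=float)
    return np.sum(x ** (q + 1)) / (k + np.sum(x ** q))

def tuning_like(x, w, k=1e-9):
    # Normalized dot product: peaks when input pattern x matches template w,
    # a Gaussian-like tuning realized by the same divide-by-pool circuit
    x = np.asarray(x, dtype=float)
    return np.dot(w, x) / (k + np.sqrt(np.sum(x ** 2)))

print(max_like([0.1, 0.9, 0.5]))  # close to max = 0.9
```

Raising the exponent q sharpens the pool toward a hard max; lowering it recovers softer, tuning-like behavior, which is the sense in which one circuit covers both operations.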

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides a unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage
• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

[Figure: S2 subunit weights, from stronger suppression to stronger facilitation]

"Neurons in monkey visual area V2 encode combinations of orientations", Akiyuki Anzai, Xinmiao Peng & David C. Van Essen, Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation this is what was recently reported by Hegde and Van Essen.
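The S2/C2 division of labor described above (template tuning, then invariance pooling) in toy form; the patch vectors, template, and σ here are illustrative, not model parameters:

```python
import numpy as np

def s2_response(c1_patch, template, sigma=1.0):
    # S2: Gaussian (RBF-like) tuning to a stored template of C1 activities
    d2 = np.sum((np.asarray(c1_patch) - np.asarray(template)) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

def c2_response(c1_patches, template):
    # C2: max over positions/scales of S2 units sharing the same template,
    # keeping the selectivity while gaining tolerance to shifts and size
    return max(s2_response(p, template) for p in c1_patches)
```

The C2 response to a set of shifted patches equals the best S2 match, which is why selectivity is preserved while invariance increases.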

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al 1991; Distler et al 1991; Weller & Steele 1992; Nakamura et al 1993; Buffalo et al 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al 1980; Saleem et al 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data

• V1
  – Simple and complex cells tuning (Schiller et al 1976; Hubel & Wiesel 1965; Devalois et al 1982)
  – MAX operation in subset of complex cells (Lampl et al 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al 2002)
  – Two-spot interaction (Freiwald et al 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)
• IT
  – Tuning and invariance properties (Logothetis et al 1995)
  – Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva & Poggio 2007)
  – Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model; here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red color indicates data which are consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (no parameter fitting)

Pasupathy & Connor 1999

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.)… Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

"A Model of V4 Shape Selectivity and Invariance", Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio, J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al 1999) vs. C2 units

Reference (fixed); probe (varying)

x-axis: response(probe) - response(reference)
y-axis: resp(pair) - resp(reference)

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away the line should pass through the origin and should get a coefficient > 0.

Agreement w| IT Readout data

Hung, Kreiman, Poggio & DiCarlo 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in the study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier – like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next? Beyond feedforward models: limitations
  • What is next? Beyond feedforward models: limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next? Image inference: backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators

Of the same form as model of MT (Rust et al., Nature Neuroscience, 2007)

Can be implemented by shunting inhibition (Grossberg 1973; Reichardt et al. 1983; Carandini and Heeger 1994) and spike threshold variability (Anderson et al. 2000; Miller and Troyer 2002)

Adelson and Bergen (see also Hassenstein and Reichardt 1956)
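The circuit above — a weighted sum divided by a normalization pool — can approximate both of the model's two key computations depending on its exponents. A minimal sketch of that idea (the exponent and normalization-constant values here are my own illustrative choices, not the tutorial's implementation):

```python
import math

def canonical(x, w, p, q, k=1e-3):
    """Normalized polynomial circuit: y = sum(w_i * x_i**p) / (k + sum(x_i**q)).

    With p = 1, q = 2 the output behaves like a normalized dot product
    (Gaussian-like tuning); with p = q + 1 and a large q it approaches
    a soft MAX over the inputs.
    """
    num = sum(wi * xi ** p for wi, xi in zip(w, x))
    den = k + sum(xi ** q for xi in x)
    return num / den

x = [0.2, 0.9, 0.5]
w = [1.0, 1.0, 1.0]

tuning = canonical(x, w, p=1, q=2)   # normalized dot product (tuning regime)
softmax = canonical(x, w, p=9, q=8)  # close to max(x) = 0.9 (max regime)
```

The point of the sketch is that a single divisive-normalization circuit, with different exponents, covers both operations the model needs.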

• Generic overcomplete dictionary of reusable shape components (from V1 to IT) provides unique representation
  – Unsupervised learning (from ~10,000 natural images) during a developmental-like stage

• Task-specific circuits (from IT to PFC)
  – Supervised learning ~ Gaussian RBF

S2 units

• Features of moderate complexity (n ~ 1,000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5–10 subunits chosen at random from all possible afferents (~100–1,000)

[Figure: subunit weight maps, ranging from stronger facilitation to stronger suppression]

Anzai, A., Peng, X. & Van Essen, D. C. Neurons in monkey visual area V2 encode combinations of orientations. Nature Neuroscience 10, 1313–1321 (2007). Published online 16 September 2007; doi:10.1038/nn1975

C2 units

• Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4?

Presenter Notes:
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.
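The S2/C2 pair described above can be sketched in a few lines: an S2 unit responds with a Gaussian RBF to a stored template of C1 activities ("imprinted" during a developmental-like stage), and a C2 unit takes the MAX over S2 responses at nearby positions and scales. A toy sketch with random 9-dimensional patches standing in for C1 activity (dimensions and β are illustrative assumptions):

```python
import math
import random

def s2_response(c1_patch, template, beta=1.0):
    """S2 unit: Gaussian RBF tuning to a stored template of C1 activities."""
    d2 = sum((a - b) ** 2 for a, b in zip(c1_patch, template))
    return math.exp(-beta * d2)

def c2_response(s2_map):
    """C2 unit: MAX over S2 responses at different positions/scales."""
    return max(s2_map)

random.seed(0)
template = [random.random() for _ in range(9)]  # "imprinted" during development
patches = [[random.random() for _ in range(9)] for _ in range(5)]
patches.append(list(template))                  # one position matches exactly

s2_map = [s2_response(p, template) for p in patches]
c2 = c2_response(s2_map)                        # 1.0 at the matching position
```

Because C2 pools with a MAX, its response is unchanged if the matching patch appears at a different position in the map — which is exactly the position/scale tolerance the slide describes.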

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• “Replication” of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter Notes:
In an even more extreme version. Later I am going to show that this is important.

Comparison w/ neural data

• V1
  – Simple and complex cells: tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  – MAX operation in subset of complex cells (Lampl et al. 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – Two-spot interaction (Freiwald et al. 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  – Tuning and invariance properties (Logothetis et al. 1995)
  – Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva & Poggio 2007)
  – Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes:
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data used to actually constrain the model: here, V1 data used to obtain a population of V1 simple and complex units which mimics the tuning properties of cortical cells. The green data were already predicted by the original model and are still consistent with the new model. The red color indicates data consistent with the new model: data which in some cases already existed and were provided to us by the authors of the corresponding studies, and data which were collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w/ V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (stimuli from Pasupathy & Connor 1999). No parameter fitting.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes:
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested.

Cadieu, C., Kouh, M., Pasupathy, A., Connor, C. E., Riesenhuber, M. and Poggio, T. A Model of V4 Shape Selectivity and Invariance. J Neurophysiol 98: 1733–1750, 2007. First published June 27, 2007.

Presenter Notes:
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. model C2 units

Reference (fixed), probe (varying). Plotted: response(probe) – response(reference) vs. response(pair) – response(reference).

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes:
The biased competition model predicts that with attention away the line should pass through the origin and should have a coefficient > 0.
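The pair prediction above follows directly from MAX pooling: if the unit's response to two simultaneously presented stimuli is the larger of the two single-stimulus responses, it necessarily falls within the range spanned by the two responses alone. A trivial numerical sketch (the response values are hypothetical):

```python
def pair_response(r_reference, r_probe):
    """Idealized C2-style MAX pooling: the pair drives the unit at the
    level of the stronger of the two stimuli presented alone."""
    return max(r_reference, r_probe)

r_ref, r_probe = 0.6, 0.3       # hypothetical single-stimulus responses
r_pair = pair_response(r_ref, r_probe)   # 0.6: the upper end of the range
```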

Agreement w/ IT readout data

Hung, Kreiman, Poggio & DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes:
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4–PIT)–AIT–PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and “PFC” is a linear classifier – like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
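The linear read-out described in the remarks can be made concrete with a toy sketch: a perceptron trained on synthetic "IT-like" feature vectors. The data, dimensions, and learning rate here are all hypothetical, chosen only to show that a linear stage suffices once the features are selective and invariant:

```python
import random

def train_perceptron(xs, ys, epochs=20, lr=0.1):
    """Linear read-out: learn w, b so that sign(w.x + b) predicts the label."""
    n = len(xs[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):             # y in {-1, +1}
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:                    # mistake-driven update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

random.seed(1)
# Toy "IT features": class +1 clusters near 0.8, class -1 near 0.2 (illustrative only)
xs = [[0.8 + random.gauss(0, 0.05) for _ in range(10)] for _ in range(20)]
xs += [[0.2 + random.gauss(0, 0.05) for _ in range(10)] for _ in range(20)]
ys = [1] * 20 + [-1] * 20

w, b = train_perceptron(xs, ys)
correct = sum(
    (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1) == y
    for x, y in zip(xs, ys)
)
```

This mirrors the read-out experiments' logic: if a simple linear classifier on the population response decodes category, the hard work was done by the feature hierarchy, not the classifier.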

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter Notes:
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms → interval (ISI): 30 ms → mask (1/f noise): 80 ms

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter Notes:
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%.

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter Notes:
We tried simpler computer vision systems and they could not do the job.
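The d′ measure on the slide is the standard signal-detection sensitivity index, d′ = z(hit rate) − z(false-alarm rate). A quick sketch; the false-alarm figure is taken from the presenter notes elsewhere in the deck (~16%), while the 80% hit rate is an assumption for illustration:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: d' = z(H) - z(FA)."""
    z = NormalDist().inv_cdf       # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Illustrative values: ~80% correct detections with ~16% false alarms
d = d_prime(0.80, 0.16)            # roughly 1.84
```

Unlike raw accuracy, d′ separates sensitivity from response bias, which is why it is the right yardstick for comparing humans and the model.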

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project

Source: Bileschi & Wolf

Presenter Notes:
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse the image into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object            car    pedestrian  bicycle  building  tree   road   sky
Labeled Examples  5799   1449        209      5067      4932   3400   2562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter Notes:
TODO: Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of “templates” at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time); complex cells learn correlation in time.

[SHOW MOVIE]

Presenter Notes:
~19 hours of video, ~10^6 frames; developmental-like learning stage.
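One classic formalization of "learning correlation in time" is Földiák's (1991) trace rule, which the slide cites: the postsynaptic activity is low-pass filtered in time, so weights grow toward inputs that persist across consecutive frames (e.g. the same feature under small shifts). A simplified sketch (the decay and learning-rate values are my own illustrative choices):

```python
def trace_rule(inputs, w, delta=0.2, lr=0.05):
    """Hebbian learning with a memory trace of postsynaptic activity.

    The trace links inputs that occur close together in time, so the
    weights converge toward features that persist across frames."""
    trace = 0.0
    for x in inputs:                                   # x: presynaptic activity
        y = sum(wi * xi for wi, xi in zip(w, x))       # postsynaptic response
        trace = (1 - delta) * trace + delta * y        # temporal low-pass of y
        w = [wi + lr * trace * (xi - wi) for wi, xi in zip(w, x)]
    return w

# A "temporally continuous" stimulus: the same pattern persists over frames
frames = [[1.0, 1.0, 0.0]] * 100
w0 = [0.5, 0.1, 0.4]
w = trace_rule(frames, w0)   # weights move toward the persistent pattern
```

With a sequence of shifted views of the same object instead of a fixed pattern, the same rule yields units that respond to the object regardless of position — invariance learned from temporal continuity alone.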

What is next?

The problem: training videos and testing videos, each video ~4 s, 50–100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from Blank et al. 2005; see also Casile & Giese 2005; Sigala et al. 2005

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from Giese & Poggio 2003

Multi-class recognition accuracy:

Dataset      Baseline  Our system
KTH Human    81.3      91.6
UCSD Mice    75.6      79.0
Weiz Human   86.7      96.3
Average      81.2      89.6

(chance: 10–20%)  H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter Notes:
Our method has 8% better performance on average.

What is next? Beyond feedforward models: limitations

Psychophysics: Serre, Oliva & Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

What is next? Beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a “novel” justification for the need of attention

Limitations: beyond 50 ms, the model is not good enough

Conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and a no-mask condition — model vs. human performance (Serre, Oliva and Poggio, PNAS 2007)

Presenter Notes:
False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

• Model implementation of Wolfe's guided search (1994)
• Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Figure: face detection, scanning vs. attention — detection accuracy vs. number of items]

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream — neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

Why hierarchies?

“How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples. A comparison with real brains offers another, related challenge to learning theory. The ‘learning algorithms’ we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity: our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.”

Poggio, T. and Smale, S. The Mathematics of Learning: Dealing with Data. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences … (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

bull

Generic overcomplete dictionary of reusable shape

components (from V1 to IT) provide unique representation ndash

Unsupervised learning (from ~10000 natural images) during a developmental-like stage

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning ~ Gaussian RBF

S2 units

bull

Features of moderate complexity (n~1000 types)

bull

Combination of V1-like complex units at different orientations

bull

Synaptic weights w learned from natural images

bull

5-10 subunits chosen at random from all possible afferents (~100-1000)

stronger facilitation

stronger suppression

Nature Neuroscience -

10 1313 -

1321 (2007) Published online 16 September 2007 | doi101038nn1975

Neurons in monkey visual area V2 encode combinations of orientationsAkiyuki Anzai Xinmiao Peng amp David C Van Essen

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object (labeled examples): car 5,799; pedestrian 1,449; bicycle 209; building 5,067; tree 4,932; road 3,400; sky 2,562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide
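The windowed labeling scheme described in the notes (features computed over crops at all positions and scales, each crop classified into one of the 7 classes) can be sketched as follows. The window size, stride, scale set, feature extractor and classifier below are all stand-ins for illustration, not the actual system:

```python
# Sketch of windowed multi-class labeling: crops are taken at every position
# over an image pyramid and each crop is assigned one of the StreetScenes
# classes. Feature extractor and classifier here are toy stubs.
CLASSES = ["car", "pedestrian", "bicycle", "building", "tree", "road", "sky"]

def sliding_windows(width, height, win, stride, scales=(1.0, 0.5)):
    """Yield (x, y, size) crop coordinates over an image pyramid."""
    for s in scales:
        w, h = int(width * s), int(height * s)
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                yield x, y, win

def label_image(width, height, features, classify, win=64, stride=32):
    """Run the classifier on every window; return (window, label) pairs."""
    return [((x, y, k), classify(features(x, y, k)))
            for x, y, k in sliding_windows(width, height, win, stride)]

# toy stand-ins: constant feature vector, classifier that always says "road"
windows = label_image(256, 128, lambda x, y, k: [0.0], lambda f: "road")
```

In the real system the per-window decisions are made by classifiers trained on the model's features; the sketch only shows the scan structure.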

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; a developmental-like learning stage.
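The temporal-continuity idea above (complex cells learning correlations in time) can be sketched with a Foldiak-style trace rule: a unit strengthens weights toward inputs that co-occur with a slowly decaying trace of its own recent activity, so features that persist across adjacent frames get wired together. The learning rate, trace decay and toy stimulus below are illustrative choices, not the parameters of the actual simulations:

```python
# Minimal trace-rule sketch (in the spirit of Foldiak 1991): Hebbian update
# gated by a temporal trace of the unit's activity, so inputs active in
# successive frames are bound to the same unit.
import random

def trace_rule(frames, n_inputs, eta=0.1, delta=0.5, seed=0):
    rng = random.Random(seed)
    w = [rng.random() for _ in range(n_inputs)]  # random initial weights
    trace = 0.0
    for x in frames:                                   # one input vector per frame
        y = sum(wi * xi for wi, xi in zip(w, x))       # unit activity
        trace = (1 - delta) * trace + delta * y        # decaying activity trace
        w = [wi + eta * trace * xi for wi, xi in zip(w, x)]  # Hebb on the trace
    return w

# a bar sweeping across 4 positions: inputs 0-3 fire in succession,
# input 4 never fires, so its weight should stay at its initial value
frames = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0],
          [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]] * 5
w = trace_rule(frames, 5)
```

A full simulation would also normalize the weights to prevent unbounded growth; the sketch omits that to keep the temporal mechanism visible.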

What is next


The problem: training videos vs. testing videos (each video ~4 s, 50-100 frames)

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from Blank et al. 2005; see also Casile & Giese 2005; Sigala et al. 2005

Previous work: recognizing biological motion using a model of the dorsal stream (adapted from Giese & Poggio 2003)

Multi-class recognition accuracy (%)

Dataset       Baseline   Our system
KTH Human       81.3       91.6
UCSD Mice       75.6       79.0
Weiz. Human     86.7       96.3
Average         81.2       89.6

(chance: 10-20%)   H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average.
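The recognition stage of such a system can be caricatured as: reduce each video to one descriptor (here, the mean of its frame features) and assign the label of the nearest class mean. This only shows the shape of the pipeline; the actual system of Jhuang et al. uses motion-sensitive C2-like features and a stronger classifier, and all numbers below are made up:

```python
# Toy action-recognition pipeline: per-frame features -> one video
# descriptor -> nearest class mean. Descriptors and class means are invented.

def video_descriptor(frame_features):
    """Average the per-frame feature vectors into one video descriptor."""
    n, dim = len(frame_features), len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

def nearest_mean(desc, class_means):
    """Assign the label whose stored mean descriptor is closest."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(class_means, key=lambda c: dist2(desc, class_means[c]))

means = {"walk": [1.0, 0.0], "wave1": [0.0, 1.0]}   # invented class means
video = [[0.9, 0.1], [1.1, -0.1]]                    # two frames of toy features
label = nearest_mean(video_descriptor(video), means)
```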

What is next: beyond feedforward models, limitations

Psychophysics (Serre, Oliva, Poggio 2007); V4 (Reynolds, Chelazzi & Desimone 1999); IT (Zoccolan, Kouh, Poggio, DiCarlo 2007)

What is next: beyond feedforward models, limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Figure: human vs. model accuracy under four masking conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), no mask] (Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, the model is not good enough

Presenter
Presentation Notes
False alarms: ~16% for humans (except no mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep

[Figure: face detection, scanning vs. attention; detection accuracy vs. number of items]
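A rough sketch of the guided-search scheme mentioned above: a parallel stage builds a saliency map as a target-weighted combination of feature maps, then a serial stage visits peaks one at a time with inhibition of return. The maps and weights are toy values; this is not the Chikkerur et al. implementation:

```python
# Guided-search sketch: parallel top-down feature weighting, then serial
# winner-take-all fixations with inhibition of return.

def saliency(feature_maps, target_weights):
    """Parallel stage: top-down weighted sum of feature maps."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(tw * fm[i][j] for tw, fm in zip(target_weights, feature_maps))
             for j in range(w)] for i in range(h)]

def serial_visits(sal, n_fixations):
    """Serial stage: repeatedly fixate the maximum, then suppress it."""
    sal = [row[:] for row in sal]          # do not mutate the caller's map
    visits = []
    for _ in range(n_fixations):
        i, j = max(((i, j) for i in range(len(sal)) for j in range(len(sal[0]))),
                   key=lambda ij: sal[ij[0]][ij[1]])
        visits.append((i, j))
        sal[i][j] = float("-inf")          # inhibition of return
    return visits

red  = [[0, 1, 0], [0, 0, 0], [0, 0, 2]]   # toy "red" feature map
vert = [[3, 0, 0], [0, 0, 0], [0, 0, 1]]   # toy "vertical" feature map
fixations = serial_visits(saliency([red, vert], [1.0, 0.0]), 2)
```

With the target weighted toward "red", the two fixations land on the two red peaks in order of strength, ignoring the strong "vertical" location.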

Example Results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
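At its smallest, the analysis-by-synthesis idea is a Bayesian update: a higher area supplies a prior over hypotheses, a lower area supplies bottom-up likelihoods P(input | hypothesis), and the posterior combines the two (cf. Lee and Mumford 2003). The hypothesis names and numbers below are made up for illustration:

```python
# One-step Bayesian combination of top-down expectation and bottom-up evidence.

def posterior(prior, likelihood):
    """P(h | input) from top-down prior P(h) and bottom-up P(input | h)."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())             # normalizing constant
    return {h: v / z for h, v in unnorm.items()}

prior = {"animal": 0.5, "non-animal": 0.5}        # top-down expectation
likelihood = {"animal": 0.8, "non-animal": 0.2}   # bottom-up evidence
post = posterior(prior, likelihood)
```

In the full proposals this update is iterated between areas until the hypotheses at all levels are globally consistent; the sketch shows only a single level.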

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

• S. Bileschi
• L. Wolf

Learning invariances

• T. Masquelier
• S. Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators

S2 units

• Features of moderate complexity (n ~ 1000 types)
• Combination of V1-like complex units at different orientations
• Synaptic weights w learned from natural images
• 5-10 subunits chosen at random from all possible afferents (~100-1000)

[Figure: learned S2 weights, ranging from stronger facilitation to stronger suppression]

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975. Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen.

C2 units

• Same selectivity as S2 units, but increased tolerance to position and size of preferred stimulus
• Local pooling over S2 units with same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between S2 and C2 units is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.
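The two operations behind S2 and C2 units (Gaussian-like tuning to a stored template, then max pooling over position and scale) can be sketched directly. The tuning width, template and patch values below are illustrative only:

```python
# S2: Gaussian tuning to a template of complex-cell afferents.
# C2: max over S2 units with the same template at nearby positions/scales,
# which preserves the selectivity while adding tolerance.
import math

def s2_response(afferents, template, sigma=1.0):
    """Gaussian tuning: peaks when the input pattern matches the template."""
    d2 = sum((a - t) ** 2 for a, t in zip(afferents, template))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def c2_response(patches, template):
    """Max pooling over S2 units sharing the template at different positions."""
    return max(s2_response(p, template) for p in patches)

template = [1.0, 0.0, 1.0]
patches = [[0.0, 1.0, 0.0],   # poor match
           [1.0, 0.0, 1.0],   # exact match at one position in the pooling range
           [0.9, 0.1, 0.8]]   # near match
resp = c2_response(patches, template)
```

Because the max selects the best-matching position, the C2 response stays high wherever the preferred pattern appears within the pooling range, which is exactly the invariance-without-loss-of-selectivity trade-off described above.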

A loose hierarchy

• Bypass routes along with main routes:
  – From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  – From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extreme version… Later I am going to show that this is important.

• V1
  – Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  – MAX operation in subset of complex cells (Lampl et al. 2004)
• V4
  – Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  – MAX operation (Gawne et al. 2002)
  – Two-spot interaction (Freiwald et al. 2005)
  – Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  – Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT
  – Tuning and invariance properties (Logothetis et al. 1995)
  – Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  – Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  – Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human
  – Rapid categorization (Serre, Oliva, Poggio 2007)
  – Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. There are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model; here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and which are still consistent with the new model. Then the red color indicates data which are consistent with the new model; these are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w| V4

Pasupathy & Connor 2001: tuning for curvature and boundary conformations

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (no parameter fitting) (Pasupathy & Connor 1999)

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007. A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio.

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed), probe (varying). Plot: x = response(probe) - response(reference); y = response(pair) - response(reference).

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin and should get a coefficient > 0.

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast: once the visual signal reaches IT (which is where they recorded from in this study), from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
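The first two remarks can be put in one small sketch, assuming a Gaussian RBF feature stage (units tuned to stored centers, as in the (V4-PIT)-AIT stage) followed by a linear read-out (as in the IT read-out experiments). Centers, weights and inputs are toy values:

```python
# Gaussian RBF stage + linear read-out: the sign of the linear score decides
# the class, exactly the structure of the read-out classifiers.
import math

def rbf_features(x, centers, sigma=1.0):
    """One Gaussian unit per stored center (template)."""
    return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                     / (2.0 * sigma ** 2)) for c in centers]

def linear_readout(features, weights, bias=0.0):
    """Linear classifier on the RBF activities."""
    return sum(w * f for w, f in zip(weights, features)) + bias

centers = [[0.0, 0.0], [1.0, 1.0]]   # two stored templates ("views")
weights = [-1.0, 1.0]                # read-out: positive class near 2nd center
score = linear_readout(rbf_features([0.9, 1.1], centers), weights)
```

The generalization guarantees the remark alludes to come from the RBF stage being a well-understood kernel architecture; the linear read-out on top is what the decoding experiments fit.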

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models (limitations)
  • What is next: beyond feedforward models (limitations)
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 51: NIPS2007: visual recognition in primates and machines

Nature Neuroscience 10, 1313-1321 (2007). Published online 16 September 2007 | doi:10.1038/nn1975
Neurons in monkey visual area V2 encode combinations of orientations. Akiyuki Anzai, Xinmiao Peng & David C. Van Essen

C2 units
• Same selectivity as S2 units, but increased tolerance to position and size of the preferred stimulus
• Local pooling over S2 units with the same selectivity but slightly different positions and scales
• A prediction to be tested: S2 units in V2 and C2 units in V4

Presenter notes:
Obviously the existence of simple vs. complex units is a prediction from the model which still needs to be shown experimentally. One difficulty is that it would be very hard to tell apart, say, an S2 from a C2 unit: you would have to be able to isolate each cell subunit. The S2 units which project onto a C2 unit all share the same selectivity, which is also the selectivity of the C2 unit onto which they project. So the only difference between an S2 and a C2 unit is simply the range of invariance (position and scale). To a first approximation, this is what was recently reported by Hegde and Van Essen.
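The S-to-C pooling described above rests on the model's two basic operations: Gaussian tuning to a stored prototype (S units) and a max over afferents with the same selectivity at shifted positions/scales (C units). A minimal toy sketch of these two operations, with made-up values (not the actual model code of Serre et al.):

```python
import math

def s_unit(inputs, prototype, sigma=1.0):
    """S-layer: Gaussian tuning -- response peaks when the input
    pattern matches the unit's stored prototype."""
    dist2 = sum((x - w) ** 2 for x, w in zip(inputs, prototype))
    return math.exp(-dist2 / (2 * sigma ** 2))

def c_unit(s_responses):
    """C-layer: max pooling over S units sharing the same selectivity
    but differing in position/scale -- inherits the selectivity,
    gains tolerance."""
    return max(s_responses)

# Toy afferents: the same prototype probed at three positions of a stimulus.
proto = [1.0, 0.5, 0.0]
views = [[1.0, 0.5, 0.0], [0.9, 0.6, 0.1], [0.0, 0.5, 1.0]]
s = [s_unit(v, proto) for v in views]
print(c_unit(s))  # the best-matching position dominates -> 1.0
```

Because the max passes through the best-matching afferent, the C2 response stays the same as the stimulus shifts, which is exactly the increased tolerance the slide refers to.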

A loose hierarchy
• Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter notes:
In an even more extreme version. Later I am going to show that this is important.

Comparison w| neural data
• V1:
  - Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; De Valois et al. 1982)
  - MAX operation in subset of complex cells (Lampl et al. 2004)
• V4:
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT:
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human:
  - Rapid categorization (Serre, Oliva & Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes:
For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in the corresponding cortical areas. There are 3 colors for 3 types of data. Yellow: data used to actually constrain the model; here, V1 data used to build a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Green: data already predicted by the original model and still consistent with the new model. Red: data consistent with the new model; in some cases these already existed and were provided to us by the authors of the corresponding studies, and some were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I will show you a few key examples.

Comparison w| V4
Pasupathy & Connor 2001: tuning for curvature and boundary conformations.
V4 neuron tuned to boundary conformations vs. the most similar model C2 unit: ρ = 0.78 (Pasupathy & Connor 1999). No parameter fitting.
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes:
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4. Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested…

J. Neurophysiol. 98: 1733-1750, 2007. First published June 27, 2007.
A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter notes:
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units.
Reference stimulus (fixed), probe (varying); the plot shows response(probe) - response(reference) against response(pair) - response(reference).
Prediction: the response to the pair falls between the responses elicited by the two stimuli alone.
(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter notes:
The biased competition model predicts that with attention away, the line should pass through the origin with a coefficient > 0.
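The "between" prediction above follows directly from a max-like pooling operation: if the cell's response to two stimuli in its receptive field is driven by the strongest afferent, the pair response can never exceed the stronger single-stimulus response nor fall below the weaker one. A toy illustration (values made up; a pure max sits at the upper end of that range):

```python
def pooled_response(stimulus_responses):
    """Max-like pooling: the response to several stimuli in the
    receptive field is driven by the strongest one (toy version)."""
    return max(stimulus_responses)

reference, probe = 0.8, 0.3          # hypothetical single-stimulus responses
pair = pooled_response([reference, probe])

# The pair response lies between the two single-stimulus responses,
# as in the Reynolds et al. (1999) data and the model C2 units.
assert min(reference, probe) <= pair <= max(reference, probe)
print(pair)  # 0.8
```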

Agreement w| IT Readout data
Hung, Kreiman, Poggio & DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter notes:
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from: from the very first spikes one can already read out object category and identity.

Remarks
• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
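The two-stage structure in the remarks (a Gaussian-RBF feature stage followed by a linear "PFC" readout) can be sketched in a few lines. This is a minimal toy version with hypothetical centers and weights, not the actual model's learned parameters:

```python
import math

def rbf_features(x, centers, sigma=1.0):
    """Gaussian-RBF stage (the (V4-PIT)-AIT layer of the remark):
    each IT-like unit is tuned to a stored center."""
    return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                     / (2 * sigma ** 2)) for c in centers]

def linear_readout(features, weights, bias=0.0):
    """'PFC' stage: a plain linear classifier on the IT-like features,
    as in the read-out experiments."""
    return sum(w * f for w, f in zip(weights, features)) + bias

centers = [[0.0, 0.0], [1.0, 1.0]]   # hypothetical stored prototypes
w = [-1.0, 1.0]                      # hypothetical readout weights
x = [0.9, 1.1]                       # test input near the 2nd prototype
score = linear_readout(rbf_features(x, centers), w)
print("class A" if score > 0 else "class B")  # class A
```

In learning-theory terms this is the standard RBF-network regularization solution, which is why the remark can appeal to known generalization guarantees.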

Rapid categorization
SHOW ANIMAL / NON-ANIMAL MOVIE
Database collected by Oliva & Torralba

Presenter notes:
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)
Animal present or not?
Stimulus image: 20 ms; image-mask interval (ISI): 30 ms; mask (1/f noise): 80 ms.
(Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005)

Presenter notes:
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…
Human observers (n = 24): 80%. Model: 82%.
(Serre, Oliva & Poggio 2007)
• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter notes:
We tried simpler computer vision systems and they could not do the job.
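The d' measure mentioned above is the standard signal-detection sensitivity index: the z-score of the hit rate minus the z-score of the false-alarm rate, which factors out response bias. A small sketch using hypothetical rates in the ballpark of those reported for this task (~80% hits, ~16% false alarms):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = z(hit) - z(false alarm);
    higher d' means better discrimination, independent of bias."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Hypothetical values, not the published per-condition numbers.
print(round(d_prime(0.80, 0.16), 2))  # about 1.84
```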

Further comparisons
• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)
Serre, Oliva & Poggio, PNAS 2007
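The ρ values above are Pearson correlations computed image by image between model and human responses. A self-contained sketch of that computation, with made-up per-image scores (the real comparison uses the published model confidences and human accuracies):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, as used for the
    image-by-image model vs. human comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores: model confidence vs. fraction of human
# observers answering "animal" on the same five images.
model = [0.9, 0.2, 0.7, 0.1, 0.8]
humans = [0.95, 0.30, 0.60, 0.05, 0.90]
print(round(pearson(model, humans), 2))  # about 0.97
```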

The street scene project
Source: Bileschi & Wolf

Presenter notes:
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object            car   pedestrian  bicycle  building  tree  road  sky
Labeled examples  5799  1449        209      5067      4932  3400  2562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.
http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter notes:
TODO: Add 420 True Labels to this slide.

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)
Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?
• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?

• Generic dictionary of shape components (from V1 to IT):
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC):
  - Supervised learning
Layers of cortical processing units

Learning the invariance from temporal continuity
w| T. Masquelier & S. Thorpe (CNRS, France)
(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)
Simple cells learn correlation in space (at the same time); complex cells learn correlation in time.
SHOW MOVIE

Presenter notes:
~19 hours of video, ~10^6 frames; a developmental-like learning stage.
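"Complex cells learn correlation in time" is usually implemented as a trace rule in the spirit of Foldiak (1991): the unit keeps a slowly decaying memory trace of its activity, and inputs active while the trace is high get strengthened, so features that follow each other in time (e.g. an edge sliding across positions) are grouped onto the same unit. A toy sketch with made-up values, not the Masquelier & Thorpe implementation:

```python
def train_trace(frames, n_inputs, eta=0.1, decay=0.8):
    """Toy trace rule: weight update is gated by a low-pass-filtered
    memory of the unit's recent activity."""
    w = [0.0] * n_inputs
    trace = 0.0
    for x in frames:                 # one frame = afferent activities
        y = max(x)                   # toy feedforward response
        trace = decay * trace + (1 - decay) * y
        w = [wi + eta * trace * xi for wi, xi in zip(w, x)]
    return w

# Two positions alternating over time, as for a moving stimulus:
# both end up strongly connected to the same unit.
frames = [[1, 0], [0, 1]] * 10
w = train_trace(frames, 2)
print(all(wi > 0.1 for wi in w))  # True
```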

What is next?

The problem
Training videos / testing videos; each video ~4 s, 50-100 frames.
Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.
Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset      Baseline  Our system
KTH Human    81.3      91.6
UCSD Mice    75.6      79.0
Weiz. Human  86.7      96.3
Average      81.2      89.6

(chance: 10-20%)
H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter notes:
Our method has 8% better performance on average.

What is next: beyond feedforward models (limitations)
Psychophysics: Serre, Oliva & Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

What is next: beyond feedforward models (limitations)
• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing work… Attention and cortical feedbacks
Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)
Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Example Results
Face detection, scanning vs. attention: detection accuracy as a function of the number of items.

What is next?
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms
• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• A feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections may also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - What is the nature of the routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models
Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.
(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)

What is next?
• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.
A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?
Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.
There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.
Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.
The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale

Formalizing the hierarchy: towards a theory
Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.
From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…
(Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark
There is still much to do before we understand vision… and the brain.

Collaborators
Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 52: NIPS2007: visual recognition in primates and machines

C2 units

bull

Same selectivity as S2 units but increased tolerance to position and size of preferred stimulus

bull

Local pooling over S2 units with same selectivity but slightly different positions and scales

bull

A prediction to be tested S2 units in V2 and C2 units in V4

Presenter
Presentation Notes
obviously the existence of simple vs complex units is a prediction from the model which still needs to be shown experimentally1313One difficulty is that it would be very hard to tell apart say a S2 from a C2 unit you would have to be able to isolate each cell subunit Then the S2 units which project onto a C2 units all have the same selectivity which is also they share the same selectivity which is also the selectivity of the C2 unit onto which they project So the only difference between S2 and C2 unit is simply the range of invariance (position and scale)1313To a first approximation this is what was recently reported by Hegde and VanEssen13

A loose hierarchy

bull

Bypass routes along with main routes bull

From V2 to TEO (bypassing V4) (Morel amp Bullier 1990 Baizer et al 1991 Distler et al 1991 Weller amp Steele 1992 Nakamura et al 1993 Buffalo et al 2005)

bull

From V4 to TE (bypassing TEO) (Desimone et al 1980 Saleem et al 1992)

bull

ldquoReplicationrdquo

of simpler selectivities from lower to higher areas

bull

Richer dictionary of features with various levels of selectivity and invariance

Presenter
Presentation Notes
In an even more extrem version Later I am going to show that this is important

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problem: training videos / testing videos

Each video: ~4 s, 50~100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (%):

Dataset      | Baseline | Our system
KTH Human    | 81.3     | 91.6
UCSD Mice    | 75.6     | 79.0
Weiz. Human  | 86.7     | 96.3
Average      | 81.2     | 89.6

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter Notes: Our method has ~8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva & Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Figure: model vs. human performance in the 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and no-mask conditions]

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, model not good enough

Presenter Notes: FA ~16% for humans (except no mask, with 14%) and the model ~18%.

Ongoing work... Attention and cortical feedbacks

• Model implementation of Wolfe's guided search (1994)
• Parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Figure: face detection, scanning vs. attention: detection accuracy vs. number of items]

Example Results

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
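A toy version of the computation these models posit, just to fix ideas (all numbers below are made up for illustration): a unit's activity stands for P(hypothesis | evidence), obtained by combining a bottom-up likelihood with a top-down prior via Bayes' rule.

```python
def posterior(likelihood, prior):
    """Bayes: P(h|e) is proportional to P(e|h) * P(h), normalized over
    hypotheses. Here the likelihood plays the role of the bottom-up
    sensory signal and the prior the top-down hypothesis."""
    unnorm = {h: likelihood[h] * prior[h] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# Ambiguous bottom-up evidence, with top-down scene context
# favoring 'animal' (illustrative numbers only)
like = {"animal": 0.6, "non-animal": 0.4}
prior = {"animal": 0.7, "non-animal": 0.3}
post = posterior(like, prior)
```

In the full hierarchical versions (e.g. Lee and Mumford's particle-filtering account), this combination is iterated between areas until the beliefs at all levels are globally consistent; the sketch above shows only a single such step.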

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, ...)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences...

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision... and the brain.

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • ...solves the problem (if mask forces feedforward processing)...
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now...
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • ...solves the problem (when mask forces feedforward processing)...
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models (limitations)
  • What is next: beyond feedforward models (limitations)
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 53: NIPS2007: visual recognition in primates and machines

A loose hierarchy

• Bypass routes along with main routes:
  - From V2 to TEO (bypassing V4) (Morel & Bullier 1990; Baizer et al. 1991; Distler et al. 1991; Weller & Steele 1992; Nakamura et al. 1993; Buffalo et al. 2005)
  - From V4 to TE (bypassing TEO) (Desimone et al. 1980; Saleem et al. 1992)
• "Replication" of simpler selectivities from lower to higher areas
• Richer dictionary of features with various levels of selectivity and invariance

Presenter Notes: In an even more extreme version... Later I am going to show that this is important.

Comparison w/ neural data

• V1:
  - Simple and complex cells tuning (Schiller et al. 1976; Hubel & Wiesel 1965; Devalois et al. 1982)
  - MAX operation in a subset of complex cells (Lampl et al. 2004)
• V4:
  - Tuning for two-bar stimuli (Reynolds, Chelazzi & Desimone 1999)
  - MAX operation (Gawne et al. 2002)
  - Two-spot interaction (Freiwald et al. 2005)
  - Tuning for boundary conformation (Pasupathy & Connor 2001; Cadieu et al. 2007)
  - Tuning for Cartesian and non-Cartesian gratings (Gallant et al. 1996)
• IT:
  - Tuning and invariance properties (Logothetis et al. 1995)
  - Differential role of IT and PFC in categorization (Freedman et al. 2001, 2002, 2003)
  - Read-out data (Hung, Kreiman, Poggio & DiCarlo 2005)
  - Pseudo-average effect in IT (Zoccolan, Cox & DiCarlo 2005; Zoccolan, Kouh, Poggio & DiCarlo 2007)
• Human:
  - Rapid categorization (Serre, Oliva & Poggio 2007)
  - Face processing (fMRI + psychophysics) (Riesenhuber et al. 2004; Jiang et al. 2006)

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas. Here there are 3 colors for 3 types of data. Yellow is for data that I used to actually constrain the model; here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells. Then there are the green data, which were already predicted by the original model and are still consistent with the new model. Then the red color indicates data which are consistent with the new model; these are data which in some cases already existed and were provided to us by the authors of the corresponding studies. There are also data (give example) which were actually collected in collaboration with experimentalists (e.g. Miller, DiCarlo) following a prediction from the model. I am not going to be able to guide you through all of these data, but I am going to show you a few key examples.

Comparison w/ V4

Tuning for curvature and boundary conformations (Pasupathy & Connor 2001)

V4 neuron tuned to boundary conformations vs. most similar model C2 unit: ρ = 0.78 (Pasupathy & Connor 1999; no parameter fitting)

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes: Finally, I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 (explain figures etc.). Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2/C2 units generated is compatible with data from several groups (at the population level). We can talk about it in the discussion (I have slides) if everybody is interested...

A Model of V4 Shape Selectivity and Invariance. Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio. J. Neurophysiol. 98: 1733-1750, 2007; first published June 27, 2007.

Presenter Notes: More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but are learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from the receptive field) (Reynolds et al. 1999) vs. C2 units

Reference stimulus (fixed); probe stimulus (varying)

[Figure axes: x = response(probe) - response(reference); y = resp(pair) - resp(reference)]

Prediction: the response to the pair falls between the responses elicited by the stimuli alone

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter Notes: The biased competition model predicts that with attention away the line should pass through the origin, and one should get a > 0 coefficient.
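The prediction follows directly from a max-like C2 operation: the max of two responses always equals the larger one, so it necessarily lies between the two single-stimulus responses. A minimal numerical check, with made-up firing rates (not measured data):

```python
def c2_pair_response(ref, probe):
    """Max-like pooling: the C2 response to reference + probe presented
    together is the larger of the two responses measured alone."""
    return max(ref, probe)

# Illustrative responses in arbitrary units
ref, probe = 0.8, 0.3
pair = c2_pair_response(ref, probe)

# The pair response falls between the single-stimulus responses
in_between = min(ref, probe) <= pair <= max(ref, probe)
```

Sweeping the probe response while holding the reference fixed reproduces the qualitative Reynolds-style plot: the pair-vs-probe difference curve is clipped wherever the reference response dominates.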

Agreement w/ IT read-out data

Hung, Kreiman, Poggio & DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter Notes: Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded from in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
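The two key computations the model alternates (Gaussian/RBF-like tuning in S units, max pooling in C units) can be sketched in a few lines. The vectors, sizes and σ below are illustrative toy values, not the model's actual filters:

```python
import math

def s_unit(patch, template, sigma=1.0):
    """S ('simple') unit: Gaussian tuning, peaking when the input patch
    matches the stored template."""
    d2 = sum((p - t) ** 2 for p, t in zip(patch, template))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def c_unit(s_responses):
    """C ('complex') unit: max pooling over S units tuned to the same
    template at different positions/scales, giving invariance."""
    return max(s_responses)

template = [1.0, 0.0, 1.0]
# the preferred pattern at one position, plus two other positions
patches = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.9, 0.1, 1.0]]
s = [s_unit(p, template) for p in patches]
c = c_unit(s)  # the best-matching position dominates
```

Stacking these two operations yields the selectivity-then-invariance alternation of the hierarchy, and the top-level C responses are exactly the kind of dictionary a linear (or RBF) read-out can be trained on.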

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter Notes: Any task can be made arbitrarily easy or difficult...

Rapid categorization task (with mask to test the feedforward model)

Animal present or not?

Sequence: image (20 ms) -> blank image-mask interval (30 ms ISI) -> mask, 1/f noise (80 ms)

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter Notes: We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is a fairly long SOA. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

...solves the problem (when mask forces feedforward processing)...

Human observers (n = 24): 80%
Model: 82%

Serre, Oliva & Poggio 2007

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter Notes: We tried simpler computer vision systems and they could not do the job.
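For reference, d' is computed from hit and false-alarm rates as d' = z(hit) - z(FA), with z the inverse standard-normal CDF, which is what makes it a bias-free sensitivity measure. A small sketch using only the standard library (the rates are illustrative, loosely based on the numbers quoted in the notes):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity index: d' = z(hit) - z(false alarm),
    where z is the inverse standard-normal CDF."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# e.g. ~80% hits with ~16% false alarms (illustrative values)
d = d_prime(0.80, 0.16)
```

A chance-level observer (equal hit and false-alarm rates) gets d' = 0, and d' grows as hits rise or false alarms fall, which is why the slide equates higher d' with better performance.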

Further comparisons

• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project

Source: Bileschi & Wolf

Presenter Notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and roads.

The StreetScenes Database

Object           | car  | pedestrian | bicycle | building | tree | road | sky
Labeled Examples | 5799 | 1449       | 209     | 5067     | 4932 | 3400 | 2562

3,547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter Notes: TODO: Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

Page 54: NIPS2007: visual recognition in primates and machines

bull

V1bull

Simple and complex cells tuning

(Schiller et al 1976 Hubel amp Wiesel 1965 Devalois et al 1982)bull

MAX operation in subset of complex cells (Lampl et al 2004)

bull

V4bull

Tuning for two-bar stimuli

(Reynolds Chelazzi amp Desimone 1999)bull

MAX operation

(Gawne et al 2002)bull

Two-spot interaction

(Freiwald et al 2005)bull

Tuning for boundary conformation (Pasupathy amp Connor 2001 Cadieu

et al 2007)bull

Tuning for Cartesian and non-Cartesian gratings

(Gallant et al 1996)

bull

ITbull

Tuning and invariance properties

(Logothetis et al 1995)bull

Differential role of IT and PFC in categorization

(Freedman et al 2001 2002 2003)bull

Read out data (Hung Kreiman Poggio amp DiCarlo 2005)bull

Pseudo-average effect in IT

(Zoccolan Cox amp DiCarlo 2005 Zoccolan Kouh Poggio amp DiCarlo 2007)

bull

Humanbull

Rapid categorization (Serre Oliva Poggio 2007)bull

Face processing (fMRI + psychophysics)

(Riesenhuber et al 2004 Jiang et al 2006)

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Comparison w| neural data

Presenter
Presentation Notes
-- For the past several years we have performed a number of quantitative comparisons between the tuning of units in the model and cells in corresponding cortical areas1313Here there are 3 colors for 3 types of data13Yellow is for data that I used to actuallyconstrain the model Here those are V1 data that I used to get a population of V1 simple and complex units which mimics the tuning properties of cortical cells1313Then there are the green data which were already predicted by the original model and which are still consistent with the new model1313Then the red color indicates data which are consistent with the new model These are data which in some cases already existed and were provided to us by the authors of the corresponding studies There are also data (give example) which were actually collected in collaboration with experimentalists (eg Miller DiCarlo) following a prediction from the model1313I am not going to be able to guide you through all of these data but I am going to show you a few key examples 131313

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away from receptive field) (Reynolds et al. 1999) vs. C2 units

Reference (fixed), Probe (varying). Axes:
selectivity = response(probe) − response(reference)
sensory interaction = resp(pair) − resp(reference)

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin and should have a coefficient > 0.
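The two quantities plotted on the slide can be written as a pair of indices per neuron, and the averaging prediction checked directly. A toy sketch (the function names are mine):

```python
def pair_indices(resp_ref, resp_probe, resp_pair):
    """Reynolds et al. (1999)-style indices:
    selectivity         = response(probe) - response(reference)
    sensory interaction = response(pair)  - response(reference)"""
    return resp_probe - resp_ref, resp_pair - resp_ref

def pair_falls_between(resp_ref, resp_probe, resp_pair):
    """Averaging / biased-competition prediction: the pair response
    lies between the two single-stimulus responses."""
    lo, hi = sorted((resp_ref, resp_probe))
    return lo <= resp_pair <= hi
```

Plotting sensory interaction against selectivity across many reference/probe pairs gives the regression line the notes refer to.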

Agreement w| IT Readout data

Hung, Kreiman, Poggio, DiCarlo 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well
• In the theory, the stage between IT and ''PFC'' is a linear classifier, like the one used in the read-out experiments
• The inputs to IT are a large dictionary of selective and invariant features
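A minimal sketch of the architecture these remarks describe: Gaussian RBF units tuned to stored prototypes (the AIT-like stage) followed by a linear readout (the ''PFC'' stage). Everything here, data, prototypes, dimensions, is a toy stand-in for the model's C2/IT features:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_features(X, prototypes, sigma=1.0):
    """Gaussian tuning: each unit responds maximally when the input matches its prototype."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Toy two-class problem in a 2-D stand-in for C2 feature space.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(2.0, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

prototypes = X[rng.choice(len(X), 10, replace=False)]     # "templates" stored from experience
F = np.c_[rbf_features(X, prototypes), np.ones(len(X))]   # RBF layer + bias term
w, *_ = np.linalg.lstsq(F, y, rcond=None)                 # linear classifier ("IT -> PFC")
accuracy = float(((F @ w > 0.5).astype(int) == y).mean())
```

The RBF layer is the part that learning theory says generalizes well; the readout is as simple as in the read-out experiments.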

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image (20 ms) → Interval image-mask (30 ms ISI) → Mask, 1/f noise (80 ms)

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. THE MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%
Model: 82%

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job
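The d′ on the slide is the standard signal-detection sensitivity index, computed from hit and false-alarm rates via the inverse normal CDF; the exact formula below is the textbook definition, not spelled out on the slide:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """d' = z(hit rate) - z(false-alarm rate); higher means better sensitivity."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# E.g. ~80% hits with the ~16% false-alarm rate mentioned in the notes:
d_human = d_prime(0.80, 0.16)
```

Unlike raw percent correct, d′ separates sensitivity from response bias, which is why it is used to compare humans and model.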

Further comparisons

• Image-by-image correlation:
– Heads: ρ = 0.71
– Close-body: ρ = 0.84
– Medium-body: ρ = 0.71
– Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

The street scene project

Source: Bileschi & Wolf

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left and the system would automatically parse it into its main objects. What they did is use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as being one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object:            car   pedestrian  bicycle  building  tree  road  sky
Labeled Examples:  5799  1449        209      5067      4932  3400  2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
– Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
– Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

[SHOW MOVIE]

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; a developmental-like learning stage.
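A toy sketch of the temporal-continuity idea, in the spirit of Foldiak's (1991) trace rule rather than the actual learning rule used with Masquelier & Thorpe: a unit keeps a decaying trace of its activity, and the trace-gated update binds inputs that occur in close temporal succession, e.g. the same bar sweeping across positions, onto one "complex" unit:

```python
import numpy as np

def trace_rule(frames, alpha=0.3, delta=0.5):
    """One 'complex' unit learning position invariance from a temporal sequence.

    trace <- (1 - delta) * trace + delta * y   (temporal trace of activity)
    w     <- w + alpha * trace * (x - w)       (trace-gated Hebbian update)
    """
    w = np.zeros(frames.shape[1])
    trace = 0.0
    for t, x in enumerate(frames):
        y = float(w @ x) if t else 1.0   # assumption: the unit fires at sequence onset
        trace = (1.0 - delta) * trace + delta * y
        w += alpha * trace * (x - w)
    return w

# A bar sweeping repeatedly across 8 positions (one-hot frames).
sweep = np.tile(np.eye(8), (30, 1))
w = trace_rule(sweep)
```

Because the trace outlives any single frame, the unit ends up with positive weight at every swept position: correlation in time yields invariance in space.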

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem

Training videos / testing videos; each video ~4 s, 50~100 frames

bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

              Baseline   Our system
KTH Human     81.3       91.6
UCSD Mice     75.6       79.0
Weiz. Human   86.7       96.3
Average       81.2       89.6

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has ~8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions: model; 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition

(Serre, Oliva and Poggio, PNAS 2007)

Presenter
Presentation Notes
False-alarm rates: ~16% for humans (except the no-mask condition, with ~14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention (plot: detection accuracy vs. num. items)

Example Results
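A minimal sketch of the guided-search idea referenced above (Wolfe 1994): top-down feature weights scale bottom-up feature maps into a priority map, and serial spatial attention visits its peaks in order. The maps, weights and tiny 2×2 "image" are all invented for illustration:

```python
import numpy as np

def priority_map(feature_maps, top_down_weights):
    """Parallel stage: top-down weighted sum of bottom-up feature maps."""
    return sum(w * fm for w, fm in zip(top_down_weights, feature_maps))

def visit_order(priority, k):
    """Serial stage: attend the k highest-priority locations, best first."""
    flat_idx = np.argsort(priority.ravel())[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, priority.shape))
            for i in flat_idx]

# Searching for a red vertical bar: weight the 'red' and 'vertical' maps up.
red      = np.array([[0.0, 1.0], [0.0, 1.0]])
vertical = np.array([[1.0, 1.0], [0.0, 0.0]])
pmap = priority_map([red, vertical], [1.0, 1.0])
first = visit_order(pmap, 1)[0]   # the location carrying both target features wins
```

Changing the top-down weights re-ranks the same bottom-up input, which is the sense in which feature-based attention suppresses clutter before serial inspection.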

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
– Which biophysical mechanisms for routing/gating?
– Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
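The common core of these Bayesian proposals can be reduced to one line: a higher area's belief is the bottom-up likelihood reweighted by a top-down prior. A toy sketch (the numbers are illustrative, not taken from any of the cited models):

```python
import numpy as np

def combine(likelihood, prior):
    """P(h | evidence) ∝ P(evidence | h) · P(h), normalized over hypotheses."""
    p = np.asarray(likelihood, dtype=float) * np.asarray(prior, dtype=float)
    return p / p.sum()

# Two hypotheses: 'animal' vs 'non-animal'.
likelihood = np.array([0.5, 0.5])   # ambiguous feedforward evidence alone
prior      = np.array([0.8, 0.2])   # top-down contextual expectation
posterior  = combine(likelihood, prior)
```

The hierarchical versions (e.g. Lee and Mumford) chain such updates across areas until bottom-up and top-down messages agree.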

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data
Tomaso Poggio and Steve Smale
Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences …

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models (limitations)
  • What is next: beyond feedforward models (limitations)
  • Limitations: beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 55: NIPS2007: visual recognition in primates and machines

Comparison w| V4

Pasupathy amp Connor 2001

Tuning for curvature and

boundary conformations

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98 1733-1750 2007 First published June 27 2007

A Model of V4 Shape Selectivity and InvarianceCharles Cadieu Minjoon Kouh Anitha Pasupathy Charles E Connor Maximilian

Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells)

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 unitsReference (fixed)

Probe (varying)

= response(probe) ndashresponse(reference)

= re

sp (p

air)

ndashre

sp (r

efer

ence

) Prediction Response of the pair is predicted to fall between the responses elicited by the stimuli alone

(Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005)

Presenter
Presentation Notes
Biased competition model predicts that with attention away line should pass through origin and should get a gt o coef

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 56: NIPS2007: visual recognition in primates and machines

V4 neuron tuned to boundary conformations

ρ

= 078

Most similar model C2 unit

Pasupathy amp Connor 1999

No parameter fitting

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Finally I would like to point out that the learning rule generates S2 units that seem to be congruent with V4 explain figures etchellip Charles Cadieu and Minjoon Kouh in our lab have shown that the population of S2C2 units generated is compatible with data from several group (at the population level) We can talk about it in the discussion (I have slides) if everybody is interestedhellip

J Neurophysiol 98: 1733-1750, 2007. First published June 27, 2007.

A Model of V4 Shape Selectivity and Invariance
Charles Cadieu, Minjoon Kouh, Anitha Pasupathy, Charles E. Connor, Maximilian Riesenhuber and Tomaso Poggio

Presenter
Presentation Notes
More recently, Charles Cadieu and Minjoon Kouh pushed the analysis further and showed that it is indeed possible to fit the model C2 units to single cells in V4 (the weights are no longer learned from natural images but learned using a greedy learning algorithm to fit single cells).

V4 neurons (with attention directed away from receptive field) (Reynolds et al., 1999) vs. C2 units

Reference (fixed); Probe (varying)

Axes: response(probe) – response(reference) vs. resp(pair) – resp(reference)

Prediction: the response to the pair falls between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio, 2005)

Presenter
Presentation Notes
The biased competition model predicts that with attention away the line should pass through the origin and should have a coefficient > 0.

Agreement w/ IT readout data

Hung, Kreiman, Poggio & DiCarlo, 2005
Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio, 2005

Presenter
Presentation Notes
Here is an example: the readout data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well.
• In the theory, the stage between IT and "PFC" is a linear classifier – like the one used in the read-out experiments.
• The inputs to IT are a large dictionary of selective and invariant features.
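The two stages in these remarks – a Gaussian RBF layer feeding a linear classifier – can be sketched in a few lines. This is a toy with random data and regularized least squares, not the published implementation; all names and parameter values here are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "IT-like" feature vectors for two object classes
X = np.vstack([rng.normal(0.0, 1.0, (50, 20)),
               rng.normal(1.0, 1.0, (50, 20))])
y = np.repeat([-1.0, 1.0], 50)

# Gaussian RBF stage: each unit is tuned to a stored template (center)
centers = X[rng.choice(len(X), 10, replace=False)]
sigma = 4.0
sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
Phi = np.exp(-sq_dist / (2 * sigma ** 2))

# Linear classifier stage (regularized least squares), like the readout experiments
w = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(len(centers)), Phi.T @ y)
accuracy = ((Phi @ w > 0) == (y > 0)).mean()
```

The point is only the two-stage structure: a tuned, template-based feature layer followed by a simple linear readout.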

Rapid categorization

SHOW ANIMAL / NON-ANIMAL MOVIE

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms
Interval image–mask (ISI): 30 ms
Mask (1/f noise): 80 ms

Thorpe et al., 1996; Van Rullen & Koch, 2003; Bacon-Mace et al., 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. MODEL IS FEEDFORWARD (YOU CAN ASK ME DURING QUESTION TIME). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%
Model: 82%

Serre, Oliva & Poggio, 2007

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job.
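The d' measure used above can be computed from hit and false-alarm rates with the inverse normal CDF; a minimal sketch (the function name and the example rates are ours, chosen to be near the values reported here):

```python
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """Sensitivity index: d' = Z(hit rate) - Z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# e.g. ~80% hits with ~16% false alarms gives a d' of about 1.8
sensitivity = d_prime(0.80, 0.16)
```

d' is monotone in performance: more hits and fewer false alarms both increase it, which is why it serves as a standardized measure across observers and the model.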

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse the image into its main objects. What they did is use the features produced by the model, computed over small windows at all positions and scales in the image, which they classified as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object           | car  | pedestrian | bicycle | building | tree | road | sky
Labeled examples | 5799 | 1449       | 209     | 5067     | 4932 | 3400 | 2562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO: Add 420 True Labels to this slide.

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs, 2005)
• Part-based system (Leibe et al., 2004)
• Local patch correlation (Torralba et al., 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak, 1991; Perrett et al., 1984; Wallis & Rolls, 1997; Einhauser et al., 2002; Wiskott & Sejnowski, 2002; Spratling, 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage.
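The temporal-continuity idea is often formalized as a trace rule (Foldiak, 1991): the Hebbian update uses a temporally low-passed "trace" of the unit's activity, so inputs that occur close together in time (e.g. the same object at successive positions) strengthen the same weights. Below is a toy sketch of the update, not a reproduction of the cited work; the input sequence and all parameters are invented:

```python
import numpy as np

def trace_rule(frames, eta=0.05, delta=0.8, seed=1):
    """One linear unit trained with a Hebbian trace rule:
    w += eta * ybar * x, where ybar is an exponential average of y."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.1, frames.shape[1])
    ybar = 0.0
    for x in frames:
        y = w @ x
        ybar = delta * ybar + (1 - delta) * y  # activity trace (temporal average)
        w += eta * ybar * x                    # Hebbian update driven by the trace
        w /= np.linalg.norm(w)                 # keep weights bounded
    return w

# Toy stimulus: a single active pixel drifting across a 1-D "retina" over time
frames = np.zeros((200, 10))
for t in range(200):
    frames[t, t % 10] = 1.0

w = trace_rule(frames)
```

Because the trace links temporally adjacent (hence, for a drifting stimulus, spatially adjacent) inputs, the learned weights tend to spread over the positions the stimulus visits – a step toward the position invariance attributed to complex cells.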

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos / testing videos

Each video ~4 s, 50–100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al., 2005); see also (Casile & Giese, 2005; Sigala et al., 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio, 2003)

Multi-class recognition accuracy

Dataset     | Baseline | Our system
KTH Human   | 81.3%    | 91.6%
UCSD Mice   | 75.6%    | 79.0%
Weiz. Human | 86.7%    | 96.3%
Average     | 81.2%    | 89.6%

(chance: 10–20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva & Poggio, 2007
IT: Zoccolan, Kouh, Poggio & DiCarlo, 2007
V4: Reynolds, Chelazzi & Desimone, 1999

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe, 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms the model is not good enough

Conditions: model; 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask

(Serre, Oliva and Poggio, PNAS 2007)

Presenter
Presentation Notes
FA ~16% for humans (except no mask, with 14%) and the model ~18%.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)

Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Plot: face detection, scanning vs. attention – detection accuracy vs. number of items]
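The parallel + serial scheme can be caricatured in a few lines: top-down feature weights bias a saliency map (parallel stage), then locations are visited in order of saliency with inhibition of return (serial stage). A sketch with made-up maps and weights, not the model cited above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Bottom-up feature maps for a tiny "scene" (two feature channels)
feature_maps = {"face-like": rng.random((4, 4)), "clutter": rng.random((4, 4))}

# Parallel stage: task-driven weights emphasize target-like features
top_down = {"face-like": 1.0, "clutter": 0.1}
saliency = sum(top_down[f] * m for f, m in feature_maps.items())

# Serial stage: visit peaks one at a time with inhibition of return
visited = []
s = saliency.copy()
for _ in range(3):
    loc = np.unravel_index(np.argmax(s), s.shape)
    visited.append(loc)
    s[loc] = -np.inf  # inhibition of return: never revisit a location
```

The design point is the division of labor: the parallel stage is cheap and image-wide, while the expensive serial stage only examines the few locations the biased saliency map proposes.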

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models: e.g., probabilistic inference in the ventral stream, where neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Lee and Mumford, 2003; Dean, 2005; Rao, 2004; Hawkins, 2004; Ullman, 2007; Hinton, 2005
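The analysis-by-synthesis claim – neurons representing conditional probabilities of bottom-up inputs given a top-down hypothesis – reduces, in the simplest two-level case, to Bayes' rule combining a top-down prior with a bottom-up likelihood. The numbers below are invented for illustration:

```python
# Two competing top-down hypotheses about the image
prior = {"animal": 0.5, "non-animal": 0.5}

# Bottom-up likelihood: P(feature detector fires | hypothesis)
likelihood = {"animal": 0.9, "non-animal": 0.2}

# Bayes' rule: posterior over hypotheses given that the detector fired
evidence = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
```

In hierarchical versions (e.g. Lee and Mumford, 2003) this bottom-up/top-down message passing is iterated across cortical levels until the beliefs are globally consistent.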

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al., 2006, …)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data
Tomaso Poggio and Steve Smale
Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

Page 58: NIPS2007: visual recognition in primates and machines

V4 neurons (with attention directed away

from receptive field)

(Reynolds

et al 1999)

C2 units

Reference (fixed) vs. probe (varying); plot axes: response(probe) − response(reference) against resp(pair) − resp(reference).

Prediction: the response to the pair is predicted to fall between the responses elicited by the stimuli alone.

(Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005)

Presenter
Presentation Notes
The biased-competition model predicts that with attention away the line should pass through the origin and the slope coefficient should be > 0.
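The MAX-like pooling behind this prediction can be sketched in a few lines. This is a toy illustration, not the published simulation; the two response values are made-up numbers:

```python
import numpy as np

def c2_response(afferent_responses):
    # C2 units pool their afferents with a MAX-like operation
    return np.max(afferent_responses)

reference = 0.8  # hypothetical response to the reference alone
probe = 0.3      # hypothetical response to the probe alone
pair = c2_response([reference, probe])

# Under a MAX operation the pair response equals the larger of the
# two single-stimulus responses, hence falls between (inclusive)
# the responses elicited by the stimuli presented alone.
assert min(reference, probe) <= pair <= max(reference, probe)
```

With a pure MAX the pair response sits at the upper end of that interval; softer pooling (e.g. a softmax) would place it strictly between the two.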

Agreement w/ IT Readout data

Hung, Kreiman, Poggio, DiCarlo 2005

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

Presenter
Presentation Notes
Here is an example: the read-out data that Gabriel showed earlier. The level of invariance to position and scale is actually predicted by the model. One key result from this study is that the visual system is extremely fast once the visual signal reaches IT, which is where they recorded in the study: from the very first spikes one can already read out object category and identity.

Remarks

• The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type, which is known (from learning theory) to generalize well.
• In the theory, the stage between IT and "PFC" is a linear classifier, like the one used in the read-out experiments.
• The inputs to IT are a large dictionary of selective and invariant features.
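The linear read-out stage can be illustrated with a toy least-squares classifier on synthetic features. Everything here (the random "IT-like" features, the labels, and the least-squares fit) is an assumption for illustration, not the published read-out pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "IT-like" feature vectors with labels generated by a
# hypothetical linear rule (illustration only).
X = rng.normal(size=(200, 20))
w_true = rng.normal(size=20)
y = X @ w_true > 0

# Least-squares linear classifier, standing in for the linear
# stage between IT and "PFC".
w = np.linalg.lstsq(X, np.where(y, 1.0, -1.0), rcond=None)[0]
train_acc = ((X @ w > 0) == y).mean()
```

The point of the remark is that once the feature dictionary is selective and invariant, a plain linear decision rule suffices for the final stage.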

Rapid categorization

[SHOW ANIMAL / NON-ANIMAL MOVIE]

Database collected by Oliva & Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficult…

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

Image: 20 ms; image–mask interval (ISI): 30 ms; mask (1/f noise): 80 ms

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter
Presentation Notes
We used a 30 ms ISI with a 20 ms presentation; this is equivalent to an SOA of 50 ms, which is fairly long. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%; model: 82%

Serre, Oliva & Poggio 2007

• d′ ~ standardized error rate
• the higher the d′, the better the performance

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job.
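For readers unfamiliar with d′: it is computed from hit and false-alarm rates via the inverse normal CDF. The rates below are illustrative only (roughly the magnitudes quoted in the presenter notes), not the exact published values:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    # d' = z(hit rate) - z(false-alarm rate),
    # where z is the inverse CDF of the standard normal.
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Illustrative numbers: ~80% hits, ~16% false alarms
print(round(d_prime(0.80, 0.16), 2))  # -> 1.84
```

Unlike raw percent correct, d′ separates sensitivity from response bias, which is why it is the standard measure for this kind of human-vs-model comparison.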

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007
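The ρ values above are Pearson correlations between per-image performance of humans and model; a minimal sketch of that computation, with made-up per-image hit rates (not the published data):

```python
import numpy as np

# Hypothetical per-image hit rates for humans and model
# (illustration only -- not the published data).
human = np.array([0.9, 0.7, 0.8, 0.4, 0.6, 0.95, 0.5])
model = np.array([0.85, 0.6, 0.9, 0.5, 0.55, 0.9, 0.45])

rho = np.corrcoef(human, model)[0, 1]  # Pearson correlation
```

A high image-by-image ρ is a stronger test than matching average accuracy: it says the model finds hard the same images humans find hard.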

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is the old AI dream: ideally you would input an image such as the street scene on the left, and the system would automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bicycles, trees, buildings, skies and roads.

The StreetScenes Database

Object:            car   | pedestrian | bicycle | building | tree | road | sky
Labeled examples:  5799  | 1449       | 209     | 5067     | 4932 | 3400 | 2562

3547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO: Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code


• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time).
Complex cells learn correlation in time.

[SHOW MOVIE]

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage
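The idea of learning invariance from temporal continuity can be sketched with a Foldiak (1991)-style trace rule. This is a toy version under assumed parameters (learning rate, decay, one-hot "frames"), not the simulation run on the video:

```python
import numpy as np

def trace_update(w, x, trace, eta=0.05, decay=0.8):
    # Trace rule (toy version): the unit keeps a decaying trace of
    # its recent activity, so inputs that occur close together in
    # time strengthen onto the same unit.
    trace = decay * trace + (1 - decay) * float(w @ x)
    w = w + eta * trace * x
    return w / np.linalg.norm(w), trace

# A pattern translating over "time": each frame activates a
# different input line, mimicking a stimulus shifting position.
frames = [np.eye(4)[i] for i in (0, 1, 2)]
w = np.full(4, 0.5)   # unit-norm initial weights
trace = 0.0
for x in frames:
    w, trace = trace_update(w, x, trace)
# The unit now responds to all three positions it saw in sequence,
# i.e. it has acquired some translation tolerance.
```

Simple cells learn what co-occurs within a frame; the trace makes a complex-like unit learn what co-occurs across adjacent frames.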


The problem: training videos, testing videos

Each video ~4 s, 50~100 frames

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

             Baseline | Our system
KTH Human    81.3%    | 91.6%
UCSD Mice    75.6%    | 79.0%
Weiz. Human  86.7%    | 96.3%
Average      81.2%    | 89.6%

Chance: 10~20%. H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models (limitations)

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models (limitations)

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

[Figure: model vs. human performance for 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and the no-mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, model not good enough

Presenter
Presentation Notes
False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Figure: face detection, scanning vs. attention; detection accuracy vs. number of items]

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream; neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
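The core computation these models share is Bayes' rule combining a top-down prior with a bottom-up likelihood. A minimal numerical sketch, with made-up probabilities (not from any of the cited models):

```python
# Posterior over a hypothesis h given evidence e:
# P(h | e) is proportional to P(e | h) * P(h).
prior = {"animal": 0.5, "non-animal": 0.5}        # top-down hypothesis
likelihood = {"animal": 0.8, "non-animal": 0.2}   # bottom-up evidence
unnorm = {h: prior[h] * likelihood[h] for h in prior}
Z = sum(unnorm.values())                          # normalization constant
posterior = {h: p / Z for h, p in unnorm.items()}
# posterior["animal"] == 0.8
```

In the hierarchical versions (e.g. Lee and Mumford 2003) this update is iterated between areas until the beliefs at all levels are globally consistent.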

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity: thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

T. Serre

Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 59: NIPS2007: visual recognition in primates and machines

Agreement w| IT Readout data

Hung Kreiman Poggio DiCarlo 2005

Serre Kouh Cadieu Knoblich Kreiman amp Poggio 2005

Presenter
Presentation Notes
Here is an example for instance the read out data that Gabriel showed earlier The level of invariance to position and scale is actually predicted by the model13One key result from this study is that the visual system is extremely fast Once the visual signal reaches IT which is where they recorded from in the study From the very first spikes one can already read out object category and identity

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 60: NIPS2007: visual recognition in primates and machines

Remarks

bull

The stage that includes (V4-PIT)-AIT-PFC represents a learning network of the Gaussian RBF type that is known (from learning theory) to generalize well

bull

In the theory the stage between IT and lsquorsquoPFCrdquo

is a linear classifier ndash

like the one used in the read-

out experimentsbull

The inputs to IT are a large dictionary of selective and invariant features

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object:            car   pedestrian  bicycle  building  tree  road  sky
Labeled examples:  5799     1449       209      5067    4932  3400  2562

3547 images, all taken with the same camera, of the same type of scene, and hand-labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter Notes: TODO: Add 420 True Labels to this slide.

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units
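The unsupervised stage above can be caricatured as sampling image patches to serve as stored templates, with S units Gaussian-tuned to them. A toy sketch; the patch size, σ, and random-sampling scheme are assumptions for illustration, not the model's actual parameters:

```python
import math
import random

def learn_templates(images, n_templates=4, patch=3):
    """Developmental-like stage (caricature): sample random patches
    from images to form a dictionary of stored templates."""
    templates = []
    for _ in range(n_templates):
        img = random.choice(images)
        y = random.randrange(len(img) - patch + 1)
        x = random.randrange(len(img[0]) - patch + 1)
        templates.append([row[x:x + patch] for row in img[y:y + patch]])
    return templates

def s_response(patch, template, sigma=1.0):
    """Gaussian tuning: response peaks at 1.0 when input matches the template."""
    d2 = sum((p - t) ** 2
             for prow, trow in zip(patch, template)
             for p, t in zip(prow, trow))
    return math.exp(-d2 / (2 * sigma ** 2))
```

An S unit's response falls off with Euclidean distance from its stored template, giving the selectivity half of the selectivity/invariance trade-off; the task-specific supervised circuits are then trained on top of this generic dictionary.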

Learning the invariance from temporal continuity
with T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

[SHOW MOVIE]

Presenter Notes: ~19 hours of video, ~10^6 frames; a developmental-like learning stage.
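The temporal-continuity idea (complex cells learn correlation in time) is often formalized as a Foldiak-style trace rule, in which the Hebbian update uses a temporally smoothed activity instead of the instantaneous one. A toy sketch under assumed parameters, not the actual Masquelier & Thorpe model:

```python
import random

def trace_rule(frames, alpha=0.1, delta=0.2):
    """Hebbian learning with an activity trace: the weight update uses
    temporally smoothed activity, so inputs that persist across consecutive
    frames get bound to the same unit (cf. Foldiak 1991)."""
    n = len(frames[0])
    w = [random.uniform(0.0, 0.1) for _ in range(n)]
    trace = 0.0
    for x in frames:
        y = sum(wi * xi for wi, xi in zip(w, x))   # instantaneous activity
        trace = (1 - delta) * trace + delta * y    # temporal smoothing
        w = [wi + alpha * trace * xi for wi, xi in zip(w, x)]
        norm = sum(v * v for v in w) ** 0.5 or 1.0
        w = [wi / norm for wi in w]                # keep weights bounded
    return w

random.seed(0)
# A feature that stays on across frames captures the unit's weight vector.
w = trace_rule([[1.0, 0.0]] * 50)
print([round(v, 2) for v in w])
```

Because the trace persists across frames, different images of the same object seen in temporal succession drive the same unit, which is one proposed route to learning the invariance of complex cells without supervision.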

The problem: training videos / testing videos

Each video ~4 s, 50-100 frames
Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (chance: 10-20%):

Dataset       Baseline   Our system
KTH Human       81.3%      91.6%
UCSD Mice       75.6%      79.0%
Weiz. Human     86.7%      96.3%
Average         81.2%      89.6%

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter Notes: Our method has ~8% better performance on average.

What is next: beyond feedforward models, limitations

Psychophysics: Serre, Oliva & Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Conditions: 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), no-mask condition
(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, the model is not good enough

Presenter Notes: False-alarm rate ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994)
Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Face detection: scanning vs. attention (plot: detection accuracy vs. number of items)

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
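The analysis-by-synthesis idea can be illustrated in the simplest possible case: a top-down prior from a higher area combined with bottom-up evidence by Bayes' rule. This is a toy illustration of the idea, not the Lee-Mumford model itself:

```python
def posterior(prior, likelihood):
    """Combine a top-down prior P(h) with a bottom-up likelihood P(input | h)
    by Bayes' rule, normalizing over the hypothesis set."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Ambiguous bottom-up evidence is disambiguated by the top-down prior.
prior = {"animal": 0.7, "non-animal": 0.3}       # contextual expectation
likelihood = {"animal": 0.5, "non-animal": 0.5}  # uninformative image evidence
print(posterior(prior, likelihood))              # prior carries the decision
```

In the hierarchical versions cited above, each area runs such an update with its neighbors, so top-down and bottom-up messages are iterated until the hypotheses at all levels are globally consistent.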

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation.

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

T. Serre

Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w| humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: orientation selectivity
  • V1: retinotopy
  • Slide Number 34
  • Beyond V1: a gradual increase in RF size
  • Beyond V1: a gradual increase in the complexity of the preferred stimulus
  • AIT: face cells
  • AIT: immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models, limitations
  • What is next: beyond feedforward models, limitations
  • Limitations: beyond 50 ms, the model is not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators

Rapid categorization

SHOW ANIMAL NON_ANIMAL

MOVIE

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 62: NIPS2007: visual recognition in primates and machines

Database collected by Oliva amp Torralba

Presenter
Presentation Notes
Any task can be made arbitrarily easy or difficulthellip

Rapid categorization task (with mask to test feedforward model)

Animal presentor not

30 ms ISI

20 ms

Image

Interval Image-Mask

Mask1f noise

80 ms

Thorpe et al 1996 Van Rullen

amp Koch 2003 Bacon-Mace et al 2005

Presenter
Presentation Notes
We used 30 ms ISI with 20 ms presentation this is equivalent to SO of 50ms This is a fairly long SOA 1313MODEL IS FEEDFOWARD (YOU CAN ASK ME DURING QUESTION TIME)13Mask tries to block feedback in visual cortex

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines.

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Bayesian models: Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

T. Poggio and S. Smale, "The Mathematics of Learning: Dealing with Data," Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models, limitations
  • What is next: beyond feedforward models, limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 63: NIPS2007: visual recognition in primates and machines

Rapid categorization task (with mask to test feedforward model)

Animal present or not?

[Stimulus sequence: image (20 ms); image-mask interval (ISI 30 ms); mask, 1/f noise (80 ms)]

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Presenter notes: We used 30 ms ISI with 20 ms presentation, equivalent to an SOA of 50 ms; this is a fairly long SOA. The model is feedforward (you can ask me during question time). The mask tries to block feedback in visual cortex.

…solves the problem (when mask forces feedforward processing)…

Human observers (n = 24): 80%. Model: 82%. (Serre, Oliva & Poggio 2007)

• d' ~ standardized error rate
• the higher the d', the better the performance

Presenter notes: We tried simpler computer vision systems and they could not do the job.
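The d' mentioned above is the standard signal-detection sensitivity index, d' = z(hit rate) - z(false-alarm rate), with z the inverse normal CDF. A quick sketch; the 80%/16% pairing simply reuses rates quoted on these slides as an illustration, not an official analysis:

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index from signal detection theory:
    d' = z(hit rate) - z(false-alarm rate), z being the probit."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# e.g. 80% hits with ~16% false alarms
sensitivity = d_prime(0.80, 0.16)   # ~1.84
```

Because d' removes response bias, it is a fairer way to compare model and human performance than raw percent correct.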

Further comparisons

• Image-by-image correlation:
  - Heads: ρ = 0.71
  - Close-body: ρ = 0.84
  - Medium-body: ρ = 0.71
  - Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007
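The image-by-image correlations above are ordinary Pearson ρ between per-image model and human performance. A self-contained sketch; the accuracy vectors are fabricated for illustration, not the PNAS data:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# made-up per-image accuracies for model and human observers
model_acc = [0.9, 0.8, 0.4, 0.7, 0.3]
human_acc = [0.95, 0.7, 0.5, 0.8, 0.2]
rho = pearson_r(model_acc, human_acc)   # high positive correlation
```

A high ρ means the model finds hard the same individual images that humans find hard, a stronger test than matching average accuracy.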

The street scene project (source: Bileschi & Wolf)

Presenter notes: Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left and have the system automatically parse it into its main objects. They used the features produced by the model, computed over small windows at all positions and scales in the image, and classified each window as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object:           car   pedestrian  bicycle  building  tree  road  sky
Labeled examples: 5799  1449        209      5067      4932  3400  2562

3547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes/

Presenter notes: TODO: Add 420 True Labels to this slide.
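The presenter notes describe a scan-and-classify pipeline: model features computed over small windows at all positions and scales, each window classified into one of the seven classes. A structural sketch of that scanning loop; the window size, stride, scales and the `classify` callback are placeholders, not the Bileschi & Wolf code:

```python
def sliding_windows(width, height, base=128, stride=64, scales=(1.0, 0.5)):
    """Yield square (x, y, size) crops at all positions and several scales;
    smaller scale factors mean larger windows on the original image."""
    for s in scales:
        size = int(base / s)
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield x, y, size

def label_scene(width, height, classify):
    """Run a classifier over every window; classify((x, y, size)) should
    return one of the 7 StreetScenes classes, or None for background."""
    detections = []
    for x, y, size in sliding_windows(width, height):
        label = classify((x, y, size))
        if label is not None:
            detections.append((label, x, y, size))
    return detections
```

In the real system the classifier consumes the model's C-level features for each crop; here it is just a callback so the scanning structure stands alone.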

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007


What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  - Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  - Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
(w/ T. Masquelier & S. Thorpe, CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time).
Complex cells learn correlation in time.

[movie]

Presenter notes: ~19 hours of video, ~10^6 frames; developmental-like learning stage.
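The "complex cells learn correlation in time" idea is usually formalized as a trace rule (Foldiak 1991): a unit keeps a slowly decaying trace of its own activity, so stimuli appearing in temporal succession (e.g. successive views of the same object) reinforce the same weights. A toy sketch, with made-up learning and decay rates and a small constant drive so learning can bootstrap from zero weights; this is not the Masquelier & Thorpe model:

```python
def trace_rule(inputs, n_dims, lr=0.1, decay=0.8, drive=0.1):
    """Hebbian learning with a temporal trace: the weight update uses
    the decaying activity trace instead of the instantaneous response,
    binding temporally adjacent inputs to the same unit."""
    w = [0.0] * n_dims
    trace = 0.0
    for x in inputs:                              # x: input vector per frame
        y = sum(wi * xi for wi, xi in zip(w, x)) + drive * sum(x)
        trace = decay * trace + (1 - decay) * y   # slowly decaying trace
        w = [wi + lr * trace * (xi - wi) for wi, xi in zip(w, x)]
    return w

# 200 frames of the same "object" (input dimension 0 active): the unit's
# weight onto that dimension grows while the unused dimension stays flat
weights = trace_rule([(1, 0)] * 200, n_dims=2)
```

Because the trace bridges consecutive frames, inputs that co-occur in time (rather than in space, as for simple cells) end up wired to the same unit, which is exactly the invariance-learning story on this slide.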

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 64: NIPS2007: visual recognition in primates and machines

hellipsolves the problem (when mask forces feedforward processing)hellip

human-

observers (n = 24) 80

Model 82

Serre Oliva amp Poggio 2007

bull

drsquo~ standardized error rate bull

the higher the drsquo the better the perf

Human 80

Presenter
Presentation Notes
We tried simpler computer vision systems and they could not do the job

Further comparisons

bull

Image-by-image correlationndash

Heads ρ=071

ndash

Close-body ρ=084 ndash

Medium-body ρ=071

ndash

Far-body ρ=060

bull

Model predicts level of performance on rotated images (90 deg and inversion)

Serre Oliva amp Poggio PNAS 2007

Source Bileschi amp Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model The work was performed by Stan Bileschi and Lior Wolf Lior was a postdoc in the lab and this was part of Stanrsquos PhD thesis This is kind of the old AI dream where ideally you would like to input such image as the street scene image on the left and you would like the system to automatically parse the image into its main objects13What they did is to use the features produced by the model that were computed over small windows at all positions and scales in the image that they classified as being either one of 7 possible objects cars pedestrians bikes trees buildings skies and road13

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models: limitations
  • What is next: beyond feedforward models: limitations
  • Limitations: beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 65: NIPS2007: visual recognition in primates and machines

Further comparisons

• Image-by-image correlation:
  – Heads: ρ = 0.71
  – Close-body: ρ = 0.84
  – Medium-body: ρ = 0.71
  – Far-body: ρ = 0.60
• Model predicts level of performance on rotated images (90 deg and inversion)

Serre, Oliva & Poggio, PNAS 2007
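As a concrete illustration of the image-by-image correlation measure behind the ρ values above — a Pearson correlation between per-image model and human performance — here is a minimal sketch; the accuracy vectors are invented stand-ins, not the paper's data:

```python
# Illustrative sketch (not the paper's code): image-by-image correlation
# between model and human performance, as in the rho values above.

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-image accuracies (fraction correct per image).
model_acc = [0.90, 0.80, 0.40, 0.70, 0.95, 0.30]
human_acc = [0.85, 0.75, 0.50, 0.80, 0.90, 0.35]
print(round(pearson_r(model_acc, human_acc), 2))
```

A high ρ on such vectors means model and humans find the same individual images easy or hard, which is a stronger test than matching overall accuracy.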

Source: Bileschi & Wolf

The street scene project

Presenter
Presentation Notes
Here is another application of the model. The work was performed by Stan Bileschi and Lior Wolf; Lior was a postdoc in the lab, and this was part of Stan's PhD thesis. This is kind of the old AI dream: ideally you would like to input an image such as the street scene on the left, and have the system automatically parse the image into its main objects. What they did is to use the features produced by the model, computed over small windows at all positions and scales in the image, and classify each window as one of 7 possible objects: cars, pedestrians, bikes, trees, buildings, skies and road.

The StreetScenes Database

Object:            car    pedestrian  bicycle  building  tree   road   sky
Labeled Examples:  5,799  1,449       209      5,067     4,932  3,400  2,562

3,547 images, all taken with the same camera, of the same type of scene, and hand labeled with the same objects using the same labeling rules.

http://cbcl.mit.edu/software-datasets/streetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide
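The presenter notes above describe the pipeline: features computed on windows at all positions and scales, each window classified as one of the 7 objects. A toy sketch of that window-and-classify structure follows; the 2-D features and nearest-prototype "classifier" are invented stand-ins for the real model features and trained classifiers:

```python
# Toy sketch of the StreetScenes-style pipeline described in the notes:
# enumerate windows at several positions/scales, label each one.

LABELS = ["car", "pedestrian", "bicycle", "building", "tree", "road", "sky"]

def windows(width, height, sizes):
    """Enumerate square windows at every position, for each scale."""
    for s in sizes:
        for x in range(0, width - s + 1, s):
            for y in range(0, height - s + 1, s):
                yield (x, y, s)

def classify(feature, prototypes):
    """Nearest-prototype labeling of one window's feature vector."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(prototypes, key=lambda lab: dist(feature, prototypes[lab]))

# Hypothetical 2-D features: one made-up prototype per class.
protos = {lab: (i / 7.0, 1 - i / 7.0) for i, lab in enumerate(LABELS)}
wins = list(windows(8, 8, sizes=[4, 8]))
labels = [classify((0.1, 0.9), protos) for _ in wins]
print(len(wins), labels[0])
```

The real system replaces the toy feature and prototype distance with model (C-level) features and supervised classifiers, one per object class.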

Examples

Examples

Examples

Examples

• HoG (Dalal & Triggs 2005)
• Part-based system (Leibe et al. 2004)
• Local patch correlation (Torralba et al. 2004)

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition, visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next?

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised, developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised, developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity
w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video, ~10^6 frames; developmental-like learning stage
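The idea that complex cells could learn their pooling from correlation in time can be sketched with a Foldiak-style trace rule (one of the references above): a slowly decaying activity trace strengthens weights to inputs that follow each other in time. All parameters, the seed-afferent drive, and the toy "movie" here are illustrative, not the actual model:

```python
# Foldiak-style trace learning sketch: the unit's activity trace decays
# slowly, so inputs active shortly after the unit fires get strengthened.

def trace_learning(sequence, n_inputs, eta=0.2, delta=0.5, epochs=20):
    w = [0.0] * n_inputs
    trace = 0.0
    for _ in range(epochs):
        for active in sequence:                     # one input line per frame
            x = [1.0 if i == active else 0.0 for i in range(n_inputs)]
            drive = 1.0 if active == 0 else 0.0     # seed afferent (assumed)
            y = sum(wi * xi for wi, xi in zip(w, x)) + drive
            trace = (1 - delta) * trace + delta * y  # temporal trace of y
            # move weights of active inputs toward 1, others toward 0,
            # gated by the trace (not the instantaneous response)
            w = [wi + eta * trace * (xi - wi) for wi, xi in zip(w, x)]
    return w

# A "movie": the same bar drifting across inputs 0, 1, 2 in succession;
# input 3 never follows them in time.
w = trace_learning(sequence=[0, 1, 2], n_inputs=4)
print([round(wi, 2) for wi in w])
```

The unit ends up pooling the temporally contiguous inputs 0–2 (an invariance over the drift) while the never-following input 3 stays disconnected.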

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised, developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem

Training videos / testing videos: each video ~4 s, 50~100 frames
bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset        Baseline   Our system
KTH Human      81.3       91.6
UCSD Mice      75.6       79.0
Weiz. Human    86.7       96.3
Average        81.2       89.6

(chance: 10~20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter
Presentation Notes
Our method has 8% better performance on average.

What is next: beyond feedforward models: limitations

Psychophysics: Serre, Oliva, Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio, DiCarlo 2007

What is next: beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, model not good enough

[Plot legend: model; 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), no mask condition]

(Serre, Oliva and Poggio, PNAS 2007)

Presenter
Presentation Notes
FA: ~16% for humans (except no mask, with 14%) and the model ~18%.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Plot: detection accuracy vs. number of items]
Face detection: scanning vs attention

Example Results
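The scanning-vs-attention comparison in the figure above can be rendered as a toy computation: exhaustive scanning runs the expensive classifier on every candidate window, while feature-based attention ranks windows by a cheap top-down score and spends the classifier only on the top few. The saliency score and the "classifier" here are invented stand-ins:

```python
# Toy scanning-vs-attention comparison: same detections, fewer expensive
# classifier evaluations when an attentional bottleneck pre-ranks windows.

def detect(windows, saliency, expensive_test, top_k=None):
    """Return detected windows and the number of expensive evaluations."""
    order = sorted(windows, key=saliency, reverse=True)
    if top_k is not None:
        order = order[:top_k]            # attentional bottleneck
    hits = [w for w in order if expensive_test(w)]
    return hits, len(order)

wins = list(range(100))                  # 100 candidate windows
saliency = lambda w: 1.0 if w % 7 == 0 else 0.0   # cheap feature match
is_face = lambda w: w == 7               # stand-in for the full classifier

scan_hits, scan_cost = detect(wins, saliency=lambda w: 0, expensive_test=is_face)
att_hits, att_cost = detect(wins, saliency, is_face, top_k=15)
print(scan_hits, scan_cost, att_hits, att_cost)
```

Both strategies find the same target, but the attention pass evaluates far fewer windows — the computational argument for the bottleneck.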

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models: e.g. probabilistic inference in the ventral stream; neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
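In the spirit of the Lee & Mumford-style formulation above, a minimal sketch of how conditional probabilities at an intermediate layer combine with a top-down prior into a globally consistent posterior; the two hypotheses and all probability tables are invented for illustration:

```python
# Toy analysis-by-synthesis inference: a hypothesis layer, an intermediate
# feature layer, and bottom-up evidence, combined by marginalization.

p_h = {"animal": 0.5, "non-animal": 0.5}              # top-down prior
p_f_given_h = {                                       # mid-level features
    "animal": {"fur-like": 0.8, "smooth": 0.2},
    "non-animal": {"fur-like": 0.1, "smooth": 0.9},
}
p_e_given_f = {"fur-like": 0.7, "smooth": 0.2}        # P(evidence | feature)

def posterior(p_h, p_f_given_h, p_e_given_f):
    """P(h | e), marginalizing over the intermediate feature layer."""
    unnorm = {
        h: p_h[h] * sum(p_f_given_h[h][f] * p_e_given_f[f] for f in p_e_given_f)
        for h in p_h
    }
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

post = posterior(p_h, p_f_given_h, p_e_given_f)
print({h: round(p, 3) for h, p in post.items()})
```

Each factor plays the role the slide describes: the feature layer carries conditional probabilities of the bottom-up input given each top-down hypothesis, and normalization makes the beliefs globally consistent.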

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data, Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

Page 67: NIPS2007: visual recognition in primates and machines

The StreetScenes Database

Object car pedestrian bicycle building tree road sky

Labeled Examples 5799 1449 209 5067 4932 3400 2562

3547 Images all taken with the same camera of the same type of scene and hand labeled with the same objects using the same labeling rules

httpcbclmitedusoftware-datasetsstreetscenes

Presenter
Presentation Notes
TODO Add 420 True Labels to this slide

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) mechanisms to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.
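As a rough sketch of the parallel/serial scheme just described (not the Chikkerur et al. implementation, which is only cited here), both feature-based and spatial attention can be cartooned as multiplicative gains on a stack of feature maps; all names and toy numbers below are hypothetical.

```python
import numpy as np

def attend(feature_maps, target_template, spatial_window):
    """Multiplicatively gate bottom-up feature maps with top-down signals.

    feature_maps:    (n_features, H, W) bottom-up responses
    target_template: (n_features,) feature-based weights (parallel search)
    spatial_window:  (H, W) spatial gain field (serial, spatial attention)
    """
    weighted = feature_maps * target_template[:, None, None]  # feature-based attention
    saliency = weighted.sum(axis=0) * spatial_window          # spatial gating
    return saliency

# Toy scene: target-preferred feature at (1, 1), distractor feature at (3, 3)
maps = np.zeros((2, 5, 5))
maps[0, 1, 1] = 1.0
maps[1, 3, 3] = 1.0
template = np.array([1.0, 0.1])   # task set: look for feature 0
window = np.ones((5, 5))          # uniform spatial attention (no serial bias yet)
s = attend(maps, template, window)
peak = tuple(int(i) for i in np.unravel_index(s.argmax(), s.shape))
print(peak)  # -> (1, 1): the task-relevant location wins; clutter is suppressed
```

The feature template implements the parallel, feature-based stage; replacing `window` with a localized bump would implement the serial, spatial stage that visits candidate locations one at a time.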

[Plot: detection accuracy vs. number of items; face detection, scanning vs. attention]

Example Results

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  - Which biophysical mechanisms for routing/gating?
  - Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
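The claim that neurons represent conditional probabilities of the bottom-up input given a top-down hypothesis can be cartooned with Bayes' rule, where the feedback pathway supplies the prior. A minimal sketch with made-up numbers, not a rendering of any of the cited models:

```python
import numpy as np

def posterior(likelihood, prior):
    """Bayes' rule: P(H | x) is proportional to P(x | H) * P(H)."""
    p = likelihood * prior
    return p / p.sum()

# Bottom-up evidence is ambiguous between two shape hypotheses H1, H2
likelihood = np.array([0.55, 0.45])   # P(image features | H)

# Top-down context (e.g. rapid scene gist) supplies the prior via feedback
flat_prior = np.array([0.5, 0.5])
context_prior = np.array([0.2, 0.8])  # context favors H2

print(posterior(likelihood, flat_prior))     # evidence alone: H1 slightly ahead
print(posterior(likelihood, context_prior))  # context flips the decision to H2
```

The point of the cartoon: with ambiguous bottom-up evidence, the top-down prior decides the outcome, which is the intuition behind Bayesian accounts of ventral-stream feedback.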

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

"The Mathematics of Learning: Dealing with Data", Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models (limitations)
  • What is next: beyond feedforward models (limitations)
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 68: NIPS2007: visual recognition in primates and machines

Examples

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 69: NIPS2007: visual recognition in primates and machines

Examples

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 70: NIPS2007: visual recognition in primates and machines

Examples

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

with T. Masquelier & S. Thorpe (CNRS, France)

(Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005)

Simple cells learn correlations in space (at the same time); complex cells learn correlations in time.

[Movie]

Presenter notes: ~19 hours of video, ~10^6 frames; a developmental-like learning stage.
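The invariance-from-temporal-continuity idea above can be sketched with Foldiak's (1991) trace rule: a postsynaptic trace smooths the response over time, so inputs that follow each other in a sequence get wired to the same output unit. Everything below (network size, learning rate, trace decay, the "sweeping bar" stimulus) is a toy assumption, not the tutorial's actual simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# 8 "simple cell" inputs feeding one "complex cell" output trained
# with a trace rule: the trace y_trace carries activity across frames,
# so temporally adjacent stimuli strengthen the same weights.
n_in, eta, decay = 8, 0.1, 0.8
w = rng.uniform(0.0, 0.1, n_in)
y_trace = 0.0

# A "sequence": a bar sweeping across the 8 input positions, shown
# repeatedly (a stand-in for the video frames mentioned above).
for epoch in range(50):
    for pos in range(n_in):
        x = np.zeros(n_in)
        x[pos] = 1.0                      # stimulus at this frame
        y = float(w @ x)                  # postsynaptic response
        y_trace = decay * y_trace + (1 - decay) * y
        w += eta * y_trace * (x - w)      # trace-modified Hebbian update
    y_trace = 0.0                         # reset between presentations

# After training, every position drives the unit: a position-tolerant
# ("complex-cell"-like) response profile.
responses = w  # response to a unit stimulus at each position
```

Because the trace links each position to its temporal neighbors, the weights are pulled toward a common positive value rather than collapsing onto a single position.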


The problem: training videos and testing videos, each ~4 s long (50–100 frames)

Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (%):

              Baseline   Our system
KTH Human       81.3        91.6
UCSD Mice       75.6        79.0
Weiz. Human     86.7        96.3
Average         81.2        89.6

Chance: 10–20%. (H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007)

Presenter notes: our method performs ~8% better on average.

What is next: beyond feedforward models — limitations

[Figure: psychophysics, V4 and IT evidence — Serre, Oliva & Poggio 2007; Zoccolan, Kouh, Poggio & DiCarlo 2007; Reynolds, Chelazzi & Desimone 1999]

• Recognition in clutter is increasingly difficult
• Need for an attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther & Serre)
• Notice: this is a "novel" justification for the need of attention

[Figure: model vs. human performance for 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and the no-mask condition (Serre, Oliva and Poggio, PNAS 2007)]

Limitations: beyond ~50 ms, the model is not good enough

Presenter notes: false-alarm rate ~16% for humans (except ~14% in the no-mask condition) vs. ~18% for the model.

Ongoing work…

Attention and cortical feedbacks: a model implementation of Wolfe's guided search (1994). Parallel (feature-based top-down attention) and serial (spatial attention) mechanisms suppress clutter (Tsotsos et al.).

(Chikkerur, Serre, Walther, Koch & Poggio, in prep.)

[Figure: face detection, scanning vs. attention — detection accuracy as a function of the number of items]

Example Results
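The guided-search scheme described above can be sketched in a few lines: a parallel, feature-based pass computes a top-down activation map, and a serial pass then visits locations in decreasing activation order. The item count, feature dimension and dot-product similarity below are toy assumptions, not the actual Chikkerur et al. implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Display of 20 items, each a 4-dimensional feature vector; the target
# template is a noisy copy of one item's features (a top-down cue).
n_items, n_features = 20, 4
items = rng.normal(size=(n_items, n_features))
target_idx = 3
target = items[target_idx] + 0.1 * rng.normal(size=n_features)

# Parallel stage: feature-based guidance = match to the target template.
activation = items @ target

# Serial stage: attend items from most to least active and count how
# many "fixations" it takes to reach the target.
order = np.argsort(-activation)
fixations = int(np.where(order == target_idx)[0][0]) + 1
```

When guidance is informative, the target tends to be among the first items visited, which is the point of combining the parallel and serial stages.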

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models: analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream. Neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

(Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005)
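The computation such models ascribe to the ventral stream can be made concrete with a toy two-level generative model: a hypothesis H generates an intermediate feature F, which generates the evidence E, and lower-level units can be read as carrying P(E | H). All probabilities below are made up for illustration:

```python
import numpy as np

# Two hypotheses H, two intermediate feature values F, one observed E.
p_h = np.array([0.5, 0.5])               # prior over hypotheses
p_f_given_h = np.array([[0.9, 0.1],      # P(F | H): rows = H, cols = F
                        [0.2, 0.8]])
p_e_given_f = np.array([0.7, 0.3])       # P(E = observed | F), per F

# Top-down prediction: likelihood of the observed input under each H,
# marginalizing over the intermediate feature.
p_e_given_h = p_f_given_h @ p_e_given_f

# Combining bottom-up evidence with the top-down prior gives a
# globally consistent posterior over hypotheses.
posterior = p_h * p_e_given_h
posterior /= posterior.sum()
```

In an analysis-by-synthesis loop, this likelihood/posterior exchange would be iterated between levels until the beliefs stop changing.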

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity: thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007
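The shared-dictionary argument above can be illustrated with a toy sketch: one fixed low-level feature layer reused by several task-specific readouts, mirroring generic V1-to-IT features with supervised circuits only at the top. The sizes, the random-template "dictionary", and the two readouts are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# One shared front end: random templates plus rectification, computed
# once per input regardless of how many tasks consume it.
d_in, d_feat = 16, 32
dictionary = rng.normal(size=(d_feat, d_in))     # shared templates

def features(x):
    """Shared feature layer: template match followed by rectification."""
    return np.maximum(dictionary @ x, 0.0)

# Two tasks reuse the same front end; only the linear readout differs
# (e.g. a hypothetical "animal vs. not" and "face vs. not").
w_task_a = rng.normal(size=d_feat)
w_task_b = rng.normal(size=d_feat)

x = rng.normal(size=d_in)                        # one input "image"
f = features(x)                                  # computed once, shared
score_a, score_b = float(w_task_a @ f), float(w_task_b @ f)
```

The point of the sketch is only structural: each new task adds a cheap readout on top of the shared dictionary instead of learning its own features from scratch, which is one way the hierarchy could reduce sample complexity.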

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences…

(Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark: there is still much to do before we understand vision… and the brain.

Collaborators

T. Serre

Comparison with humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison with neural data
  • Comparison with V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement with IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models, limitations
  • What is next: beyond feedforward models, limitations
  • Limitations: beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 71: NIPS2007: visual recognition in primates and machines

Examples

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 72: NIPS2007: visual recognition in primates and machines

bull

HoG (Dalal amp Triggs 2005)

bull

Part-based system (Leibe et al 2004)

bull

Local patch correlation (Torralba et al 2004)

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

Serre Wolf Bileschi

Riesenhuber amp Poggio PAMI 2007

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10-year-old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophysical implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models' limitations
  • What is next: beyond feedforward models' limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 73: NIPS2007: visual recognition in primates and machines

Serre, Wolf, Bileschi, Riesenhuber & Poggio, PAMI 2007

1. Problem of visual recognition; visual cortex
2. Historical background
3. Neurons and areas in the visual system
4. Data and feedforward hierarchical models
5. What is next

What is next

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

Simple cells learn correlation in space (at the same time)
Complex cells learn correlation in time

[Movie shown here]

Presenter notes: ~19 hours of video, ~10^6 frames; developmental-like learning stage
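The temporal-continuity idea above (a trace rule in the spirit of Foldiak 1991) can be sketched in a few lines. Everything here is illustrative, not the model's actual implementation: a 1-D "retina", identity simple cells, and made-up learning-rate and trace-decay parameters.

```python
import numpy as np

# Toy "video": a feature drifting across a 1-D retina over time, so
# temporally adjacent frames contain the same feature at nearby positions.
def frame(pos, size=12):
    x = np.zeros(size)
    x[pos % size] = 1.0
    return x

# Simple-cell stage: one unit per position (stands in for features
# learned from spatial correlations within a single frame).
def simple_responses(x):
    return x

# Complex-cell stage: one unit pools simple cells via weights w, updated
# with a trace rule -- a running average of postsynaptic activity, so
# inputs active in *successive* frames get wired to the same complex
# unit (invariance learned from temporal continuity).
w = np.zeros(12)
trace = 0.0
eta, delta = 0.1, 0.8              # illustrative learning rate / trace decay
for t in range(40):                # the feature sweeps across the retina
    s = simple_responses(frame(t))
    y = float(w @ s)
    trace = delta * trace + (1 - delta) * max(y, 1e-3)  # temporal trace
    w += eta * trace * s           # Hebb on the trace, not the instant y
    w /= max(np.linalg.norm(w), 1e-9)

# After training, the complex unit responds at every position:
responses = [float(w @ simple_responses(frame(p))) for p in range(12)]
print(min(responses) > 0)          # position-tolerant unit
```

The key design choice is that the Hebbian update uses the decaying trace rather than the instantaneous response, so activity lingering from the previous frame strengthens weights to the feature's new position.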

The problem: training videos → testing videos
Each video ~4 s, 50–100 frames
Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream
Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy (% correct; chance: 10–20%)

              Baseline   Our system
KTH Human       81.3        91.6
UCSD Mice       75.6        79.0
Weiz. Human     86.7        96.3
Average         81.2        89.6

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter notes: our method has ~8% better performance on average

What is next: beyond feedforward models' limitations

Psychophysics: Serre, Oliva & Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

What is next: beyond feedforward models' limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel", model-based justification for the need of attention

20 ms SOA (ISI = 0 ms)
50 ms SOA (ISI = 30 ms)
80 ms SOA (ISI = 60 ms)
no-mask condition

(Serre, Oliva and Poggio, PNAS 2007)

Limitations: beyond 50 ms, model not good enough

Presenter notes: false-alarm rate ~16% for humans (except no-mask, ~14%) and ~18% for the model

Ongoing work…

Attention and cortical feedbacks
• Model implementation of Wolfe's guided search (1994)
• Parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Plot: detection accuracy vs. number of items; face detection: scanning vs. attention]

Example Results
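The parallel-then-serial scheme described above can be sketched as a toy Wolfe-style guided search. The two features, the item values, and the verification threshold are invented for illustration only:

```python
import numpy as np

# Toy display: items with 2-D features (say, redness and verticality).
target_feat = np.array([1.0, 0.0])   # top-down template: red, horizontal
items = np.array([
    [0.2, 0.8],    # distractors sharing some features with the target
    [0.1, 0.1],
    [0.4, 0.9],
    [0.95, 0.05],  # the target
    [0.3, 0.7],
])

# Parallel stage (feature-based top-down attention): weight each item's
# bottom-up response by the target's features -> one saliency per item.
saliency = items @ target_feat

# Serial stage (spatial attention): visit locations in decreasing
# saliency, running a costly verification only at the attended item.
visits, found = 0, None
for idx in np.argsort(-saliency):
    visits += 1
    if np.linalg.norm(items[idx] - target_feat) < 0.2:
        found = idx
        break

print(found, visits)   # -> 3 1: the target is attended first
```

The point of the two stages is that the cheap parallel saliency map orders the expensive serial verification, so the number of attended items stays small even as display size grows.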

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
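In the simplest one-level case, the claim above (units carrying P(input | hypothesis), combined into globally consistent values) reduces to Bayes' rule. The hypotheses, prior, and likelihoods below are made-up numbers for illustration:

```python
import numpy as np

# Two top-down hypotheses about the scene and one coarse bottom-up
# measurement. One "area" holds P(feature | hypothesis); combining it
# with the prior yields a consistent posterior over hypotheses.
hypotheses = ["animal", "no-animal"]
prior = np.array([0.3, 0.7])

# Illustrative likelihoods of a mid-level feature (say, "fur-like
# texture present") under each hypothesis:
likelihood = np.array([0.8, 0.1])

posterior = prior * likelihood
posterior /= posterior.sum()

print(dict(zip(hypotheses, np.round(posterior, 3))))
```

In the hierarchical versions cited above, this product is computed with messages passed between areas (e.g. belief propagation), but the fixed point each unit converges to is still a conditional probability of this form.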

What is next

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003
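The feature-sharing argument in the excerpt can be made concrete: a fixed low-level dictionary (random here, standing in for one learned unsupervised) is reused by several task-specific readouts, so each new task only needs a small readout to be trained. This is an illustrative toy under those assumptions, not the derived-distance construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed "generic dictionary" of low-level features, shared across tasks
# (in the real model: learned once, unsupervised, from V1 to IT).
dictionary = rng.normal(size=(16, 64))      # 16 templates on 64-pixel patches

def features(x):
    return np.maximum(dictionary @ x, 0.0)  # shared nonlinear feature stage

# Task-specific circuit: only a 16-weight linear readout is learned per
# task (ridge regression), on top of the shared features.
def train_readout(X, y, lam=1e-2):
    F = np.array([features(x) for x in X])
    return np.linalg.solve(F.T @ F + lam * np.eye(16), F.T @ y)

# Toy data: two different supervised tasks defined on the same inputs.
X = rng.normal(size=(200, 64))
F = np.maximum(X @ dictionary.T, 0.0)
w_true_a = rng.normal(size=16)
w_true_b = rng.normal(size=16)
y_a, y_b = F @ w_true_a, F @ w_true_b

w_a = train_readout(X, y_a)
w_b = train_readout(X, y_b)
print(np.allclose(w_a, w_true_a, atol=1e-1),
      np.allclose(w_b, w_true_b, atol=1e-1))
```

Because the dictionary is shared, adding a task costs only the 16 readout parameters rather than a full pipeline, which is one way the hierarchy could lower per-task sample complexity.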

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences … (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

T. Serre

Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 74: NIPS2007: visual recognition in primates and machines

1

Problem of visual recognition visual cortex2

Historical background

3

Neurons and areas in the visual system4

Data and feedforward

hierarchical models

5

What is next

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 75: NIPS2007: visual recognition in primates and machines

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next: beyond feedforward models: limitations
  • What is next: beyond feedforward models: limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 76: NIPS2007: visual recognition in primates and machines

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

Layers of cortical processing units

• Generic dictionary of shape components (from V1 to IT)
  – Unsupervised learning during a developmental-like stage: learning dictionaries of "templates" at different S levels
• Task-specific circuits (from IT to PFC)
  – Supervised learning

Learning the invariance from temporal continuity

w/ T. Masquelier & S. Thorpe (CNRS, France)

Foldiak 1991; Perrett et al. 1984; Wallis & Rolls 1997; Einhauser et al. 2002; Wiskott & Sejnowski 2002; Spratling 2005

• Simple cells learn correlation in space (at the same time)
• Complex cells learn correlation in time

[Movie shown in the talk]

Presenter notes: ~19 hours of video, ~10^6 frames; developmental-like learning stage.
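The temporal-continuity idea (complex cells wiring together inputs that follow each other in time) is often sketched as a Foldiak-style trace rule. The toy update below only illustrates that principle, not the Masquelier & Thorpe model: the learning rate, trace constant, and drifting one-hot "stimulus" are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs = 16
eta, delta = 0.05, 0.8                     # assumed learning rate and trace constant

w = rng.uniform(0.0, 0.1, size=n_inputs)   # complex-cell weights (start small, positive)
y_trace = 0.0                              # slowly varying activity trace

def trace_rule_step(x, w, y_trace):
    """One Foldiak (1991)-style update: the activity trace links inputs that
    occur close together in time, so the weights become invariant to
    transformations that unfold over a sequence."""
    y = float(w @ x)                           # simple linear response
    y_trace = delta * y_trace + (1 - delta) * y
    w = w + eta * y_trace * (x - w * y_trace)  # Oja-like decay keeps w bounded
    return w, y_trace

base = np.zeros(n_inputs)
base[0] = 1.0

# a "temporal sequence": the same pattern drifting across 4 positions
for t in range(200):
    x = np.roll(base, t % 4)
    w, y_trace = trace_rule_step(x, w, y_trace)

# after training, the cell responds to all 4 shifted versions of the pattern
responses = [float(w @ np.roll(base, k)) for k in range(4)]
print(np.round(responses, 2))
```

After training on the drifting sequence the unit responds to every shifted version of the pattern, i.e. it has acquired a toy form of translation invariance.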

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: Training videos / Testing videos

Each video: ~4 s, 50–100 frames. Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005)

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

                Baseline    Our system
KTH Human       81.3        91.6
UCSD Mice       75.6        79.0
Weiz. Human     86.7        96.3
Average         81.2        89.6

(chance: 10–20%)

H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007

Presenter notes: Our method has ~8% better performance on average.
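The presenter note's "~8%" figure can be checked directly from the per-dataset rows of the table; a quick sketch (numbers copied from the slide):

```python
# Accuracies (%) as reported in the table above
# (Jhuang, Serre, Wolf & Poggio, ICCV 2007); tuples are (baseline, our system)
results = {
    "KTH Human":   (81.3, 91.6),
    "UCSD Mice":   (75.6, 79.0),
    "Weiz. Human": (86.7, 96.3),
}

gains = {name: round(ours - base, 1) for name, (base, ours) in results.items()}
mean_gain = round(sum(gains.values()) / len(gains), 1)

print(gains)       # per-dataset improvement in percentage points
print(mean_gain)   # 7.8, i.e. the "~8% better on average" in the presenter note
```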

What is next: beyond feedforward models: limitations

Psychophysics: Serre, Oliva & Poggio 2007
V4: Reynolds, Chelazzi & Desimone 1999
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007

What is next: beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, model not good enough

Conditions: 20 ms SOA (ISI = 0 ms); 50 ms SOA (ISI = 30 ms); 80 ms SOA (ISI = 60 ms); no-mask condition.

(Serre, Oliva and Poggio, PNAS 2007)

Presenter notes: False-alarm rates ~16% for humans (except no-mask, with 14%) and ~18% for the model.

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.
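The parallel and serial stages just described can be sketched in a few lines. This is a toy illustration of guided search, not the Chikkerur et al. implementation: the grid, feature channels, and top-down weights are all invented.

```python
import numpy as np

rng = np.random.default_rng(1)
h, w = 8, 8

# bottom-up feature maps (toy random responses per feature channel)
feature_maps = {
    "red":      rng.random((h, w)) * 0.2,
    "vertical": rng.random((h, w)) * 0.2,
}
feature_maps["red"][2, 5] = 1.0        # the target is strongly red
feature_maps["vertical"][6, 1] = 0.9   # a distractor is strongly vertical

top_down = {"red": 1.0, "vertical": 0.1}   # task set: "look for something red"

# parallel stage: feature-based top-down weighting of the maps
activation = sum(top_down[f] * m for f, m in feature_maps.items())

def serial_fixations(act, n):
    """Serial spatial attention: visit the n most active locations one at a
    time, suppressing each visited peak (inhibition of return)."""
    act = act.copy()
    order = []
    for _ in range(n):
        ij = np.unravel_index(np.argmax(act), act.shape)
        order.append((int(ij[0]), int(ij[1])))
        act[ij] = -np.inf
    return order

fixations = serial_fixations(activation, 3)
print(fixations)   # the red target at (2, 5) wins the first fixation
```

Because the task set up-weights "red", the red target dominates the activation map and is fixated before the strongly vertical distractor, which is the basic clutter-suppression effect the slide refers to.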

Example Results

[Plot: face detection accuracy vs. number of items, scanning vs. attention]

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models: e.g., probabilistic inference in the ventral stream; neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Lee & Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
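A minimal, concrete instance of this idea, with invented hypotheses and probabilities (the real models cited above are hierarchical and iterative; this is a single exact-Bayes sweep):

```python
import numpy as np

# Toy analysis-by-synthesis: a lower level carries P(evidence | hypothesis),
# a higher level carries a prior over hypotheses, and recognition is the
# posterior that makes the two levels globally consistent.

hypotheses = ["animal", "no_animal"]
prior = np.array([0.3, 0.7])              # top-down expectation

# bottom-up likelihoods P(feature fired | hypothesis), one row per feature
#                        animal  no_animal
likelihood = np.array([[0.9,    0.2],     # feature 0: "legs" detector
                       [0.7,    0.4]])    # feature 1: "fur texture" detector

def posterior(observed_features, prior, likelihood):
    """Exact Bayes for conditionally independent feature detectors
    (one bottom-up/top-down sweep; real models iterate between areas)."""
    p = prior.copy()
    for f in observed_features:
        p = p * likelihood[f]             # multiply in each detector's evidence
    return p / p.sum()                    # normalize: globally consistent beliefs

post = posterior([0, 1], prior, likelihood)   # both detectors fired
print(dict(zip(hypotheses, np.round(post, 3))))
```

Even with a prior favoring "no animal", two weakly informative detectors firing together push the posterior firmly toward "animal", which is the flavor of evidence combination these models attribute to the ventral stream.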

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 77: NIPS2007: visual recognition in primates and machines

bull

Generic dictionary of shape components (from V1 to IT) ndash

Unsupervised learning during a developmental-like stage learning dictionaries of ldquotemplatesrdquo

at different S levels

bull

Task-specific circuits (from IT to PFC)ndash

Supervised learning

Layers of cortical processing units

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 78: NIPS2007: visual recognition in primates and machines

Learning the invariance from temporal continuity

w| T Masquelier amp S Thorpe (CNRS France)

Foldiak

1991 Perrett et al 1984 Wallis amp Rolls 1997 Einhauser

et al 2002 Wiskott amp Sejnowski 2002 Spratling 2005

Simple cells learn correlation in space (at the same time)

Complex cells learn correlation in time

SHOW MOVIE

Presenter
Presentation Notes
~19 hours of video ~10^6 frames13developmental like learning stage

What is next

bull

A challenge for physiology disprove basic aspects of the architecture

bull

Extensions to color and stereobull

More sophisticated unsupervised developmental learning in V1 V4 PIT how

bull

Extension to time and videosbull

Extending the simulation to integrate-and-fire neurons (~ 1 billion) and realistic synapses towards the neural code

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10-year-old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: historical perspective
  • Some personal history: first step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views, as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over ~1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: a gradual increase in RF size
  • Beyond V1: a gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next? Beyond feedforward models: limitations
  • What is next? Beyond feedforward models: limitations
  • Limitations: beyond 50 ms, model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next? Image inference: backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 79: NIPS2007: visual recognition in primates and machines

What is next?

• A challenge for physiology: disprove basic aspects of the architecture
• Extensions to color and stereo
• More sophisticated unsupervised/developmental learning in V1, V4, PIT: how?
• Extension to time and videos
• Extending the simulation to integrate-and-fire neurons (~1 billion) and realistic synapses: towards the neural code

The problem: training videos and testing videos

Each video is ~4 s long, 50–100 frames. Actions: bend, jack, jump, pjump, run, walk, side, wave1, wave2.

Dataset from (Blank et al. 2005); see also (Casile & Giese 2005; Sigala et al. 2005).

Previous work: recognizing biological motion using a model of the dorsal stream

Adapted from (Giese & Poggio 2003)

Multi-class recognition accuracy

Dataset        Baseline    Our system
KTH Human      81.3%       91.6%
UCSD Mice      75.6%       79.0%
Weiz. Human    86.7%       96.3%
Average        81.2%       89.6%

Chance level: 10–20%. H. Jhuang, T. Serre, L. Wolf and T. Poggio, ICCV 2007.

Presenter note: our method performs ~8% better on average.

What is next? Beyond feedforward models: limitations

Psychophysics: Serre, Oliva & Poggio 2007
IT: Zoccolan, Kouh, Poggio & DiCarlo 2007
V4: Reynolds, Chelazzi & Desimone 1999

What is next? Beyond feedforward models: limitations

• Recognition in clutter is increasingly difficult
• Need for attentional bottleneck (Wolfe 1994), perhaps in V4 (see Gallant and Desimone, and models by Walther + Serre)
• Notice: this is a "novel" justification for the need of attention

Limitations: beyond 50 ms, model not good enough

[Figure: model vs. human performance under backward masking — 20 ms SOA (ISI = 0 ms), 50 ms SOA (ISI = 30 ms), 80 ms SOA (ISI = 60 ms), and a no-mask condition.]

(Serre, Oliva and Poggio, PNAS 2007)

Presenter note: false-alarm rates are ~16% for humans (~14% in the no-mask condition) and ~18% for the model.
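The three masking conditions are linked by simple timing bookkeeping: SOA (stimulus-onset asynchrony between target and mask) equals target duration plus inter-stimulus interval (ISI). A minimal sketch, assuming the 20 ms target duration implied by the 20 ms SOA / 0 ms ISI condition:

```python
# SOA between target onset and mask onset = target duration + ISI.
# The 20 ms target duration is inferred from the "20 ms SOA (ISI = 0 ms)" condition.
TARGET_MS = 20

def soa(isi_ms):
    """Mask onset relative to target onset, in milliseconds."""
    return TARGET_MS + isi_ms

conditions = {isi: soa(isi) for isi in (0, 30, 60)}
print(conditions)  # {0: 20, 30: 50, 60: 80}
```

Longer SOAs leave more time for recurrent processing before the mask arrives, which is why performance beyond ~50 ms SOA is where the feedforward model falls short of humans.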

Ongoing work… Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based, top-down) attention and serial (spatial) attention to suppress clutter (Tsotsos et al.).

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

Example Results

Face detection, scanning vs. attention: [plot of detection accuracy vs. number of items]
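The parallel/serial scheme above can be caricatured in a few lines: a top-down weighted combination of feature maps (the parallel, feature-based stage) followed by serial visits to its peaks with inhibition of return (the spatial stage). This is only a toy sketch — the maps, weights, and function names are invented for illustration and are not the Chikkerur et al. implementation:

```python
import numpy as np

def guided_search(feature_maps, target_weights, n_fixations=3):
    """Toy guided search: a top-down weighted sum of feature maps forms an
    activation map; serial 'fixations' visit its peaks in decreasing order."""
    activation = sum(w * feature_maps[name] for name, w in target_weights.items())
    act = activation.copy()
    fixations = []
    for _ in range(n_fixations):
        idx = np.unravel_index(np.argmax(act), act.shape)
        fixations.append((int(idx[0]), int(idx[1])))
        act[idx] = -np.inf  # inhibition of return: never revisit a peak
    return fixations

# Tiny example: a "red" map with a strong spot and a "vertical" map with a weak one.
red = np.zeros((4, 4)); red[1, 2] = 1.0
vert = np.zeros((4, 4)); vert[3, 0] = 0.5
fix_order = guided_search({"red": red, "vertical": vert},
                          {"red": 1.0, "vertical": 0.2})
print(fix_order)  # [(1, 2), (3, 0), (0, 0)]
```

The design point is that the top-down weights do the clutter suppression in parallel, so the serial stage only has to scan a few candidate locations rather than the whole image.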

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways

What is next? Image inference: backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding / inference / parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
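At its smallest scale, the analysis-by-synthesis idea is Bayesian inversion of a generative model: a top-down hypothesis assigns likelihoods to the bottom-up evidence, and the posterior re-weights the hypotheses. A minimal sketch with made-up numbers — this illustrates only the inversion step, not the hierarchical message passing of Lee and Mumford:

```python
# Top-down hypotheses H generate bottom-up evidence E; inference inverts the
# generative model with Bayes' rule. All numbers here are invented for illustration.
prior = {"face": 0.3, "non-face": 0.7}
# Likelihood of observing an eye-like local feature under each hypothesis.
likelihood = {"face": 0.8, "non-face": 0.1}

def posterior(evidence_likelihood, prior):
    """P(H | E) ∝ P(E | H) · P(H), normalized over hypotheses."""
    unnorm = {h: evidence_likelihood[h] * prior[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

post = posterior(likelihood, prior)
print(post)  # face ≈ 0.774, non-face ≈ 0.226
```

In the full hierarchical picture, each area would repeat this computation with the posterior of the area above serving as its prior, iterating until the levels agree.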

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model: towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

"How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples. A comparison with real brains offers another related challenge to learning theory. The 'learning algorithms' we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory? Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks. There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex."

From "The Mathematics of Learning: Dealing with Data", Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003.
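The "shared dictionary" point can be made concrete with a toy example: two tasks reuse one first layer of filters, so each task-specific readout lives in a much smaller space than the raw input. The filters and readouts below are random placeholders, purely for illustration — nothing here is learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared "dictionary": 16 fixed filters over a 64-dim input,
# reused as a common first layer by every downstream task.
shared_dictionary = rng.standard_normal((16, 64))

def features(x):
    """First layer shared across tasks: rectified filter responses."""
    return np.maximum(shared_dictionary @ x, 0.0)

# Each task only needs to learn a 16-dim readout over the shared features,
# rather than a full model over the 64 raw input dimensions.
readout_animal = rng.standard_normal(16)   # task 1: "animal present?"
readout_vehicle = rng.standard_normal(16)  # task 2: "vehicle present?"

x = rng.standard_normal(64)
print(float(readout_animal @ features(x)), float(readout_vehicle @ features(x)))
```

The sample-complexity argument is that only the small readouts are task-specific, so each new task needs few examples — consistent with the paragraph's suggestion about why cortex might be hierarchical.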

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences… (Caponnetto, Smale and Poggio, in preparation)

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber
Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 80: NIPS2007: visual recognition in primates and machines

The problemTraining Videos Testing videos

each video~4s 50~100 frames

bend jack jump

pjump run walk

side wave1 wave2

Dataset from (Blank et al 2005)

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 81: NIPS2007: visual recognition in primates and machines

See also (Casile

amp Giese 2005 Sigala

et al 2005)

Previous work recognizing biological motion

using a model of the dorsal stream

Adapted from (Giese amp Poggio 2003)

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 82: NIPS2007: visual recognition in primates and machines

Baseline Our system

KTH Human 813 916

UCSD Mice 756 790

Weiz Human 867 963

Average 812 896

chances 10~20HH Jhuang T Serre L Wolf and T Poggio ICCV 2007

Multi-class recognition accuracy

Presenter
Presentation Notes
Our method has 8 better performance on average

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better
• Normal vision is much more than categorization or identification: image understanding/inference/parsing
• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance
• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:
  – Which biophysical mechanisms for routing/gating?
  – Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
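A minimal, made-up illustration of that idea (toy numbers, not the tutorial's actual models): a higher area supplies a top-down prior over hypotheses, a lower area supplies likelihoods of its bottom-up input under each hypothesis, and Bayes' rule combines them into a consistent belief.

```python
import numpy as np

def posterior(prior, likelihood):
    """Combine a top-down prior with a bottom-up likelihood (Bayes' rule)."""
    p = prior * likelihood        # elementwise product over hypotheses
    return p / p.sum()            # normalize to a probability vector

# Hypotheses: "animal present" vs "no animal" (invented numbers).
prior = np.array([0.5, 0.5])           # top-down expectation
likelihood = np.array([0.8, 0.2])      # P(bottom-up input | hypothesis)
belief = posterior(prior, likelihood)  # -> [0.8, 0.2]

# Feedback sharpens the belief: a context cue shifts the prior.
belief2 = posterior(np.array([0.9, 0.1]), likelihood)
```

In the Lee–Mumford picture this product-and-normalize step is repeated between adjacent areas until beliefs stop changing; the single step above is just the core computation.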

What is next?

• Image inference: attentional or Bayesian models
• Why hierarchies? Beyond a model, towards a theory
• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another, related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

"The Mathematics of Learning: Dealing with Data", Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537–544, 2003
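The "shared dictionary" point can be made concrete with a toy sketch (entirely illustrative; fixed random features stand in for a learned low-level dictionary): one shared feature layer serves two different classification tasks, each of which trains only a small task-specific readout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared lower layer: a fixed dictionary of nonlinear features, reused
# across tasks (a stand-in for shared low-level visual features).
W = rng.normal(size=(20, 5))                 # 5 inputs -> 20 features

def feats(X):
    return np.maximum(X @ W.T, 0.0)          # shared ReLU features

def fit_readout(X, y):
    """Train only a task-specific linear readout on the shared features."""
    w, *_ = np.linalg.lstsq(feats(X), y, rcond=None)
    return w

X = rng.normal(size=(200, 5))
task_a = (X[:, 0] > 0).astype(float)         # task A: sign of input 1
task_b = (X[:, 1] > 0).astype(float)         # task B: sign of input 2
w_a, w_b = fit_readout(X, task_a), fit_readout(X, task_b)
# Each new task costs only a 20-parameter readout; the dictionary is shared.
```

This is the sample-complexity intuition in miniature: once the dictionary exists, a new task needs only enough examples to fit its readout, not the whole feature hierarchy.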

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. "Derived Distance: towards a mathematical theory of visual cortex." CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences …

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision: trying to understand how the brain works
  • This tutorial: using a class of models to summarize/interpret experimental results
  • Slide Number 4
  • The problem: recognition in natural images (e.g. "is there an animal in the image?")
  • How does visual cortex solve this problem? How can computers solve this problem?
  • A "feedforward" version of the problem: rapid categorization
  • A model of the ventral stream which is also an algorithm
  • …solves the problem (if mask forces feedforward processing)…
  • Slide Number 10
  • Object recognition for computer vision: personal historical perspective
  • Examples: Learning Object Detection: Finding Frontal Faces
  • ~10 year old CBCL computer vision work: pedestrian detection system in Mercedes test car, now becoming a product (MobilEye)
  • Object recognition in cortex: Historical perspective
  • Some personal history: First step in developing a model: learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A "View-Tuned" IT Cell
  • But also view-invariant, object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells: scale invariance (one training view only) motivates present model
  • From "HMAX" to the model now…
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1: hierarchy of simple and complex cells
  • V1: Orientation selectivity
  • V1: Retinotopy
  • Slide Number 34
  • Beyond V1: A gradual increase in RF size
  • Beyond V1: A gradual increase in the complexity of the preferred stimulus
  • AIT: Face cells
  • AIT: Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream: feedforward, accounting only for "immediate recognition"
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys. implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w/ neural data
  • Comparison w/ V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w/ IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • …solves the problem (when mask forces feedforward processing)…
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next?
  • What is next?
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next?
  • The problem
  • Previous work: recognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next? Beyond feedforward models: limitations
  • What is next? Beyond feedforward models: limitations
  • Limitations: beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next?
  • What is next: image inference, backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next?
  • Slide Number 94
  • Formalizing the hierarchy: towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 83: NIPS2007: visual recognition in primates and machines

What is next beyond feedforward models limitations

Serre Oliva Poggio 2007

Zoccolan

Kouh

Poggio DiCarlo 2007

Reynolds Chelazzi amp Desimone 1999

PsychophysicsV4

IT

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 84: NIPS2007: visual recognition in primates and machines

What is next beyond feedforward models

limitations

bull

Recognition in clutter is increasingly difficultbull

Need for attentional bottleneck (Wolfe 1994) perhaps in V4 (see Gallant and Desimone and models by Walther + Serre)

bull

Notice this is a ldquonovelrdquo

justification for the need of attention

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 85: NIPS2007: visual recognition in primates and machines

model

20 ms SOA (ISI=0 ms)

80 ms SOA (ISI=60 ms)

50 ms SOA (ISI=30 ms)

no mask condition

(Serre Oliva and Poggio PNAS 2007)

Limitations beyond 50 ms model not good enough

Presenter
Presentation Notes
FA ~16 for humans (except no mask with 14) and the model ~1813

Ongoing workhellip

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 86: NIPS2007: visual recognition in primates and machines

Ongoing work…

Attention and cortical feedbacks

Model implementation of Wolfe's guided search (1994): parallel (feature-based top-down attention) and serial (spatial attention) to suppress clutter (Tsotsos et al.)

Chikkerur, Serre, Walther, Koch & Poggio, in prep.

[Figure: face detection, scanning vs. attention; detection accuracy as a function of the number of items]
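The parallel-plus-serial scheme above can be sketched in a few lines. This is a toy illustration only, not the Chikkerur et al. implementation: the function `guided_search`, the feature-map names, and all weights below are invented for the example.

```python
import numpy as np

def guided_search(feature_maps, feature_weights, n_fixations=3, inhibition_radius=2):
    """Toy guided search: a parallel, top-down-weighted saliency map,
    then serial selection of peaks with inhibition of return."""
    # Parallel stage: combine bottom-up feature maps under top-down task weights.
    saliency = sum(feature_weights[k] * feature_maps[k] for k in feature_maps)
    saliency = saliency.astype(float)

    # Serial stage: repeatedly attend the most salient location, then
    # suppress its neighborhood (inhibition of return) to move on.
    fixations = []
    for _ in range(n_fixations):
        r, c = np.unravel_index(np.argmax(saliency), saliency.shape)
        fixations.append((int(r), int(c)))
        saliency[max(0, r - inhibition_radius):r + inhibition_radius + 1,
                 max(0, c - inhibition_radius):c + inhibition_radius + 1] = -np.inf
    return fixations

# A "find the red item" task: weighting the red map steers the first
# fixation to the red target rather than the (vertical) clutter.
red = np.zeros((8, 8)); red[2, 5] = 1.0
vert = np.zeros((8, 8)); vert[6, 1] = 0.8
print(guided_search({"red": red, "vertical": vert},
                    {"red": 2.0, "vertical": 0.5}, n_fixations=2))
# → [(2, 5), (6, 1)]
```

The point of the weighting is the one made on the slide: feature-based attention works in parallel over the whole map, while spatial attention visits candidate locations one at a time.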

Example Results

What is next

• Image inference: attentional or Bayesian models

• Why hierarchies? Beyond a model, towards a theory

• Against hierarchies and the ventral stream: subcortical pathways

What is next: image inference, backprojections and attentional mechanisms

• Normal recognition by humans (for long times) is much better

• Normal vision is much more than categorization or identification: image understanding/inference/parsing

• Feedforward model + backprojections implementing featural and spatial attention may improve recognition performance

• Backprojections also access/route information in/from lower areas to specific task-dependent routines in PFC (?). Open questions:

– Which biophysical mechanisms for routing/gating?

– Nature of routines in higher areas (e.g. PFC)?

Attention-based models with high-level specialized routines

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values

Bayesian models

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
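A minimal numerical sketch of that idea: units hold conditional probabilities of bottom-up cues given a top-down hypothesis, and Bayes' rule combines them into one consistent posterior. Everything here (the hypotheses, cues, and probability tables) is invented for illustration and is far simpler than the models cited above.

```python
import numpy as np

hypotheses = ["face", "car", "clutter"]
prior = np.array([0.2, 0.2, 0.6])          # top-down prior P(H)

# P(cue | H): rows follow `hypotheses`; columns = cue present / absent.
p_eyes = np.array([[0.9, 0.1],             # faces usually show eye-like features
                   [0.1, 0.9],
                   [0.2, 0.8]])
p_wheels = np.array([[0.05, 0.95],
                     [0.9, 0.1],           # cars usually show wheel-like features
                     [0.1, 0.9]])

def infer(eyes_seen, wheels_seen):
    """Posterior P(H | cues): prior times the product of cue likelihoods,
    assuming the cues are conditionally independent given H."""
    like = (p_eyes[:, 0 if eyes_seen else 1]
            * p_wheels[:, 0 if wheels_seen else 1])
    post = prior * like
    return post / post.sum()               # normalize: globally consistent beliefs

post = infer(eyes_seen=True, wheels_seen=False)
print(dict(zip(hypotheses, post.round(3))))  # "face" wins given eyes but no wheels
```

The top-down hypothesis enters through the prior and the generative tables; the bottom-up evidence enters through the observed cues, matching the slide's description of the two directions of flow.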

What is next

• Image inference: attentional or Bayesian models

• Why hierarchies? Beyond a model, towards a theory

• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data, Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003
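The shared-dictionary point can be made concrete with a toy two-layer model: one fixed nonlinear feature layer reused by every task, plus a small per-task linear readout, so each new task has few parameters to learn. This setup (random features, `train_readout`, ridge regularization) is a hypothetical illustration, not the construction in the paper above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_feat = 20, 50

# Shared lower level: one fixed dictionary of random nonlinear features.
W_shared = rng.standard_normal((n_in, n_feat)) / np.sqrt(n_in)

def features(x):
    return np.tanh(x @ W_shared)

def train_readout(x, y, reg=1e-2):
    """Per-task top layer: a regularized least-squares readout on the
    shared features; only n_feat numbers are learned per new task."""
    f = features(x)
    return np.linalg.solve(f.T @ f + reg * np.eye(n_feat), f.T @ y)

def predict(x, w):
    return np.sign(features(x) @ w)

# Two different classification tasks reuse the same dictionary.
x = rng.standard_normal((100, n_in))
y1 = np.sign(x[:, 0])                      # toy task 1
y2 = np.sign(x[:, 1])                      # toy task 2
w1, w2 = train_readout(x, y1), train_readout(x, y2)
print((predict(x, w1) == y1).mean(), (predict(x, w2) == y2).mean())
```

Each extra task costs only a 50-number readout while the 20x50 dictionary is shared, which is one way a hierarchy could reduce per-task sample complexity.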

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences …

Caponnetto, Smale and Poggio, in preparation

(Obvious) caution remark

There is still much to do before we understand vision… and the brain

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
T. Serre
Model: A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 87: NIPS2007: visual recognition in primates and machines

Attention and cortical

feedbacks

Model implementation of Wolfersquos guided search (1994)

Parallel (feature-based top- down attention) and serial

(spatial attention) to suppress clutter (Tsostos et al)

Chikkerur

Serre Walther Koch amp Poggio in prep

num items

dete

ctio

n ac

urac

y

Face detection scanning vs attention

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 88: NIPS2007: visual recognition in primates and machines

Example Results

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 89: NIPS2007: visual recognition in primates and machines

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Analysis-by-synthesis models eg

probabilistic inference in the ventral stream neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis and converge to globally consistent values

Bayesian models

Lee and Mumford 2003 Dean2005 Rao 2004 Hawkins 2004 Ullman 2007 Hinton 2005

What is next

bull

Image inference attentional

or Bayesian models

bull

Why hierarchies Beyond a model towards a theory

bull

Against hierarchies and the ventral stream subcortical

pathways (Bar et al 2006 hellip)

How then do the learning machines described in the theory compare with brains

One of the most obvious differences is the ability of people and animals to learn from very few examples

A comparison with real brains offers another related challenge to learning theory The ldquolearning algorithmsrdquowe have described in this paper correspond to one-layer architectures Are hierarchical architectures with more layers justifiable in terms of learning theory

Why hierarchies For instance the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks

There may also be the more fundamental issue of sample complexity Thus our ability of learning from just a few examples and its limitations may be related to the hierarchical architecture of cortex

Notices of the American Mathematical Society (AMS) Vol 50 No 5537-544 2003

The Mathematics of Learning Dealing with DataTomaso

Poggio

and Steve Smale

Formalizing the hierarchy towards a theory

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 90: NIPS2007: visual recognition in primates and machines

What is next image inference backprojections and attentional mechanisms

bullbull

Normal recognition by humans (for long times) is Normal recognition by humans (for long times) is much much betterbetter

bullbull

Normal vision is Normal vision is much much more than categorization more than categorization or identification image or identification image understandinginferenceparsingunderstandinginferenceparsing

bullbull

FeedforwardFeedforward

model + model + backprojectionsbackprojections

implementing implementing featuralfeatural

and the and the spatial attention may improve recognition performance spatial attention may improve recognition performance

bullbull

BackprojectionsBackprojections

also accessroute information infrom lower areas to also accessroute information infrom lower areas to specific taskspecific task--dependent routines in PFC () Open questions dependent routines in PFC () Open questions

----

Which biophysical mechanisms for routinggating Which biophysical mechanisms for routinggating

----

Nature of routines in higher areas (Nature of routines in higher areas (egeg

PFC)PFC)

Attention-based models with high-level specialized routines

Bayesian models

Analysis-by-synthesis models, e.g. probabilistic inference in the ventral stream: neurons represent conditional probabilities of the bottom-up sensory inputs given the top-down hypothesis, and converge to globally consistent values.

Lee and Mumford 2003; Dean 2005; Rao 2004; Hawkins 2004; Ullman 2007; Hinton 2005
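The analysis-by-synthesis idea can be sketched as exact belief propagation on a toy two-level generative model, in the spirit of Lee and Mumford 2003 (this is an illustrative sketch, not the tutorial's actual model): a top-down hypothesis layer ("IT") and a feature layer ("V4") exchange bottom-up and top-down messages until their beliefs are globally consistent. All probabilities below are made-up numbers.

```python
# Toy sketch of hierarchical Bayesian inference on a chain H -> F -> evidence.
# Hypothetical quantities: object hypothesis H ("IT"), feature F ("V4").

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

prior_H = {"animal": 0.5, "vehicle": 0.5}             # top-down prior over objects
p_F_given_H = {                                       # generative model P(F | H)
    "animal":  {"furry": 0.8, "metallic": 0.2},
    "vehicle": {"furry": 0.1, "metallic": 0.9},
}
likelihood_F = {"furry": 0.7, "metallic": 0.3}        # bottom-up evidence for F

# Bottom-up message: how well each hypothesis explains the evidence.
msg_up = {h: sum(p_F_given_H[h][f] * likelihood_F[f] for f in likelihood_F)
          for h in prior_H}
post_H = normalize({h: prior_H[h] * msg_up[h] for h in prior_H})

# Top-down message: hypothesis-weighted prediction for the feature layer.
msg_down = {f: sum(p_F_given_H[h][f] * post_H[h] for h in post_H)
            for f in likelihood_F}
post_F = normalize({f: likelihood_F[f] * msg_down[f] for f in likelihood_F})

print(post_H)  # beliefs about the object, shaped by the evidence
print(post_F)  # feature beliefs, now globally consistent with post_H
```

Note how the top-down message sharpens the feature belief: the posterior for "furry" ends up above its raw bottom-up likelihood because the "animal" hypothesis predicts it.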

What is next

• Image inference: attentional or Bayesian models

• Why hierarchies? Beyond a model: towards a theory

• Against hierarchies and the ventral stream: subcortical pathways (Bar et al. 2006, …)

How then do the learning machines described in the theory compare with brains?

One of the most obvious differences is the ability of people and animals to learn from very few examples.

A comparison with real brains offers another related challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?

Why hierarchies? For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be the more fundamental issue of sample complexity. Thus our ability of learning from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.

The Mathematics of Learning: Dealing with Data. Tomaso Poggio and Steve Smale. Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.
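The shared-dictionary argument can be made concrete with a hypothetical sketch: one fixed feature layer is reused by two different classification tasks, so each new task only has to learn a small readout (here, class centroids from a single example per class), which is one way a hierarchy could reduce sample complexity. Templates, inputs, and labels below are invented for illustration.

```python
# Hypothetical illustration of a shared dictionary of features: a fixed
# S-layer-style template-match layer feeds two independent task readouts.

def shared_features(x, dictionary):
    # Similarity (dot product) of the input to each shared template.
    return [sum(xi * ti for xi, ti in zip(x, t)) for t in dictionary]

dictionary = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]  # made-up templates

def train_readout(examples):
    # Per-class centroid in the *shared* feature space; a single example
    # per class already suffices here -- the sample-complexity point.
    readout = {}
    for label, xs in examples.items():
        feats = [shared_features(x, dictionary) for x in xs]
        n = len(feats)
        readout[label] = [sum(col) / n for col in zip(*feats)]
    return readout

def classify(x, readout):
    f = shared_features(x, dictionary)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(f, readout[c]))
    return min(readout, key=dist)

# Two tasks reuse the same feature layer; only the readouts differ.
task_animal = train_readout({"animal": [[1, 1, 0, 0]], "not": [[0, 0, 1, 1]]})
task_face   = train_readout({"face":   [[1, 0, 1, 0]], "not": [[0, 1, 0, 1]]})

print(classify([1, 1, 0, 0], task_animal))  # -> "animal"
print(classify([1, 0, 1, 1], task_face))    # -> "face"
```

The design choice mirrors the text: the expensive part (the feature dictionary) is learned or fixed once, and each additional task adds only a cheap task-specific readout.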

Formalizing the hierarchy: towards a theory

Smale, S., T. Poggio, A. Caponnetto and J. Bouvrie. Derived Distance: towards a mathematical theory of visual cortex. CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, November 2007.

From a model to a theory: math results on unsupervised learning of invariances and of a dictionary of shapes from image sequences … Caponnetto, Smale and Poggio, in preparation.

(Obvious) caution remark

There is still much to do before we understand vision… and the brain.

Collaborators

Comparison w/ humans: A. Oliva
Action recognition: H. Jhuang
Attention: S. Chikkerur, C. Koch, D. Walther
Computer vision: S. Bileschi, L. Wolf
Learning invariances: T. Masquelier, S. Thorpe
Model: T. Serre, A. Oliva, C. Cadieu, U. Knoblich, M. Kouh, G. Kreiman, M. Riesenhuber

  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 96: NIPS2007: visual recognition in primates and machines

Smale S T Poggio A Caponnetto and J Bouvrie Derived Distance towards a mathematical theory of visual cortex

CBCL Paper Massachusetts Institute of Technology Cambridge MA November 2007

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 97: NIPS2007: visual recognition in primates and machines

From a model to a theory math results on unsupervised learning of invariances and of a dictionary of shapes

from image sequences

hellip

Caponetto Smale

and Poggio in preparation

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 98: NIPS2007: visual recognition in primates and machines

(Obvious) caution remark

There is still much to do before we understand visionhellip

and the brain

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators
Page 99: NIPS2007: visual recognition in primates and machines

Collaborators

Comparison w| humans

A Oliva

Action recognition

H Jhuang

Attention

S Chikkerur

C Koch

D Walther

Computer vision

bull S Bileschi

bull L Wolf

Learning invariances

bullT Masquelier

bullS Thorpe

T Serre

Model

A Oliva

C Cadieu

U Knoblich

M Kouh

G Kreiman

M Riesenhuber

  • Visual Recognition in Primates and Machines
  • Motivation for studying vision trying to understand how the brain works
  • This tutorial using a class of models to summarizeinterpret experimental results
  • Slide Number 4
  • The problem recognition in natural images (eg ldquois there an animal in the imagerdquo)
  • How does visual cortex solve this problemHow can computers solve this problem
  • A ldquofeedforwardrdquo version of the problemrapid categorization
  • A model of the ventral stream which is also an algorithm
  • hellipsolves the problem (if mask forces feedforward processing)hellip
  • Slide Number 10
  • Object recognition for computer vision personal historical perspective
  • Examples Learning Object Detection Finding Frontal Faces
  • ~10 year old CBCL computer vision work pedestrian detection system in Mercedes test car now becoming a product (MobilEye)
  • Object recognition in cortex Historical perspective
  • Some personal history First step in developing a model learning to recognize 3D objects in IT cortex
  • An idea for a module for view-invariant identification
  • Learning to Recognize 3D Objects in IT Cortex
  • Recording Sites in Anterior IT
  • Neurons tuned to object views as predicted by model
  • A ldquoView-Tunedrdquo IT Cell
  • But also view-invariant object-specific neurons (5 of them over 1000 recordings)
  • View-tuned cells scale invariance (one training view only) motivates present model
  • From ldquoHMAXrdquo to the model nowhellip
  • Slide Number 24
  • Neural Circuits
  • Neuron basics
  • Some numbers
  • The cerebral cortex
  • Gross Brain Anatomy
  • The Visual System
  • V1 hierarchy of simple and complex cells
  • V1 Orientation selectivity
  • V1 Retinotopy
  • Slide Number 34
  • Beyond V1 A gradual increase in RF size
  • Beyond V1 A gradual increase in the complexity of the preferred stimulus
  • AIT Face cells
  • AIT Immediate recognition
  • Slide Number 39
  • The ventral stream
  • We consider feedforward architecture only
  • Our present model of the ventral stream feedforward accounting only for ldquoimmediate recognitionrdquo
  • Two key computations
  • Slide Number 44
  • Gaussian tuning
  • Max-like operation
  • Biophys implementation
  • Slide Number 48
  • Slide Number 49
  • S2 units
  • Slide Number 51
  • C2 units
  • A loose hierarchy
  • Comparison w| neural data
  • Comparison w| V4
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Agreement w| IT Readout data
  • Remarks
  • Rapid categorization
  • Slide Number 62
  • Rapid categorization task (with mask to test feedforward model)
  • hellipsolves the problem (when mask forces feedforward processing)hellip
  • Further comparisons
  • The street scene project
  • The StreetScenes Database
  • Examples
  • Examples
  • Examples
  • Examples
  • Slide Number 72
  • Slide Number 73
  • Slide Number 74
  • What is next
  • What is next
  • Slide Number 77
  • Learning the invariance from temporal continuity
  • What is next
  • The problem
  • Previous workrecognizing biological motion using a model of the dorsal stream
  • Multi-class recognition accuracy
  • What is next beyond feedforward models limitations
  • What is next beyond feedforward models limitations
  • Limitations beyond 50 ms model not good enough
  • Slide Number 86
  • Attention and cortical feedbacks
  • Example Results
  • What is next
  • What is next image inference backprojections and attentional mechanisms
  • Attention-based models with high-level specialized routines
  • Bayesian models
  • What is next
  • Slide Number 94
  • Formalizing the hierarchy towards a theory
  • Slide Number 96
  • Slide Number 97
  • (Obvious) caution remark
  • Collaborators