Backprop, 25 years later…

Page 1: Backprop, 25 years later…

1

Backprop, 25 years later…

Garrison W. Cottrell
Gary's Unbelievable Research Unit (GURU)
Computer Science and Engineering Department
Institute for Neural Computation
UCSD

Page 2: Backprop, 25 years later…

Back propagation, 25 years later

2

But first… Hal White passed away March 31st, 2012

• Hal was “our theoretician of neural nets,” and one of the nicest guys I knew.

• His paper on “A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity” has been cited 15,805 times, and led to him being shortlisted for the Nobel Prize.

• But his paper with Max Stinchcombe: “Multilayer feedforward networks are universal approximators” is his second most-cited paper, at 8,114 cites.

Page 3: Backprop, 25 years later…

But first…

• In yet another paper (in Neural Computation, 1989), he wrote:

“The premise of this article is that learning procedures used to train artificial neural networks are inherently statistical techniques. It follows that statistical theory can provide considerable insight into the properties, advantages, and disadvantages of different network learning methods…”

This was one of the first papers to make the connection between neural networks and statistical models - and thereby put them on a sound statistical foundation.

Page 4: Backprop, 25 years later…

We should also remember… Dave E. Rumelhart passed away on March 13, 2011

• Many had invented back propagation; few could appreciate as deeply as Dave did what they had when they discovered it.

Page 5: Backprop, 25 years later…

What is backpropagation, and why is/was it important?

• We have billions and billions of neurons that somehow work together to create the mind.

• These neurons are connected by 10^14 - 10^15 synapses, which we think encode the "knowledge" in the network - too many for us to explicitly program them in our models.

• Rather, we need some way to indirectly set them via a procedure that will achieve some goal by changing the synaptic strengths (which we call weights).

• This is called learning in these systems.

Page 6: Backprop, 25 years later…

Learning: A bit of history

• Frank Rosenblatt studied a simple version of a neural net called a perceptron:
  • A single layer of processing
  • Binary output
  • Can compute simple things like (some) boolean functions (OR, AND, etc.)

Page 7: Backprop, 25 years later…

Learning: A bit of history

[Diagram: a perceptron - inputs feeding a net input unit that produces a binary output]

Page 8: Backprop, 25 years later…


Learning: A bit of history

Page 9: Backprop, 25 years later…


Learning: A bit of history

• Rosenblatt (1962) discovered a learning rule for perceptrons called the perceptron convergence procedure.

• Guaranteed to learn anything computable (by a two-layer perceptron)

• Unfortunately, not everything was computable (Minsky & Papert, 1969)

Page 10: Backprop, 25 years later…

Perceptron Learning Demonstration

• Output activation rule:
  • First, compute the net input to the output unit: net = Σ_i w_i x_i
  • Then, compute the output as: if net > θ, then output = 1, else output = 0

[Diagram: perceptron - net input → output]

Page 11: Backprop, 25 years later…

Perceptron Learning Demonstration

• Output activation rule:
  • First, compute the net input to the output unit: net = Σ_i w_i x_i
  • If net > θ, then output = 1, else output = 0
• Learning rule:
  • If output is 1 and should be 0, then lower the weights to active inputs and raise the threshold (θ)
  • If output is 0 and should be 1, then raise the weights to active inputs and lower the threshold (θ)
  ("active input" means x_i = 1, not 0)
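The activation and learning rules above can be sketched in a few lines of Python. The OR task, initial weights, and epoch count are illustrative choices, not from the talk:

```python
# Minimal perceptron with a threshold, following the rules on this slide.
# Trained on the OR function (linearly separable, so it converges).

def perceptron_output(w, theta, x):
    net = sum(wi * xi for wi, xi in zip(w, x))  # net = sum_i w_i * x_i
    return 1 if net > theta else 0

def train_perceptron(patterns, epochs=20):
    w, theta = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, teacher in patterns:
            out = perceptron_output(w, theta, x)
            if out == 1 and teacher == 0:    # lower weights to active inputs, raise theta
                w = [wi - xi for wi, xi in zip(w, x)]
                theta += 1.0
            elif out == 0 and teacher == 1:  # raise weights to active inputs, lower theta
                w = [wi + xi for wi, xi in zip(w, x)]
                theta -= 1.0
    return w, theta

or_patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, theta = train_perceptron(or_patterns)
```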

Page 12: Backprop, 25 years later…


Characteristics of perceptron learning

• Supervised learning: We give it a set of input-output examples for it to model the function (a teaching signal)

• Error correction learning: only correct it when it is wrong.

• Random presentation of patterns.

• Slow! Learning on some patterns ruins learning on others.

Page 13: Backprop, 25 years later…

Perceptron Learning Made Simple

• Output activation rule:
  • First, compute the net input to the output unit: net = Σ_i w_i x_i
  • If net > θ, then output = 1, else output = 0
• Learning rule:
  • If output is 1 and should be 0, then lower the weights to active inputs and raise the threshold (θ)
  • If output is 0 and should be 1, then raise the weights to active inputs and lower the threshold (θ)

Page 14: Backprop, 25 years later…

Perceptron Learning Made Simple

• Learning rule:
  • If output is 1 and should be 0, then lower the weights to active inputs and raise the threshold (θ)
  • If output is 0 and should be 1, then raise the weights to active inputs and lower the threshold (θ)
• Written as a single update:
  w_i(t+1) = w_i(t) + α·(teacher - output)·x_i
  (α is the learning rate)
• This is known as the delta rule, because learning is based on the delta (difference) between what you did and what you should have done: δ = (teacher - output)
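The update formula can be checked directly in code. Folding the threshold into a bias weight with constant input 1 is a standard trick the slide doesn't spell out; the AND task and the α value are illustrative:

```python
# The delta rule: w_i(t+1) = w_i(t) + alpha * delta * x_i, delta = (teacher - output).
# The threshold is absorbed into a bias weight whose input is always 1.

ALPHA = 0.25  # learning rate (alpha)

def output(w, x):
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else 0

def delta_rule_train(patterns, epochs=25):
    w = [0.0, 0.0, 0.0]             # two inputs plus the bias weight
    for _ in range(epochs):
        for x, teacher in patterns:
            xb = list(x) + [1]      # append the constant bias input
            delta = teacher - output(w, xb)
            w = [wi + ALPHA * delta * xi for wi, xi in zip(w, xb)]
    return w

and_patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = delta_rule_train(and_patterns)
```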

Page 15: Backprop, 25 years later…

Problems with perceptrons

• The learning rule comes with a great guarantee: anything a perceptron can compute, it can learn to compute.
• Problem: Lots of things were not computable, e.g., XOR (Minsky & Papert, 1969)
• Minsky & Papert said:
  • If you had hidden units, you could compute any boolean function.
  • But no learning rule exists for such multilayer networks, and we don't think one will ever be discovered.

Page 16: Backprop, 25 years later…


Problems with perceptrons

Page 17: Backprop, 25 years later…


Aside about perceptrons

• They didn’t have hidden units - but Rosenblatt assumed nonlinear preprocessing!

• Hidden units compute features of the input

• The nonlinear preprocessing is a way to choose features by hand.

• Support Vector Machines essentially do this in a principled way, followed by a (highly sophisticated) perceptron learning algorithm.

Page 18: Backprop, 25 years later…

Enter Rumelhart, Hinton, & Williams (1985)

• Discovered a learning rule for networks with hidden units.
• Works a lot like the perceptron algorithm:
  • Randomly choose an input-output pattern
  • Present the input, let activation propagate through the network
  • Give the teaching signal
  • Propagate the error back through the network (hence the name back propagation)
  • Change the connection strengths according to the error

Page 19: Backprop, 25 years later…

Enter Rumelhart, Hinton, & Williams (1985)

• The actual algorithm uses the chain rule of calculus to go downhill in an error measure with respect to the weights.
• The hidden units must learn features that solve the problem.

[Diagram: inputs → hidden units → outputs; activation flows forward, error flows backward]
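The procedure can be sketched in a few lines of numpy. The network size, learning rate, and epoch count are my own illustrative choices, and for simplicity the updates here are batched over all four patterns rather than applied pattern-by-pattern:

```python
import numpy as np

# Backprop on XOR: forward pass, error propagated backward via the chain rule,
# then a downhill step on the squared error with respect to the weights.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # teaching signal (XOR)

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)    # input -> hidden
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)    # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(20000):
    h = sigmoid(X @ W1 + b1)                      # forward: activation
    y = sigmoid(h @ W2 + b2)
    dy = (y - T) * y * (1 - y)                    # backward: error at the output
    dh = (dy @ W2.T) * h * (1 - h)                # error at the hidden units
    W2 -= lr * h.T @ dy; b2 -= lr * dy.sum(0)     # go downhill in the error
    W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(0)

y = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
predictions = (y > 0.5).astype(int)
```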

Page 20: Backprop, 25 years later…

XOR

• Here, the hidden units learned AND and OR - two features that, when combined appropriately, can solve the problem.

[Animation: back propagation learning takes a random network to an XOR network whose hidden units compute AND and OR]

Page 21: Backprop, 25 years later…

XOR

But, depending on initial conditions, there are an infinite number of ways to do XOR - backprop can surprise you with innovative solutions.

[Animation: back propagation learning takes a random network to an XOR network]

Page 22: Backprop, 25 years later…

Why is/was this wonderful?

• Efficiency
• Learns internal representations
• Learns internal representations
• Learns internal representations
• Generalizes to recurrent networks

Page 23: Backprop, 25 years later…

Hinton's Family Trees example

• Idea: Learn to represent relationships between people that are encoded in a family tree:

Page 24: Backprop, 25 years later…

Hinton's Family Trees example

• Idea 2: Learn distributed representations of concepts: localist outputs
• Learn: features of these entities useful for solving the task
• Input: localist people, localist relations
• Localist: one unit "ON" to represent each item

Page 25: Backprop, 25 years later…

People hidden units: Hinton diagram

• The corresponding people in the two trees are above/below one another; English above, Italian below

Page 26: Backprop, 25 years later…

People hidden units: Hinton diagram

• What is unit 1 encoding?

Page 27: Backprop, 25 years later…

People hidden units: Hinton diagram

• What is unit 2 encoding?

Page 28: Backprop, 25 years later…

People hidden units: Hinton diagram

• What is unit 6 encoding?

Page 29: Backprop, 25 years later…

Relation units

• What does the upper right one encode?

Page 30: Backprop, 25 years later…

Lessons

• The network learns features in the service of the task - i.e., it learns features on its own.

• This is useful if we don't know what the features ought to be.

• Can explain some human phenomena

Page 31: Backprop, 25 years later…

Another example

• In the next example(s), I make two points:

• The perceptron algorithm is still useful!

• Representations learned in the service of the task can explain the “Visual Expertise Mystery”

Page 32: Backprop, 25 years later…

A Face Processing System

[Diagram: Pixel (Retina) Level → Gabor Filtering (Perceptual/V1 Level) → PCA (Object/IT Level) → Neural Net (Category Level: Happy, Sad, Afraid, Angry, Surprised, Disgusted)]

Page 33: Backprop, 25 years later…

The Face Processing System

[Diagram: the same pipeline, now with identity outputs - Bob, Carol, Ted, Alice]


Page 35: Backprop, 25 years later…

The Face Processing System

[Diagram: the same pipeline, with a Feature Level and mixed outputs - Bob, Carol, Ted, Cup, Can, Book]

Page 36: Backprop, 25 years later…

The Gabor Filter Layer

• Basic feature: the 2-D Gabor wavelet filter (Daugman, 1985)
• These model the processing in early visual areas

[Diagram: the image is convolved with Gabor filters; the magnitudes are subsampled in a 29x36 grid]
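A 2-D Gabor wavelet is a complex sinusoid under a Gaussian envelope. The sketch below builds one and takes the magnitude of its response to a patch, as the grid subsampling above does at each location; all parameter values are placeholders, not the model's actual settings:

```python
import numpy as np

# A 2-D Gabor wavelet filter of the kind used to model V1 simple cells.

def gabor_kernel(size=21, wavelength=6.0, theta=0.0, sigma=4.0):
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = xs * np.cos(theta) + ys * np.sin(theta)     # rotate into filter coordinates
    yr = -xs * np.sin(theta) + ys * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))  # Gaussian window
    carrier = np.exp(1j * 2 * np.pi * xr / wavelength)          # complex sinusoid
    return envelope * carrier

def gabor_magnitude(patch, kernel):
    # magnitude of the filter response for one image patch (one grid location)
    return abs(np.sum(patch * kernel))

patch = np.random.default_rng(1).random((21, 21))
kernel = gabor_kernel()
mag = gabor_magnitude(patch, kernel)
```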

Page 37: Backprop, 25 years later…

Principal Components Analysis

• The Gabor filters give us 40,600 numbers
• We use PCA to reduce this to 50 numbers
• PCA is like Factor Analysis: It finds the underlying directions of maximum variance
• PCA can be computed in a neural network through a competitive Hebbian learning mechanism
• Hence this is also a biologically plausible processing step
• We suggest this leads to representations similar to those in Inferior Temporal cortex
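The reduction step can be sketched with a singular value decomposition; the slide's 40,600 → 50 reduction is replaced here by a smaller synthetic matrix for speed:

```python
import numpy as np

# PCA via the SVD: project each pattern onto the top 50 directions of
# maximum variance of the mean-centered data.
rng = np.random.default_rng(2)
data = rng.normal(size=(100, 2000))       # 100 "faces", 2000 Gabor numbers each

mean = data.mean(axis=0)
centered = data - mean                    # PCA operates on mean-centered data
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt[:50]                      # top 50 principal directions
reduced = centered @ components.T         # 50 numbers per face
```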

Page 38: Backprop, 25 years later…

How to do PCA with a neural network
(Cottrell, Munro & Zipser, 1987; Cottrell & Fleming, 1990; Cottrell & Metcalfe, 1990; O'Toole et al., 1991)

A self-organizing network that learns whole-object representations (features, Principal Components, Holons, eigenfaces)

[Diagram: input from the Perceptual Layer feeds a layer of Holons (the Gestalt layer)]
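The slide's network extracts principal components by competitive Hebbian learning. As a minimal stand-in (not the cited papers' architecture), here is one-unit Oja's rule, whose weight vector converges to the first principal component of synthetic data:

```python
import numpy as np

# Oja's rule: a Hebbian update (eta * y * x) with a decay term (-eta * y^2 * w)
# that keeps the weight vector bounded and aligned with the first PC.
rng = np.random.default_rng(3)
v1 = np.array([0.8, 0.6])                            # true dominant direction
X = rng.normal(size=(5000, 1)) * 3.0 * v1 + rng.normal(size=(5000, 2)) * 0.3

w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = w @ x                                        # the unit's output
    w += eta * y * (x - y * w)                       # Hebbian term with Oja's decay

cosine = abs(w @ v1) / np.linalg.norm(w)             # alignment with the true PC
```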


Page 45: Backprop, 25 years later…

The Final Layer: Classification
(Cottrell & Fleming, 1990; Cottrell & Metcalfe, 1990; Padgett & Cottrell, 1996; Dailey & Cottrell, 1999; Dailey et al., 2002)

The holistic representation is then used as input to a categorization network trained by supervised learning.

• Excellent generalization performance demonstrates the sufficiency of the holistic representation for recognition.

[Diagram: Holons → Categories; output units for Cup, Can, Book, Greeble, Face, Bob, Carol, Ted, Happy, Sad, Afraid, etc.]

Page 46: Backprop, 25 years later…

The Final Layer: Classification

• Categories can be at different levels: basic, subordinate.
• Simple learning rule (~delta rule). It says (mild lie here):
  • Add inputs to your weights (synaptic strengths) when you are supposed to be on,
  • Subtract them when you are supposed to be off.
• This makes your weights "look like" your favorite patterns - the ones that turn you on.
• When there are no hidden units => no back propagation of error.
• When there are hidden units: we get task-specific features (most interesting when we use the basic/subordinate distinction).

Page 47: Backprop, 25 years later…

Facial Expression Database

• Ekman and Friesen quantified the muscle movements (Facial Actions) involved in prototypical portrayals of happiness, sadness, fear, anger, surprise, and disgust.
• Result: the Pictures of Facial Affect Database (1976).
• 70% agreement on emotional content by naive human subjects.
• 110 images, 14 subjects, 7 expressions.

[Images: actor "JJ" - the easiest for humans (and our model) to classify - portraying Anger, Disgust, Neutral, Surprise, Happiness (twice), Fear, and Sadness]

Page 48: Backprop, 25 years later…

Results (Generalization)

• Kendall's tau (rank order correlation): .667, p=.0441
• Note: This is an emergent property of the model!

Expression   Network % Correct   Human % Agreement
Happiness    100.0%              98.7%
Surprise     100.0%              92.4%
Disgust      100.0%              92.3%
Anger        89.2%               88.9%
Sadness      82.9%               89.2%
Fear         66.7%               87.7%
Average      89.9%               91.6%

Page 49: Backprop, 25 years later…


Correlation of Net/Human Errors

• Like all good Cognitive Scientists, we like our models to make the same mistakes people do!

• Networks and humans have a 6x6 confusion matrix for the stimulus set.

• This suggests looking at the off-diagonal terms: The errors

• Correlation of off-diagonal terms: r = 0.567. [F (1,28) = 13.3; p = 0.0011]

• Again, this correlation is an emergent property of the model: It was not told which expressions were confusing.

Page 50: Backprop, 25 years later…

Examining the Net's Representations

• We want to visualize "receptive fields" in the network.
• But the Gabor magnitude representation is noninvertible.
• We can learn an approximate inverse mapping, however.
• We used linear regression to find the best linear combination of Gabor magnitude principal components for each image pixel.
• Then projecting each unit's weight vector into image space with the same mapping visualizes its "receptive field."
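The approximate inverse described above can be sketched as one linear regression per pixel, solved jointly by least squares; the data below are synthetic stand-ins for the Gabor-magnitude principal components:

```python
import numpy as np

# Regress pixel values on PCA coefficients (plus a constant column, so each
# pixel gets a "y-intercept"), then push a weight vector through the learned
# map to render a "receptive field" in image space.
rng = np.random.default_rng(4)
n_images, n_components, n_pixels = 200, 10, 64

comps = rng.normal(size=(n_images, n_components))   # PCA coefficients per image
true_map = rng.normal(size=(n_components, n_pixels))
pixels = comps @ true_map + 0.5                     # synthetic "images"

A = np.hstack([comps, np.ones((n_images, 1))])      # constant column = intercept
coef, *_ = np.linalg.lstsq(A, pixels, rcond=None)

# Project a weight vector (a direction in PCA space) into image space,
# dropping the intercept term, to visualize its "receptive field."
weight_vector = np.zeros(n_components)
weight_vector[0] = 1.0
receptive_field = np.append(weight_vector, 0.0) @ coef
```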

Page 51: Backprop, 25 years later…


Examining the Net’s Representations

• The “y-intercept” coefficient for each pixel is simply the average pixel value at that location over all faces, so subtracting the resulting “average face” shows more precisely what the units attend to:

• Apparently local features appear in the global templates.

Page 52: Backprop, 25 years later…

Modeling Six-Way Forced Choice

• Overall correlation r=.9416, with NO FIT PARAMETERS!

[Plot: model vs. human six-way forced choice for Happiness, Fear, Sadness, Disgust, Anger, Surprise]

Page 53: Backprop, 25 years later…

Model Discrimination Scores

• The model fits the data best at a precategorical layer: the layer we call the "object" layer, NOT at the category level.

[Plot: percent correct discrimination across Happiness, Fear, Sadness, Disgust, Anger, Surprise - human data vs. model output layer (R=0.36) and model object layer (R=0.61)]

Page 54: Backprop, 25 years later…

Reaction Time: Human/Model

[Plot: human subjects' reaction time vs. model reaction time (1 - max_output) across Happiness, Fear, Sadness, Disgust, Anger, Surprise]

• Correlation between model & data: .6771, p<.001

Page 55: Backprop, 25 years later…

Mix Detection in the Model

[Plot: average rank score (-0.5 to 1.5) vs. percent mixing (10, 30, 50) for the mixed-in expression and unrelated expressions, in humans and networks]

Can the network account for the continuous data as well as the categorical data? YES.

Page 56: Backprop, 25 years later…

Human/Model Circumplexes

These are derived from similarities between images using non-metric multi-dimensional scaling.
For humans: similarity is correlation between 6-way forced-choice button pushes.
For networks: similarity is correlation between 6-category output vectors.

Page 57: Backprop, 25 years later…

Discussion

• Our model of facial expression recognition:
  • Performs the same task people do
  • On the same stimuli
  • At about the same accuracy
• Without actually "feeling" anything, and without any access to the surrounding culture, it nevertheless:
  • Organizes the faces in the same order around the circumplex
  • Correlates very highly with human responses
  • Has about the same rank order of difficulty in classifying the emotions

Page 58: Backprop, 25 years later…

The Visual Expertise Mystery

• Why would a face area process BMWs?
  • Behavioral & brain data
  • Model of expertise
  • Results

Page 59: Backprop, 25 years later…


Are you a perceptual expert?

Take the expertise test!!!**

“Identify this object with the first name that comes to mind.”

**Courtesy of Jim Tanaka, University of Victoria

Page 60: Backprop, 25 years later…


“Car” - Not an expert

“2002 BMW Series 7” - Expert!

Page 61: Backprop, 25 years later…


“Bird” or “Blue Bird” - Not an expert

“Indigo Bunting” - Expert!

Page 62: Backprop, 25 years later…

"Face" or "Man" - Not an expert

"George Dubya" - Expert!

"Jerk" or "Megalomaniac" - Democrat!

Page 63: Backprop, 25 years later…

How is an object to be named?

Bird - Basic Level (Rosch et al., 1971)
Indigo Bunting - Subordinate (Species) Level
Animal - Superordinate Level

Page 64: Backprop, 25 years later…

Entry Point Recognition

[Diagram: visual analysis arrives at the entry point, "Bird" (basic level); semantic analysis continues up to "Animal"; fine-grain visual analysis continues down to "Indigo Bunting" - the Downward Shift Hypothesis]

Page 65: Backprop, 25 years later…

Dog and Bird Expert Study

• Each expert had at least 10 years of experience in their respective domain of expertise.
• None of the participants were experts in both dogs and birds.
• Participants provided their own controls.

Tanaka & Taylor, 1991

Page 66: Backprop, 25 years later…

Object Verification Task

[Diagram: verification at three levels, answered YES or NO - Superordinate: Animal vs. Plant; Basic: Bird vs. Dog; Subordinate: Robin vs. Sparrow]

Page 67: Backprop, 25 years later…

[Plot: mean reaction time (msec, 600-900) at the superordinate (Animal), basic (Bird/Dog), and subordinate (Robin/Beagle) levels for the expert and novice domains, showing the downward shift]

Dog and bird experts recognize objects in their domain of expertise at subordinate levels.

Page 68: Backprop, 25 years later…

Is face recognition a general form of perceptual expertise?

[Images: George W. Bush; Indigo Bunting; 2002 BMW Series 7]

Page 69: Backprop, 25 years later…

[Plot: mean reaction time (msec, 600-1200) at the superordinate, basic, and subordinate levels for faces vs. objects (Tanaka, 2001), showing the downward shift]

Face experts recognize faces at the individual level of unique identity.

Page 70: Backprop, 25 years later…

Event-related Potentials and Expertise

[ERP plots: the N170 component for faces (Bentin, Allison, Puce, Perez & McCarthy, 1996) and for object experts in their novice vs. expert domains (Tanaka & Curran, 2001; see also Gauthier, Curran, Curby & Collins, 2003, Nature Neuro.)]

Page 71: Backprop, 25 years later…

Neuroimaging of face, bird and car experts

[fMRI: fusiform gyrus activation in "face experts," car experts (Cars-Objects), and bird experts (Birds-Objects)] (Gauthier et al., 2000)

Page 72: Backprop, 25 years later…

How to identify an expert?

Behavioral benchmarks of expertise
• Downward shift in entry point recognition
• Improved discrimination of novel exemplars from learned and related categories

Neurological benchmarks of expertise
• Enhancement of the N170 ERP brain component
• Increased activation of the fusiform gyrus

Page 73: Backprop, 25 years later…


End of Tanaka Slides

Page 74: Backprop, 25 years later…

• Kanwisher showed the FFA is specialized for faces
• But she forgot to control for what???

Page 75: Backprop, 25 years later…


Greeble Experts (Gauthier et al. 1999)

• Subjects trained over many hours to recognize individual Greebles.

• Activation of the FFA increased for Greebles as the training proceeded.

Page 76: Backprop, 25 years later…

The "visual expertise mystery"

• If the so-called "Fusiform Face Area" (FFA) is specialized for face processing, then why would it also be used for cars, birds, dogs, or Greebles?
• Our view: the FFA is an area associated with a process: fine-level discrimination of homogeneous categories.
• But the question remains: why would an area that presumably starts as a face area get recruited for these other visual tasks? Surely, they don't share features, do they?

Sugimoto & Cottrell (2001), Proceedings of the Cognitive Science Society

Page 77: Backprop, 25 years later…

Solving the mystery with models

• Main idea:
  • There are multiple visual areas that could compete to be the Greeble expert - "basic" level areas and the "expert" (FFA) area.
  • The expert area must use features that distinguish similar-looking inputs -- that's what makes it an expert.
  • Perhaps these features will be useful for other fine-level discrimination tasks.
• We will create:
  • Basic level models - trained to identify an object's class
  • Expert level models - trained to identify individual objects
• Then we will put them in a race to become Greeble experts.
• Then we can deconstruct the winner to see why it won.

Sugimoto & Cottrell (2001), Proceedings of the Cognitive Science Society

Page 78: Backprop, 25 years later…

Model Database

• A network that can differentiate faces, books, cups and cans is a "basic level network."
• A network that can also differentiate individuals within ONE class (faces, cups, cans OR books) is an "expert."

Page 79: Backprop, 25 years later…

Model

• Pretrain two groups of neural networks on different tasks.
• Compare their abilities to learn a new individual Greeble classification task.

[Diagram: networks share a hidden layer; expert networks are trained to output individuals (Bob, Carol, Ted; cup, can, book) before learning Greeble1-3, while non-expert networks are trained to output classes (face, cup, can, book) before learning Greeble1-3]
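A toy version of the race can be sketched in numpy: pretrain one network to individuate (expert) and one to categorize (basic), then train a fresh Greeble readout on each network's frozen hidden layer. The stimuli, layer sizes, and task sizes below are synthetic stand-ins, not the study's materials:

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain(X, T, hidden=20, epochs=2000, lr=0.05):
    # One-hidden-layer backprop network trained on the pretraining task.
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden))
    W2 = rng.normal(0, 0.5, (hidden, T.shape[1]))
    for _ in range(epochs):
        h = sigmoid(X @ W1)
        y = sigmoid(h @ W2)
        dy = (y - T) * y * (1 - y)
        dh = (dy @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ dy
        W1 -= lr * X.T @ dh
    return W1                                    # the learned features

def greeble_loss(W1, X, T, epochs=500, lr=0.05):
    h = sigmoid(X @ W1)                          # frozen pretrained features
    W2 = np.zeros((h.shape[1], T.shape[1]))      # fresh Greeble readout
    for _ in range(epochs):
        y = sigmoid(h @ W2)
        W2 -= lr * h.T @ ((y - T) * y * (1 - y))
    return float(((sigmoid(h @ W2) - T) ** 2).mean())

X = rng.normal(size=(60, 30))                    # 60 stimuli, 30 input features
T_individual = np.eye(60)                        # expert task: name individuals
T_class = np.repeat(np.eye(4), 15, axis=0)       # basic task: 4 classes
T_greeble = np.repeat(np.eye(6), 10, axis=0)     # new task: 6 Greebles

expert_loss = greeble_loss(pretrain(X, T_individual), X, T_greeble)
basic_loss = greeble_loss(pretrain(X, T_class), X, T_greeble)
```

On the synthetic data both readouts learn; whether the expert wins, as in the actual simulations, depends on the stimulus structure, which this sketch does not reproduce.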

Page 80: Backprop, 25 years later…

Expertise begets expertise

• Learning to individuate cups, cans, books, or faces first leads to faster learning of Greebles (can't try this with kids!!!).
• The more expertise, the faster the learning of the new task!
• Hence, in a competition with the object area, the FFA would win.
• If our parents were cans, the FCA (Fusiform Can Area) would win.

[Plot: amount of training required to be a Greeble expert vs. training time on the first task]

Page 81: Backprop, 25 years later…

Entry Level Shift: Subordinate RT decreases with training
(RT = uncertainty of response = 1.0 - max(output))

[Plots: human data - subordinate vs. basic RT; network data - RT vs. # of training sessions]
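The reaction-time proxy above is a one-liner; the output vectors below are made-up examples:

```python
# Model "reaction time" as the uncertainty of the response: RT = 1.0 - max(output).

def model_rt(output):
    return 1.0 - max(output)

rt_fast = model_rt([0.05, 0.90, 0.05])   # sharp, confident response -> low RT
rt_slow = model_rt([0.40, 0.35, 0.25])   # graded, uncertain response -> high RT
```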

Page 82: Backprop, 25 years later…

How do experts learn the task?

• Expert level networks must be sensitive to within-class variation:
  • Representations must amplify small differences
• Basic level networks must ignore within-class variation:
  • Representations should reduce differences

Page 83: Backprop, 25 years later…

Observing hidden layer representations

• Principal Components Analysis on hidden unit activations:
  • PCA of hidden unit activations allows us to reduce the dimensionality (to 2) and plot representations.
  • We can then observe how tightly clustered stimuli are in a low-dimensional subspace.
• We expect basic level networks to separate classes, but not individuals.
• We expect expert networks to separate classes and individuals.

Page 84: Backprop, 25 years later…

Subordinate level training magnifies small differences within object representations

[Plots: hidden layer representations at 1, 80, and 1280 epochs for the face, basic, and Greeble networks]

Page 85: Backprop, 25 years later…

Greeble representations are spread out prior to Greeble training

[Plots: face, basic, and Greeble network representations]

Page 86: Backprop, 25 years later…

Examining the Net's Representations

• We want to visualize "receptive fields" in the network.
• But the Gabor magnitude representation is noninvertible.
• We can learn an approximate inverse mapping, however.
• We used linear regression to find the best linear combination of Gabor magnitude principal components for each image pixel.
• Then projecting each hidden unit's weight vector into image space with the same mapping visualizes its "receptive field."

Page 87: Backprop, 25 years later…

Two hidden unit receptive fields

[Images: HU 16 and HU 36, after training as a face expert and after further training on Greebles]

NOTE: These are not face-specific!

Page 88: Backprop, 25 years later…

The Face and Object Processing System

[Diagram: Pixel (Retina) Level → Gabor Filtering (Perceptual/V1 Level) → PCA (Gestalt Level) → Hidden Layer → Category Level, with an Expert Level Classifier (Book, Can, Cup, Face, Bob, Sue, Jane - identified with the FFA) and a Basic Level Classifier (Book, Can, Cup, Face - identified with the LOC)]

Page 89: Backprop, 25 years later…

Effects accounted for by this model

• Categorical effects in facial expression recognition (Dailey et al., 2002)
• Similarity effects in facial expression recognition (ibid.)
• Why fear is "hard" (ibid.)
• Rank of difficulty in configural, featural, and inverted face discrimination (Zhang & Cottrell, 2006)
• Holistic processing of identity and expression, and the lack of interaction between the two (Cottrell et al., 2000)
• How a specialized face processor may develop under realistic developmental constraints (Dailey & Cottrell, 1999)
• How the FFA could be recruited for other areas of expertise (Sugimoto & Cottrell, 2001; Joyce & Cottrell, 2004; Tran et al., 2004; Tong et al., 2005)
• The other-race advantage in visual search (Haque & Cottrell, 2005)
• Generalization gradients in expertise (Nguyen & Cottrell, 2005)
• Priming effects (face or character) on discrimination of ambiguous Chinese characters (McCleery et al., under review)
• Age of Acquisition effects in face recognition (Lake & Cottrell, 2005; L&C, under review)
• Memory for faces (Dailey et al., 1998, 1999)
• Cultural effects in facial expression recognition (Dailey et al., in preparation)

Page 90: Backprop, 25 years later…


Backprop, 25 years later

• Backprop is important because it was the first relatively efficient method for learning internal representations

• Recent advances have made deeper networks possible

• This is important because we don’t know how the brain uses transformations to recognize objects across a wide array of variations (e.g., the Halle Berry neuron)

Page 91: Backprop, 25 years later…


• E.g., the “Halle Berry” neuron…

Page 92: Backprop, 25 years later…


END