1
Backprop, 25 years later…
Garrison W. Cottrell
Gary's Unbelievable Research Unit (GURU)
Computer Science and Engineering Department
Institute for Neural Computation
UCSD
Back propagation, 25 years later
2
But first…
Hal White passed away March 31st, 2012
• Hal was “our theoretician of neural nets,” and one of the nicest guys I knew.
• His paper on “A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity” has been cited 15,805 times, and led to him being shortlisted for the Nobel Prize.
• But his paper with Max Stinchcombe: “Multilayer feedforward networks are universal approximators” is his second most-cited paper, at 8,114 cites.
3
But first…
• In yet another paper (in Neural Computation, 1989), he wrote:
“The premise of this article is that learning procedures used to train artificial neural networks are inherently statistical techniques. It follows that statistical theory can provide considerable insight into the properties, advantages, and disadvantages of different network learning methods…”
This was one of the first papers to make the connection between neural networks and statistical models - and thereby put them on a sound statistical foundation.
4
We should also remember…
Dave E. Rumelhart passed away on March 13, 2011
• Many had invented back propagation; few could appreciate as deeply as Dave did what they had when they discovered it.
5
What is backpropagation, and why is/was it important?
• We have billions and billions of neurons that somehow work together to create the mind.
• These neurons are connected by 10^14 - 10^15 synapses, which we think encode the “knowledge” in the network - too many for us to explicitly program them in our models.
• Rather, we need some way to set them indirectly, via a procedure that changes the synaptic strengths (which we call weights) to achieve some goal.
• This is called learning in these systems.
6
Learning: A bit of history
• Frank Rosenblatt studied a simple version of a neural net called a perceptron:
• A single layer of processing
• Binary output
• Can compute simple things like (some) boolean functions (OR, AND, etc.)
7
Learning: A bit of history
[Diagram: a perceptron - the inputs combine into a net input, which produces a binary output]
8
Learning: A bit of history
9
Learning: A bit of history
• Rosenblatt (1962) discovered a learning rule for perceptrons called the perceptron convergence procedure.
• Guaranteed to learn anything computable (by a two-layer perceptron)
• Unfortunately, not everything was computable (Minsky & Papert, 1969)
10
Perceptron Learning Demonstration
• Output activation rule:
• First, compute the net input to the output unit: net = Σᵢ wᵢxᵢ
• Then, compute the output as: if net > θ (the threshold), then output = 1; else output = 0
[Diagram: perceptron with net input and output]
11
Perceptron Learning Demonstration
• Output activation rule:
• First, compute the net input to the output unit: net = Σᵢ wᵢxᵢ
• If net > θ then output = 1; else output = 0
• Learning rule:
If output is 1 and should be 0, then lower the weights to active inputs and raise the threshold (θ)
If output is 0 and should be 1, then raise the weights to active inputs and lower the threshold (θ)
(“active input” means xᵢ = 1, not 0)
12
Characteristics of perceptron learning
• Supervised learning: Give it a set of input-output examples for it to model the function (a teaching signal)
• Error correction learning: only correct it when it is wrong.
• Random presentation of patterns.
• Slow! Learning on some patterns ruins learning on others.
13
Perceptron Learning Made Simple
• Output activation rule:
• First, compute the net input to the output unit: net = Σᵢ wᵢxᵢ
• If net > θ then output = 1; else output = 0
• Learning rule:
If output is 1 and should be 0, then lower the weights to active inputs and raise the threshold (θ)
If output is 0 and should be 1, then raise the weights to active inputs and lower the threshold (θ)
14
Perceptron Learning Made Simple
• Learning rule:
If output is 1 and should be 0, then lower the weights to active inputs and raise the threshold (θ)
If output is 0 and should be 1, then raise the weights to active inputs and lower the threshold (θ)
• As a single update rule:
wᵢ(t+1) = wᵢ(t) + α·(teacher - output)·xᵢ
(α is the learning rate)
• This is known as the delta rule, because learning is based on the delta (difference) between what you did and what you should have done: δ = (teacher - output)
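The delta rule above can be sketched in a few lines of numpy, with the threshold θ folded in as a bias weight on a constant input of 1 (so "raise the threshold" becomes "lower the bias weight"). The learning rate, initial weights, and epoch count below are illustrative choices, not from the talk; the net is trained on OR, a function a single-layer perceptron can learn.

```python
import numpy as np

# Delta-rule perceptron, as on the slide:
#   w_i(t+1) = w_i(t) + alpha*(teacher - output)*x_i
# The threshold is folded in as a bias weight on a constant input of 1.
# (alpha, the initial weights, and the epoch count are illustrative.)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = [0, 1, 1, 1]                              # OR targets
Xb = np.hstack([X, np.ones((4, 1))])          # append the bias input

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)
alpha = 0.5

for epoch in range(20):
    for x, t in zip(Xb, T):
        out = 1 if x @ w > 0 else 0           # threshold activation
        w += alpha * (t - out) * x            # delta rule update

preds = [1 if x @ w > 0 else 0 for x in Xb]
print(preds)    # converges to the OR targets
```

Because OR is linearly separable, the perceptron convergence theorem guarantees this loop settles on correct weights.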
15
Problems with perceptrons
• The learning rule comes with a great guarantee: anything a perceptron can compute, it can learn to compute.
• Problem: Lots of things were not computable, e.g., XOR (Minsky & Papert, 1969)
• Minsky & Papert said:
• If you had hidden units, you could compute any boolean function.
• But no learning rule exists for such multilayer networks, and we don’t think one will ever be discovered.
16
Problems with perceptrons
17
Aside about perceptrons
• They didn’t have hidden units - but Rosenblatt assumed nonlinear preprocessing!
• Hidden units compute features of the input
• The nonlinear preprocessing is a way to choose features by hand.
• Support Vector Machines essentially do this in a principled way, followed by a (highly sophisticated) perceptron learning algorithm.
18
Enter Rumelhart, Hinton, & Williams (1985)
• Discovered a learning rule for networks with hidden units.
• Works a lot like the perceptron algorithm:
• Randomly choose an input-output pattern
• Present the input, and let activation propagate through the network
• Give the teaching signal
• Propagate the error back through the network (hence the name back propagation)
• Change the connection strengths according to the error
19
Enter Rumelhart, Hinton, & Williams (1985)
• The actual algorithm uses the chain rule of calculus to go downhill in an error measure with respect to the weights
• The hidden units must learn features that solve the problem
[Diagram: INPUTS feed Hidden Units feed OUTPUTS; activation flows forward, error flows backward]
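A minimal numpy sketch of the procedure, assuming a 2-2-1 sigmoid network trained by full-batch gradient descent on squared error (the layer sizes, learning rate, and iteration count are illustrative choices, not from the talk):

```python
import numpy as np

# Minimal backprop for XOR: the backward pass applies the chain rule
# layer by layer, exactly as the slide describes.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)     # XOR targets

rng = np.random.default_rng(1)
W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)
W2 = rng.normal(size=(2, 1)); b2 = np.zeros(1)
lr = 1.0

def forward(X):
    h = sigmoid(X @ W1 + b1)        # hidden features (e.g. ~AND, ~OR)
    y = sigmoid(h @ W2 + b2)
    return h, y

loss0 = np.mean((forward(X)[1] - T) ** 2)

for _ in range(10000):
    h, y = forward(X)
    dy = (y - T) * y * (1 - y)              # error signal at the output
    dh = (dy @ W2.T) * h * (1 - h)          # error propagated backward
    W2 -= lr * h.T @ dy; b2 -= lr * dy.sum(axis=0)
    W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(axis=0)

loss1 = np.mean((forward(X)[1] - T) ** 2)
print(loss0, "->", loss1)                   # the error goes downhill
```

Note that, depending on the random initial weights, gradient descent on XOR can land in different solutions, which is exactly the point made on the XOR slides below.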
20
XOR
• Here, the hidden units learned AND and OR - two features that when combined appropriately, can solve the problem
[Figure: back propagation learning turns a random network into an XOR network; the hidden units become AND and OR]
21
XOR
But, depending on initial conditions, there are an infinite number of ways to do XOR - backprop can surprise you with innovative solutions.
22
Why is/was this wonderful?
• Efficiency
• Learns internal representations
• Learns internal representations
• Learns internal representations
• Generalizes to recurrent networks
23
Hinton’s Family Trees example
• Idea: Learn to represent relationships between people that are encoded in a family tree:
24
Hinton’s Family Trees example
• Idea 2: Learn distributed representations of concepts: localist outputs
[Diagram: input = localist people + localist relations; hidden layers learn features of these entities useful for solving the task. Localist: one unit “ON” to represent each item]
25
People hidden units: Hinton diagram
• The corresponding people in the two trees are above/below one another; English above, Italian below
26
People hidden units: Hinton diagram
• What does unit 1 encode?
27
People hidden units: Hinton diagram
• What does unit 2 encode?
28
People hidden units: Hinton diagram
• What does unit 6 encode?
29
Relation units
• What does the upper right one code?
30
Lessons
• The network learns features in the service of the task - i.e., it learns features on its own.
• This is useful if we don’t know what the features ought to be.
• Can explain some human phenomena
31
Another example
• In the next example(s), I make two points:
• The perceptron algorithm is still useful!
• Representations learned in the service of the task can explain the “Visual Expertise Mystery”
32
A Face Processing System
[Diagram: Pixel (Retina) Level → Gabor Filtering → Perceptual (V1) Level → PCA → Object (IT) Level → Neural Net → Category Level: Happy, Sad, Afraid, Angry, Surprised, Disgusted]
33
The Face Processing System
[Diagram: same pipeline, now with Category Level outputs Bob, Carol, Ted, Alice]
35
The Face Processing System
[Diagram: same pipeline, now with a Feature level and Category Level outputs Bob, Carol, Ted, Cup, Can, Book]
36
The Gabor Filter Layer
• Basic feature: the 2-D Gabor wavelet filter (Daugman, 1985):
• These model the processing in early visual areas
[Pipeline: convolution of the image with the filters → response magnitudes → subsample in a 29x36 grid]
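One filter of such a bank might be sketched as follows; the kernel size, wavelength, and sigma here are illustrative choices (the actual system uses a bank of filters at several scales and orientations before subsampling the magnitudes on a grid):

```python
import numpy as np

# Sketch of the Gabor filter layer: build one complex 2-D Gabor kernel
# (Gaussian envelope times a complex sinusoid), convolve it with an
# image, and keep the response magnitudes.

def gabor_kernel(size=15, wavelength=6.0, theta=0.0, sigma=3.0):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)       # rotated axis
    gauss = np.exp(-(x**2 + y**2) / (2 * sigma**2))  # envelope
    carrier = np.exp(1j * 2 * np.pi * xr / wavelength)
    return gauss * carrier

def gabor_magnitude(image, kernel):
    # 'valid' 2-D convolution via explicit loops, kept small for clarity
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=complex)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return np.abs(out)        # magnitude discards local phase

img = np.random.default_rng(0).random((32, 32))
mags = gabor_magnitude(img, gabor_kernel())
print(mags.shape)
```

Taking the magnitude of the complex response is what gives the representation some tolerance to small shifts, and also what makes it noninvertible, a point that matters later when visualizing receptive fields.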
37
Principal Components Analysis
• The Gabor filters give us 40,600 numbers
• We use PCA to reduce this to 50 numbers
• PCA is like Factor Analysis: it finds the underlying directions of maximum variance
• PCA can be computed in a neural network through a competitive Hebbian learning mechanism
• Hence this is also a biologically plausible processing step
• We suggest this leads to representations similar to those in Inferior Temporal cortex
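The PCA step can be sketched with an SVD; the talk's point is that a competitive Hebbian network can compute the same projection. The sizes below are scaled-down stand-ins for the real 40,600 → 50 reduction, and the data are random placeholders:

```python
import numpy as np

# Sketch of the PCA step: project high-dimensional Gabor magnitude
# vectors onto their top principal components.

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 406))       # 100 faces x 406 Gabor features
centered = data - data.mean(axis=0)

# Right singular vectors = principal directions of maximum variance,
# ordered by decreasing variance explained.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 5
codes = centered @ Vt[:k].T              # a k-number code per face

print(codes.shape)
```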
38
How to do PCA with a neural network
(Cottrell, Munro & Zipser, 1987; Cottrell & Fleming 1990; Cottrell & Metcalfe 1990; O’Toole et al. 1991)
A self-organizing network that learns whole-object representations (features, Principal Components, Holons, eigenfaces)
[Diagram: input from the Perceptual Layer feeds a layer of Holons (Gestalt layer)]
46
The Final Layer: Classification
(Cottrell & Fleming 1990; Cottrell & Metcalfe 1990; Padgett & Cottrell 1996; Dailey & Cottrell 1999; Dailey et al. 2002)
The holistic representation is then used as input to a categorization network trained by supervised learning.
• Excellent generalization performance demonstrates the sufficiency of the holistic representation for recognition
[Diagram: input from the Perceptual Layer → Holons → Categories. Output: Cup, Can, Book, Greeble, Face, Bob, Carol, Ted, Happy, Sad, Afraid, etc.]
47
The Final Layer: Classification
• Categories can be at different levels: basic, subordinate.
• Simple learning rule (~delta rule). It says (mild lie here):
• Add inputs to your weights (synaptic strengths) when you are supposed to be on,
• subtract them when you are supposed to be off.
• This makes your weights “look like” your favorite patterns - the ones that turn you on.
• When there are no hidden units => no back propagation of error.
• When there are hidden units, we get task-specific features (most interesting when we use the basic/subordinate distinction)
48
Facial Expression Database
• Ekman and Friesen quantified the muscle movements (Facial Actions) involved in prototypical portrayals of happiness, sadness, fear, anger, surprise, and disgust.
• Result: the Pictures of Facial Affect Database (1976).
• 70% agreement on emotional content by naive human subjects.
• 110 images, 14 subjects, 7 expressions.
[Images: actor “JJ” - the easiest for humans (and our model) to classify - portraying Anger, Disgust, Neutral, Surprise, Happiness (twice), Fear, and Sadness]
49
Results (Generalization)
• Kendall’s tau (rank order correlation): .667, p=.0441
• Note: This is an emergent property of the model!
Expression   Network % Correct   Human % Agreement
Happiness    100.0%              98.7%
Surprise     100.0%              92.4%
Disgust      100.0%              92.3%
Anger        89.2%               88.9%
Sadness      82.9%               89.2%
Fear         66.7%               87.7%
Average      89.9%               91.6%
50
Correlation of Net/Human Errors
• Like all good Cognitive Scientists, we like our models to make the same mistakes people do!
• Networks and humans have a 6x6 confusion matrix for the stimulus set.
• This suggests looking at the off-diagonal terms: The errors
• Correlation of off-diagonal terms: r = 0.567. [F (1,28) = 13.3; p = 0.0011]
• Again, this correlation is an emergent property of the model: It was not told which expressions were confusing.
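The off-diagonal analysis can be sketched as follows; the matrices here are random placeholders, not the actual human or network data:

```python
import numpy as np

# Sketch of the error-correlation analysis: take the two 6x6 confusion
# matrices, drop the diagonal (the correct responses), and correlate
# the remaining 30 off-diagonal error rates.

rng = np.random.default_rng(0)
human = rng.random((6, 6))        # placeholder human confusion matrix
net = rng.random((6, 6))          # placeholder network confusion matrix

mask = ~np.eye(6, dtype=bool)     # keep only the errors
r = np.corrcoef(human[mask], net[mask])[0, 1]
print(mask.sum(), round(r, 3))    # 30 off-diagonal terms
```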
51
Examining the Net’s Representations
• We want to visualize “receptive fields” in the network.
• But the Gabor magnitude representation is noninvertible.
• We can learn an approximate inverse mapping, however.
• We used linear regression to find the best linear combination of Gabor magnitude principal components for each image pixel.
• Then projecting each unit’s weight vector into image space with the same mapping visualizes its “receptive field.”
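That regression-based inverse can be sketched as follows; all the sizes and the random "images" below are illustrative stand-ins:

```python
import numpy as np

# Sketch of the receptive-field visualization trick: learn a linear
# map from the (noninvertible) feature codes back to pixel space by
# least squares, then push a unit's weight vector through that map.

rng = np.random.default_rng(0)
n_images, n_codes, n_pixels = 200, 50, 64 * 64
codes = rng.normal(size=(n_images, n_codes))     # PCA codes per image
pixels = rng.normal(size=(n_images, n_pixels))   # the original images

# One least-squares regression per pixel, solved for all pixels at
# once; a column of ones gives each pixel a "y-intercept" term.
A = np.hstack([codes, np.ones((n_images, 1))])
coef, *_ = np.linalg.lstsq(A, pixels, rcond=None)

# Project a unit's weight vector over the codes into image space:
w = rng.normal(size=n_codes)                     # hypothetical unit
rf_image = np.concatenate([w, [1.0]]) @ coef     # approximate RF picture
print(rf_image.shape)
```

The per-pixel intercepts collected in the last row of `coef` are exactly the "average face" that the next slide subtracts off.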
52
Examining the Net’s Representations
• The “y-intercept” coefficient for each pixel is simply the average pixel value at that location over all faces, so subtracting the resulting “average face” shows more precisely what the units attend to:
• Apparently local features appear in the global templates.
58
Modeling Six-Way Forced Choice
• Overall correlation r=.9416, with NO FIT PARAMETERS!
[Figure: human six-way forced choice vs. model, across Happiness, Fear, Sadness, Disgust, Anger, Surprise]
59
Model Discrimination Scores
• The model fits the data best at a precategorical layer: The layer we call the “object” layer; NOT at the category level
[Figure: percent correct discrimination across Happiness, Fear, Sadness, Disgust, Anger, Surprise - human data; model output layer (R=0.36); model object layer (R=0.61)]
62
Reaction Time: Human/Model
[Figure: human subjects’ reaction time vs. model reaction time (1 - max_output), across Happiness, Fear, Sadness, Disgust, Anger, Surprise]
Correlation between model & data: .6771, p<.001
63
Mix Detection in the Model
[Figure: “Mixed-In Expression Detection” - average rank score (-0.5 to 1.5) vs. percent mixing (10, 30, 50), for the mixed-in expression and unrelated expressions, in humans and networks]
Can the network account for the continuous data as well as the categorical data? YES.
64
Human/Model Circumplexes
• These are derived from similarities between images using non-metric multi-dimensional scaling.
• For humans: similarity is correlation between 6-way forced-choice button pushes.
• For networks: similarity is correlation between 6-category output vectors.
66
Discussion
• Our model of facial expression recognition:
• Performs the same task people do
• On the same stimuli
• At about the same accuracy
• Without actually “feeling” anything, and without any access to the surrounding culture, it nevertheless:
• Organizes the faces in the same order around the circumplex
• Correlates very highly with human responses
• Has about the same rank order of difficulty in classifying the emotions
69
The Visual Expertise Mystery
• Why would a face area process BMWs?
• Behavioral & brain data
• Model of expertise
• Results
70
Are you a perceptual expert?
Take the expertise test!!!**
“Identify this object with the first name that comes to mind.”
**Courtesy of Jim Tanaka, University of Victoria
71
“Car” - Not an expert
“2002 BMW Series 7” - Expert!
72
“Bird” or “Blue Bird” - Not an expert
“Indigo Bunting” - Expert!
73
“Face” or “Man” - Not an expert
“George Dubya” - Expert!
“Jerk” or “Megalomaniac” - Democrat!
74
How is an object to be named?
Bird Basic Level(Rosch et al., 1971)
Indigo Bunting Subordinate (Species) Level
Animal Superordinate Level
75
Entry Point Recognition
[Diagram: visual analysis leads to the entry point “Bird” (basic level); semantic analysis leads up to “Animal”; fine-grain visual analysis leads down to “Indigo Bunting.” The Downward Shift Hypothesis: with expertise, the entry point moves down toward the subordinate level]
76
Dog and Bird Expert Study
• Each expert had at least 10 years of experience in their respective domain of expertise.
• None of the participants were experts in both dogs and birds.
• Participants provided their own controls.
Tanaka & Taylor, 1991
77
Object Verification Task
[Example trials - respond YES or NO:
Superordinate: “Animal” vs. “Plant”
Basic: “Bird” vs. “Dog”
Subordinate: “Robin” vs. “Sparrow”]
78
[Chart: mean reaction time (msec), 600-900, at the Superordinate (Animal), Basic (Bird/Dog), and Subordinate (Robin/Beagle) levels; the expert domain shows a downward shift relative to the novice domain]
Dog and bird experts recognize objects in their domain of expertise at subordinate levels.
79
Is face recognition a general form of perceptual expertise?
George W. Bush
Indigo Bunting
2002 BMW Series 7
80
[Chart: mean reaction time (msec), 600-1200, at the Superordinate, Basic, and Subordinate levels, for faces vs. objects (Tanaka, 2001); faces show a downward shift]
Face experts recognize faces at the individual level of unique identity.
81
Event-related Potentials and Expertise
[Figures: the N170 ERP component for face experts (Bentin, Allison, Puce, Perez & McCarthy, 1996) and for object experts in their novice vs. expert domains (Tanaka & Curran, 2001; see also Gauthier, Curran, Curby & Collins, 2003, Nature Neuro.)]
82
Neuroimaging of face, bird and car experts
[Figures: fusiform gyrus activation in “face experts,” car experts (Cars-Objects contrast), and bird experts (Birds-Objects contrast); Gauthier et al., 2000]
83
How to identify an expert?
Behavioral benchmarks of expertise
• Downward shift in entry point recognition
• Improved discrimination of novel exemplars from learned and related categories
Neurological benchmarks of expertise
• Enhancement of the N170 ERP brain component
• Increased activation of the fusiform gyrus
84
End of Tanaka Slides
85
• Kanwisher showed the FFA is specialized for faces
• But she forgot to control for what???
86
Greeble Experts (Gauthier et al. 1999)
• Subjects trained over many hours to recognize individual Greebles.
• Activation of the FFA increased for Greebles as the training proceeded.
87
The “visual expertise mystery”
• If the so-called “Fusiform Face Area” (FFA) is specialized for face processing, then why would it also be used for cars, birds, dogs, or Greebles?
• Our view: the FFA is an area associated with a process: fine level discrimination of homogeneous categories.
• But the question remains: why would an area that presumably starts as a face area get recruited for these other visual tasks? Surely, they don’t share features, do they?
Sugimoto & Cottrell (2001), Proceedings of the Cognitive Science Society
88
Solving the mystery with models
• Main idea:
• There are multiple visual areas that could compete to be the Greeble expert - “basic” level areas and the “expert” (FFA) area.
• The expert area must use features that distinguish similar-looking inputs -- that’s what makes it an expert.
• Perhaps these features will be useful for other fine-level discrimination tasks.
• We will create:
• Basic level models - trained to identify an object’s class
• Expert level models - trained to identify individual objects
• Then we will put them in a race to become Greeble experts.
• Then we can deconstruct the winner to see why it won.
Sugimoto & Cottrell (2001), Proceedings of the Cognitive Science Society
89
Model Database
• A network that can differentiate faces, books, cups and cans is a “basic level network.”
• A network that can also differentiate individuals within ONE class (faces, cups, cans OR books) is an “expert.”
90
Model
• Pretrain two groups of neural networks on different tasks.
• Compare the abilities to learn a new individual Greeble classification task.
[Diagram: a shared hidden layer feeds two kinds of output layers. Non-experts: cup, book, can, face. Experts: also individuals - Bob, Carol, Ted. Both network types are then trained to add Greeble1, Greeble2, Greeble3 outputs]
91
Expertise begets expertise
• Learning to individuate cups, cans, books, or faces first leads to faster learning of Greebles (can’t try this with kids!!!).
• The more expertise, the faster the learning of the new task!
• Hence in a competition with the object area, the FFA would win.
• If our parents were cans, the FCA (Fusiform Can Area) would win.
[Chart: amount of training required to be a Greeble expert vs. training time on the first task]
92
Entry Level Shift: Subordinate RT decreases with training
(RT = uncertainty of response = 1.0 - max(output))
[Charts: human data and network data - subordinate RT drops below basic RT as the number of training sessions increases]
93
How do experts learn the task?
• Expert level networks must be sensitive to within-class variation:
• Representations must amplify small differences
• Basic level networks must ignore within-class variation:
• Representations should reduce differences
94
Observing hidden layer representations
• Principal Components Analysis on hidden unit activation:
• PCA of hidden unit activations allows us to reduce the dimensionality (to 2) and plot representations.
• We can then observe how tightly clustered stimuli are in a low-dimensional subspace.
• We expect basic level networks to separate classes, but not individuals.
• We expect expert networks to separate classes and individuals.
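This analysis can be sketched as follows; the synthetic activations below are stand-ins for the trained networks' hidden layers, and the class structure is built in by construction:

```python
import numpy as np

# Sketch of the hidden-layer analysis: project hidden unit activation
# vectors onto their first two principal components, then measure how
# tightly each class clusters in that 2-D subspace.

rng = np.random.default_rng(0)
# 4 classes x 10 stimuli, 20 hidden units; class means pulled apart
means = rng.normal(scale=5.0, size=(4, 20))
acts = np.vstack([m + rng.normal(size=(10, 20)) for m in means])
labels = np.repeat(np.arange(4), 10)

centered = acts - acts.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
xy = centered @ Vt[:2].T                 # 2-D plot coordinates

# Spread within each class in the 2-D subspace (small = tight cluster)
within = np.array([xy[labels == c].std(axis=0).mean() for c in range(4)])
print(xy.shape, within.shape)
```

On this measure, basic-level networks should show small between-class but large within-class spread, and expert networks small spread at both levels, which is what the plots on the next slides show.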
95
Subordinate level training magnifies small differences within object representations
[Plots: hidden-layer representations after 1, 80, and 1280 epochs, for face, basic, and greeble networks]
96
Greeble representations are spread out prior to Greeble training
[Plots: face, basic, and greeble network representations]
98
Examining the Net’s Representations
• We want to visualize “receptive fields” in the network.
• But the Gabor magnitude representation is noninvertible.
• We can learn an approximate inverse mapping, however.
• We used linear regression to find the best linear combination of Gabor magnitude principal components for each image pixel.
• Then projecting each hidden unit’s weight vector into image space with the same mapping visualizes its “receptive field.”
99
Two hidden unit receptive fields
[Images: hidden units 16 and 36, after training as a face expert and after further training on Greebles]
NOTE: These are not face-specific!
105
The Face and Object Processing System
[Diagram: Pixel (Retina) Level → Gabor Filtering (Perceptual/V1 Level) → PCA (Gestalt Level) → two hidden layers → Category Level. The expert-level classifier (FFA) outputs Book, Can, Cup, Face, Bob, Sue, Jane; the basic-level classifier (LOC) outputs Book, Can, Cup, Face]
106
Effects accounted for by this model
• Categorical effects in facial expression recognition (Dailey et al., 2002)
• Similarity effects in facial expression recognition (ibid.)
• Why fear is “hard” (ibid.)
• Rank of difficulty in configural, featural, and inverted face discrimination (Zhang & Cottrell, 2006)
• Holistic processing of identity and expression, and the lack of interaction between the two (Cottrell et al., 2000)
• How a specialized face processor may develop under realistic developmental constraints (Dailey & Cottrell, 1999)
• How the FFA could be recruited for other areas of expertise (Sugimoto & Cottrell, 2001; Joyce & Cottrell, 2004; Tran et al., 2004; Tong et al., 2005)
• The other-race advantage in visual search (Haque & Cottrell, 2005)
• Generalization gradients in expertise (Nguyen & Cottrell, 2005)
• Priming effects (face or character) on discrimination of ambiguous Chinese characters (McCleery et al., under review)
• Age of Acquisition effects in face recognition (Lake & Cottrell, 2005; L&C, under review)
• Memory for faces (Dailey et al., 1998, 1999)
• Cultural effects in facial expression recognition (Dailey et al., in preparation)
107
Backprop, 25 years later
• Backprop is important because it was the first relatively efficient method for learning internal representations
• Recent advances have made deeper networks possible
• This is important because we don’t know how the brain uses transformations to recognize objects across a wide array of variations (e.g., the Halle Berry neuron)
108
• E.g., the “Halle Berry” neuron…
109
END