An informal account of BackProp
For each pattern in the training set:
Compute the error at the output nodes
Compute Δw for each weight in the 2nd layer
Compute δ (the generalized error expression) for the hidden units
Compute Δw for each weight in the 1st layer
After amassing Δw for all the weights, change each weight a little bit, as determined by the learning rate:
Δwij = η δpi opj
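The steps above can be sketched as a minimal NumPy implementation — a sketch, not the course's code: the two-layer sigmoid net, per-pattern deltas, and apply-at-the-end batch update follow the slides, while the function and variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_epoch(patterns, targets, W1, W2, eta=0.5):
    """One pass over the training set: accumulate Delta-w for every
    weight, then change each weight a little, scaled by eta."""
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for x, t in zip(patterns, targets):
        h = sigmoid(W1 @ x)          # hidden activations y_j
        y = sigmoid(W2 @ h)          # output activations y_i
        # output-layer delta: (t_i - y_i) y_i (1 - y_i)
        delta_out = (t - y) * y * (1 - y)
        # hidden-layer delta: y_j (1 - y_j) * sum_i delta_i W_ij
        delta_hid = h * (1 - h) * (W2.T @ delta_out)
        dW2 += np.outer(delta_out, h)
        dW1 += np.outer(delta_hid, x)
    W1 += eta * dW1                  # apply all accumulated changes at once
    W2 += eta * dW2
    return W1, W2
```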
Backprop Details
Here we go…
Also refer to web notes for derivation
[Diagram: units k (input) → j (hidden) → i (output), with weights wjk and wij]
E = Error = ½ ∑i (ti – yi)2
yi
ti: target
The output layer (η: the learning rate)
ΔWij = –η ∂E/∂Wij
∂E/∂Wij = (∂E/∂yi)(∂yi/∂xi)(∂xi/∂Wij) = –(ti – yi) f′(xi) yj
so ΔWij = η (ti – yi) f′(xi) yj
The derivative of the sigmoid is just f′(xi) = yi(1 – yi), so
ΔWij = η (ti – yi) yi(1 – yi) yj = η δi yj, where δi = (ti – yi) yi(1 – yi)
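The output-layer rule ΔWij = η (ti – yi) yi(1 – yi) yj can be checked numerically. The sketch below (with made-up activations, weights, and a single output unit — none of these values come from the slides) compares the analytic expression against a central finite-difference estimate of –∂E/∂Wij:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative fixed hidden activations y_j, one output row of W, one target.
y_hidden = np.array([0.2, 0.7, 0.5])
W = np.array([[0.1, -0.3, 0.8]])
t = np.array([1.0])

def error(W):
    y = sigmoid(W @ y_hidden)
    return 0.5 * np.sum((t - y) ** 2)

y = sigmoid(W @ y_hidden)
# Analytic -dE/dW_ij = (t_i - y_i) y_i (1 - y_i) y_j
analytic = (t - y) * y * (1 - y) * y_hidden

# Numerical -dE/dW_ij by central differences
eps = 1e-6
numeric = np.zeros_like(y_hidden)
for j in range(3):
    Wp, Wm = W.copy(), W.copy()
    Wp[0, j] += eps
    Wm[0, j] -= eps
    numeric[j] = -(error(Wp) - error(Wm)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```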
The hidden layer
ΔWjk = –η ∂E/∂Wjk
∂E/∂Wjk = (∂E/∂yj)(∂yj/∂xj)(∂xj/∂Wjk)
∂E/∂yj = ∑i (∂E/∂yi)(∂yi/∂xi)(∂xi/∂yj) = –∑i (ti – yi) f′(xi) Wij
so ΔWjk = η [∑i (ti – yi) f′(xi) Wij] f′(xj) yk = η [∑i (ti – yi) yi(1 – yi) Wij] yj(1 – yj) yk
i.e. ΔWjk = η δj yk, where δj = yj(1 – yj) ∑i δi Wij
Momentum term
The speed of learning is governed by the learning rate. If the rate is low, convergence is slow; if the rate is too high, the error oscillates without reaching the minimum.
Momentum tends to smooth small weight error fluctuations.
Δwij(n+1) = η δj(n) yi(n) + α Δwij(n), with 0 < α < 1 (α: the momentum term)
The momentum accelerates the descent in steady downhill directions. The momentum has a stabilizing effect in directions that oscillate in time.
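Both effects can be seen on a toy error surface — the quadratic below and all its constants are illustrative, not from the slides. One direction is shallow (momentum accelerates it) and one is steep enough that plain gradient descent oscillates (momentum damps it):

```python
import numpy as np

def minimize(eta, alpha, steps=200):
    """Gradient descent with momentum on E(w) = 0.05*w0^2 + 5*w1^2:
    dw(n+1) = -eta*grad + alpha*dw(n). Returns final error."""
    w = np.array([4.0, 4.0])
    dw = np.zeros(2)
    for _ in range(steps):
        grad = np.array([0.1 * w[0], 10.0 * w[1]])  # gradient of E
        dw = -eta * grad + alpha * dw
        w = w + dw
    return 0.05 * w[0] ** 2 + 5 * w[1] ** 2

# With momentum (alpha=0.9) the same learning rate reaches a much
# lower error than without it (alpha=0).
assert minimize(eta=0.18, alpha=0.9) < minimize(eta=0.18, alpha=0.0)
```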
Convergence
May get stuck in local minima
Weights may diverge
…but works well in practice
Representation power:
2-layer networks: any continuous function
3-layer networks: any function
Local Minimum
Use a random component: simulated annealing
Overfitting and generalization
Too many hidden nodes tend to overfit
Overfitting in ANNs
Early Stopping (Important!!!)
Stop training when error goes up on validation set
Stopping criteria
Sensible stopping criteria: total mean squared error change:
Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
Generalization-based criterion: after each epoch the NN is tested for generalization. If the generalization performance is adequate, then stop. If this stopping criterion is used, the part of the training set used for testing the network's generalization must not be used for updating the weights.
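The generalization-based criterion amounts to an early-stopping loop. A minimal sketch — the `train_epoch` and `val_error` callables are hypothetical placeholders for one training pass and a validation-set evaluation:

```python
def train_with_early_stopping(train_epoch, val_error, weights,
                              patience=5, max_epochs=1000):
    """Train in epochs; stop when validation error has not improved
    for `patience` consecutive epochs; return the best weights seen."""
    best_err = float("inf")
    best_weights = weights
    bad_epochs = 0
    for epoch in range(max_epochs):
        weights = train_epoch(weights)   # hypothetical: one pass over training data
        err = val_error(weights)         # hypothetical: error on held-out set
        if err < best_err:
            best_err, best_weights, bad_epochs = err, weights, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                    # validation error went up: stop
    return best_weights, best_err
```

Keeping the best weights (rather than the last) matters: by the time the stopping rule fires, the network has already overfitted for `patience` epochs.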
Architectural Considerations
What is the right size network for a given job?
How many hidden units?
Too many: no generalization
Too few: no solution
Possible answer: Constructive algorithm, e.g.
Cascade Correlation (Fahlman, & Lebiere 1990)
etc
The number of layers and of neurons depend on the specific task. In practice this issue is solved by trial and error.
Two types of adaptive algorithms can be used:
•start from a large network and successively remove some nodes and links until network performance degrades
•begin with a small network and introduce new neurons until performance is satisfactory
Network Topology
Problems and Networks
•Some problems have natural "good" solutions
•Solving a problem may be possible by providing the right armory of general-purpose tools, and recruiting them as needed
•Networks are general purpose tools.
•Choice of network type, training regime, architecture, etc. greatly influences the chances of successfully solving a problem
•Tension: tailoring tools for a specific job vs. exploiting a general-purpose learning mechanism
Summary
Multiple-layer feed-forward networks
Replace step with sigmoid (differentiable) function
Learn weights by gradient descent on the error function
Backpropagation algorithm for learning
Avoid overfitting by early stopping
ALVINN drives 70mph on highways
Use MLP Neural Networks when …
(vectored) Real inputs, (vectored) real outputs
You’re not interested in understanding how it works
Long training times are acceptable
Short execution (prediction) times are required
Robust to noise in the dataset
Applications of FFNN
Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning problems:
•recognizing printed or handwritten characters
•face recognition
•classification of loan applications into credit-worthy and non-credit-worthy groups
•analysis of sonar and radar to determine the nature of the source of a signal
Regression and forecasting: FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).
Extensions of Backprop Nets
Recurrent Architectures Backprop through time
Elman Nets & Jordan Nets
Updating the context as we receive input
• In Jordan nets we model “forgetting” as well
• The recurrent connections have fixed weights
• You can train these networks using good ol’ backprop
[Diagrams: two recurrent architectures, each with Output, Hidden, Context, and Input layers; the recurrent copy-back connections have fixed weight 1, with a decay term α in the Jordan net]
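A sketch of an Elman-style forward pass, assuming the copy-back connections from hidden to context are fixed (a 1:1 copy) while the context-to-hidden weights are ordinary trainable weights; the weight names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_forward(inputs, W_in, W_ctx, W_out):
    """Process a sequence: at each step the context units hold a copy
    of the previous hidden state, updated as we receive input."""
    n_hidden = W_ctx.shape[0]
    context = np.zeros(n_hidden)          # context starts empty
    outputs = []
    for x in inputs:
        hidden = sigmoid(W_in @ x + W_ctx @ context)
        outputs.append(sigmoid(W_out @ hidden))
        context = hidden.copy()           # fixed-weight copy-back
    return outputs
```

Because the context carries history, the same input presented at different points in the sequence generally produces different outputs.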
Recurrent Backprop
• We’ll pretend to step through the network one iteration at a time
• backprop as usual, but average equivalent weights (e.g. all 3 highlighted edges on the right are equivalent)
[Figure: a recurrent network with units a, b, c and weights w1–w4, unrolled for 3 iterations; the copies of each weight across iterations are equivalent]
Connectionist Models in Cognitive Science
[Figure: a spectrum of model types — Structured, PDP (Elman), Hybrid — ranging from Neural through Conceptual and Existence to Data Fitting]
5 levels of Neural Theory of Language (in increasing abstraction):
Biology → Computational Neurobiology → Structured Connectionism → Computation → Cognition and Language
[Figure labels along the abstraction axis: Neural Development, Triangle Nodes, Neural Net and learning, Spatial Relation, Motor Control, Metaphor, SHRUTI, Grammar, Psycholinguistic experiments]
The Color Story: A Bridge between Levels of NTL
(http://www.ritsumei.ac.jp/~akitaoka/color-e.html)
A Tour of the Visual System
• two regions of interest:
– retina
– LGN
The Physics of Light
Light: electromagnetic energy whose wavelength is between 400 nm and 700 nm. (1 nm = 10⁻⁹ meter)
[Figure: the electromagnetic spectrum, from cosmic rays (~10⁻¹⁴ m) through gamma rays, X-rays, UV, infra-red, microwaves, TV, and radio (~10⁶ m), with the visible spectrum spanning 400–700 nm]
© Stephen E. Palmer, 2002
The Physics of Light
Some examples of the spectra of light sources
[Figure: # Photons vs. Wavelength (nm), 400–700, for four sources — A. Ruby Laser, B. Gallium Phosphide Crystal, C. Tungsten Lightbulb, D. Normal Daylight]
© Stephen E. Palmer, 2002
The Physics of Light
Some examples of the reflectance spectra of surfaces
[Figure: % Photons Reflected vs. Wavelength (nm), 400–700, for four surfaces — Red, Yellow, Blue, Purple]
© Stephen E. Palmer, 2002
The Psychophysical Correspondence
There is no simple functional description for the perceived color of all lights under all viewing conditions, but…
A helpful constraint: consider only physical spectra with normal distributions.
[Figure: a normally distributed spectrum of # Photons over wavelength 400–700 nm, characterized by its mean, variance, and area]
© Stephen E. Palmer, 2002
Physiology of Color Vision
© Stephen E. Palmer, 2002
Two types of light-sensitive receptors:
Cones: cone-shaped, less sensitive, operate in high light, color vision
Rods: rod-shaped, highly sensitive, operate at night, gray-scale vision
cone
rod
The Microscopic View
http://www.iit.edu/~npr/DrJennifer/visual/retina.html
Rods and Cones in the Retina
What Rods and Cones Detect
Notice that they aren’t distributed evenly, and that rods are more sensitive to shorter wavelengths.
© Stephen E. Palmer, 2002
[Figure: relative absorbance (%) vs. wavelength (nm) for the three cone types — S peaking near 440 nm, M near 530 nm, L near 560 nm]
Three kinds of cones: Absorption spectra
Implementation of Trichromatic theory
Physiology of Color Vision
Opponent Processes: R/G = L – M, G/R = M – L, B/Y = S – (M + L), Y/B = (M + L) – S
© Stephen E. Palmer, 2002
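The opponent-process definitions above are simple linear combinations of cone responses, so they are easy to compute directly. In the sketch below the L, M, S values are made-up illustrative cone activations, not measured data:

```python
def opponent_channels(L, M, S):
    """Opponent channels from cone responses, per the slide's definitions."""
    return {
        "R/G": L - M,
        "G/R": M - L,
        "B/Y": S - (M + L),
        "Y/B": (M + L) - S,
    }

# A long-wavelength ("reddish") light excites L more than M and S:
# the R/G channel goes positive, the B/Y channel goes negative.
ch = opponent_channels(L=0.9, M=0.4, S=0.1)
assert ch["R/G"] > 0 and ch["B/Y"] < 0
```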
Double Opponent Cells in V1
Physiology of Color Vision
[Figure: double-opponent cells — Red/Green cells with R+G– and G+R– receptive-field regions; Blue/Yellow cells with B+Y– and Y+B– regions]
Color Blindness
Not everybody perceives colors in the same way!
What numbers do you see in these displays?
© Stephen E. Palmer, 2002
Theories of Color Vision
A Dual Process Wiring Diagram
© Stephen E. Palmer, 2002
[Figure: wiring diagram from the S, M, L cones (Trichromatic Stage) to the opponent channels (Opponent Process Stage): R+/G– = L – M, B+/Y– = S – (M + L), W+/Bk– = S + M + L, and their complements]
Color Naming
© Stephen E. Palmer, 2002
Basic Color Terms (Berlin & Kay)
Criteria:
1. Single words -- not “light-blue” or “blue-green”
2. Frequently used -- not “mauve” or “cyan”
3. Refer primarily to colors -- not “lime” or “gold”
4. Apply to any object -- not “roan” or “blond”
Color Naming
© Stephen E. Palmer, 2002
BCTs in English
Red, Green, Blue, Yellow, Black, White, Gray, Brown, Purple, Orange*, Pink
Color Naming
© Stephen E. Palmer, 2002
Five more BCTs in a study of 98 languages
Light-Blue, Warm, Cool, Light-Warm, Dark-Cool
The WCS Color Chips
• Basic color terms:– Single word (not blue-green)
– Frequently used (not mauve)
– Refers primarily to colors (not lime)
– Applies to any object (not blonde)
FYI:
English has 11 basic color terms
Results of Kay’s Color Study
If you group languages by the number of basic color terms they have, then as the number of terms increases, the additional terms consistently name focal colors.
Stage I: W-or-R-or-Y, Bk-or-G-or-Bu
Stage II: W, R-or-Y, Bk-or-G-or-Bu
Stage IIIa: W, R-or-Y, G-or-Bu, Bk
Stage IIIb: W, R, Y, Bk-or-G-or-Bu
Stage IV: W, R, Y, G-or-Bu, Bk
Stage V: W, R, Y, G, Bu, Bk
Stage VI: Stage V + Y+Bk (Brown)
Stage VII: Stage VI + R+W (Pink), R+Bu (Purple), R+Y (Orange), B+W (Grey)
Color Naming
© Stephen E. Palmer, 2002
Typical “developmental” sequence of BCTs:
2 Terms: Light-warm, Dark-cool
3 Terms: White, Warm, Dark-cool
4 Terms: White, Warm, Black, Cool
5 Terms: White, Red, Yellow, Black, Cool
6 Terms: White, Red, Yellow, Black, Green, Blue
Color Naming
© Stephen E. Palmer, 2002
Berlin & Kay studied color categories in two ways: by their boundaries and by their best examples.
Color Naming
© Stephen E. Palmer, 2002
MEMORY : Focal colors are remembered better than nonfocal colors.
LEARNING: New color categories centered on focal colors are learned faster.
CATEGORIZATION: Focal colors are categorized more quickly than nonfocal colors.
(Rosch)
Color Naming
FUZZY SETS AND FUZZY LOGIC (Zadeh)
[Figure: degree of membership (0 to 1.0) in “Green” as a function of hue, labeled from not-at-all through a little bit, sorta, and very to extremely]
Fuzzy set theory (Zadeh)
A fuzzy logical model of color naming (Kay & McDaniel)
© Stephen E. Palmer, 2002
Color Naming
© Stephen E. Palmer, 2002
[Figure: degree of membership (0 to 1) vs. hue for the “primary” color categories Blue, Green, Yellow, and Red, each peaking at its focal hue — focal blue, focal green, focal yellow, focal red]
“Primary” color categories
Color Naming
© Stephen E. Palmer, 2002
“Primary” color categories
Red, Green, Blue, Yellow, Black, White
Color Naming
© Stephen E. Palmer, 2002
“Derived” color categories
[Figure: Yellow and Red membership curves over hue, combined by the fuzzy-logical “ANDf” into an Orange membership curve peaking between them]
Color Naming
© Stephen E. Palmer, 2002
“Derived” color categories
Orange = Red ANDf Yellow
Purple = Red ANDf Blue
Gray = Black ANDf White
Pink = Red ANDf White
Brown = Yellow ANDf Black
(Goluboi = Blue ANDf White)
Color Naming
© Stephen E. Palmer, 2002
“Composite” color categories
[Figure: Yellow and Red membership curves over hue, combined by the fuzzy-logical “ORf” into a Warm membership curve covering both]
Warm = Red ORf Yellow
Cool = Blue ORf Green
Light-warm = White ORf Warm
Dark-cool = Black ORf Cool
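These combinations can be sketched with Zadeh's original fuzzy operators — min for AND, max for OR. (Kay & McDaniel's actual ANDf is more elaborate than min, and the Gaussian membership curves and focal-hue values below are illustrative assumptions, not measured data.)

```python
import math

def membership(hue, focal, width=40.0):
    """Illustrative Gaussian degree-of-membership curve around a focal hue."""
    return math.exp(-((hue - focal) / width) ** 2)

def red(h):    return membership(h, 0.0)    # assumed focal red at hue 0
def yellow(h): return membership(h, 60.0)   # assumed focal yellow at hue 60

def fuzzy_and(a, b): return min(a, b)   # Zadeh's fuzzy AND
def fuzzy_or(a, b):  return max(a, b)   # Zadeh's fuzzy OR

def orange(h): return fuzzy_and(red(h), yellow(h))   # derived category
def warm(h):   return fuzzy_or(red(h), yellow(h))    # composite category

# Orange peaks between the red and yellow foci; Warm covers both foci.
assert orange(30) > orange(0) and orange(30) > orange(60)
assert warm(0) > 0.9 and warm(60) > 0.9
```

The derived category (AND) is a narrow peak between the two primaries, while the composite category (OR) spans them — matching the figures above.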
Color Naming
FUZZY LOGICAL MODEL OF COLOR NAMING (Kay & McDaniel)
Only 16 Basic Color Terms in Hundreds of Languages:
PRIMARY: Red, Green, Blue, Yellow, Black, White
DERIVED: Orange, Purple, Brown, Pink, Gray
COMPOSITE: [Warm], [Cool], [Light-warm], [Dark-cool], [Light-blue]
[Figure: degree-of-membership curves (0 to 1.0) for Yellow (fuzzy sets), Orange = Yellow ANDf Red (fuzzy ANDf), and Warm = Yellow ORf Red (fuzzy ORf)]
© Stephen E. Palmer, 2002