Chapter 11 – Neural Networks
COMP 540
4/17/2007
Derek Singer
Motivation
• Nonlinear functions of linear combinations of inputs can accurately estimate a wide variety of functions
• Example: the product X1X2 = [ (X1 + X2)^2 − (X1 − X2)^2 ] / 4 is exactly a sum of nonlinear functions of linear combinations of the inputs
Projection Pursuit Regression
f(X) = Σ_{m=1}^{M} g_m(w_m^T X)
• An additive model that uses weighted sums of inputs rather than X
• g,w are estimated using a flexible smoothing method
• gm is a ridge function in Rp
• Vm = (wm)TX is projection of X onto unit vector wm
• Pursuing wm that fits model well
• If M arbitrarily large, can approximate any continuous function in Rp arbitrarily well (universal approximator)
• As M increases, interpretability decreases
• PPR useful for prediction
• M = 1: Single index model easy to interpret and slightly more general than linear regression
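The PPR prediction f(X) = Σ_m g_m(w_m^T X) can be sketched in a few lines of NumPy. The directions w_m and ridge functions g_m below are toy assumptions chosen for illustration, not fitted values:

```python
import numpy as np

# Sketch of the PPR forward pass f(X) = sum_m g_m(w_m^T X), assuming the
# ridge functions g_m have already been estimated (toy functions here).
def ppr_predict(X, ws, gs):
    """X: (N, p) inputs; ws: list of unit vectors w_m; gs: list of g_m."""
    return sum(g(X @ w) for w, g in zip(ws, gs))

# Toy example with M = 2 known ridge functions.
ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
gs = [np.square, np.sin]
X = np.array([[2.0, 0.0], [0.0, 0.0]])
print(ppr_predict(X, ws, gs))  # [4. 0.]
```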
Fitting PPR Model
min_{w,g} Σ_{i=1}^{N} [ y_i − Σ_{m=1}^{M} g_m(w_m^T x_i) ]^2
• To estimate g, given w, consider M=1 model
• With derived variables vi = wTxi , becomes a 1-D smoothing problem
• Any scatterplot smoother (e.g. smoothing spline) can be used
• Complexity constraints on g must be made to prevent overfitting
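The step above, fitting g with w held fixed, can be sketched with a smoothing spline on the derived variable v_i = w^T x_i. The data and the spline's `s` smoothing parameter below are illustrative assumptions; any scatterplot smoother would do in its place:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Sketch: with w fixed, fitting g reduces to 1-D smoothing of y on the
# derived variable v_i = w^T x_i.  UnivariateSpline stands in for any
# scatterplot smoother; its s= parameter is the complexity constraint.
rng = np.random.default_rng(0)
w = np.array([0.6, 0.8])                      # fixed unit vector
X = rng.normal(size=(200, 2))
v = X @ w                                     # derived variable
y = np.sin(v) + 0.1 * rng.normal(size=200)    # noisy ridge response

order = np.argsort(v)                         # spline wants sorted abscissae
g_hat = UnivariateSpline(v[order], y[order], s=len(v) * 0.01)
print(float(np.mean((g_hat(v) - np.sin(v)) ** 2)))  # small residual MSE
```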
Fitting PPR Model
g(w_new^T x_i) ≈ g(w_old^T x_i) + g'(w_old^T x_i)(w_new − w_old)^T x_i
• To estimate w, given g
• Want to minimize squared error using Gauss-Newton search (second derivative of g is discarded)
Σ_{i=1}^{N} [ y_i − g(w_old^T x_i) − g'(w_old^T x_i)(w − w_old)^T x_i ]^2
 = Σ_{i=1}^{N} g'(w_old^T x_i)^2 [ ( w_old^T x_i + (y_i − g(w_old^T x_i)) / g'(w_old^T x_i) ) − w^T x_i ]^2
• Use weighted LS regression on x_i to find w_new
Target: w_old^T x_i + (y_i − g(w_old^T x_i)) / g'(w_old^T x_i)
Weights: g'(w_old^T x_i)^2
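One Gauss-Newton step for w with g fixed amounts to a weighted least-squares solve. A minimal sketch, where the choice of g (identity, so the step recovers the true direction exactly) and the toy data are assumptions for illustration:

```python
import numpy as np

# One Gauss-Newton step for w with g fixed, using the target and weights
# of the weighted least-squares form.  g and its derivative are assumed known.
def update_w(X, y, w_old, g, g_prime):
    v = X @ w_old
    weights = g_prime(v) ** 2
    target = v + (y - g(v)) / g_prime(v)
    # Weighted least squares: solve (X^T W X) w = X^T W target.
    Xw = X * weights[:, None]
    w_new = np.linalg.solve(X.T @ Xw, Xw.T @ target)
    return w_new / np.linalg.norm(w_new)      # re-normalize to a unit vector

# With g the identity, one step recovers the true direction.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([3.0, 4.0, 0.0]) / 5.0
y = X @ w_true
w_hat = update_w(X, y, np.array([1.0, 0.0, 0.0]),
                 lambda v: v, lambda v: np.ones_like(v))
print(np.round(w_hat, 6))   # recovers w_true = [0.6, 0.8, 0.0]
```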
• The added (w, g) pair compensates for the error of the current set of pairs
• g and w are estimated iteratively until convergence
• For M > 1, the model is built in a forward stage-wise manner, adding a (g, w) pair at each stage
• Differentiable smoothing methods are preferable; local regression and smoothing splines are convenient
• gm's from previous steps can be readjusted with backfitting; it is unclear how this affects performance
• wm's are usually not readjusted, but could be
• M is usually determined by the forward stage-wise builder; cross-validation can also be used
• Its computational demands made PPR unpopular
From PPR to Neural Networks
• PPR: Each ridge function is different
• NN: Each node has the same activation/transfer function
• PPR: Optimizing each ridge function separately (as in additive models)
• NN: Optimizing all of the nodes at each training step
Neural Networks
• Specifically feed-forward, back-propagation networks
• Inputs fed forward
• Errors propagated backward
• Made of layers of Processing Elements (PEs, aka perceptrons)
• Each PE represents the function g(wTx), g is a transfer function
• g fixed, unlike PPR
• Output layer of D PEs, so y ∈ R^D
• Hidden layers, in which outputs not directly observed, are optional.
• NN uses parametric functions unlike PPR
• Common ones include:
– Threshold: f(v) = 1 if v > c, else −1
– Sigmoid: f(v) = 1 / (1 + e^−v), range [0, 1]
– Tanh: f(v) = (e^v − e^−v) / (e^v + e^−v), range [−1, 1]
• Desirable properties:
– Monotonic, nonlinear, bounded
– Easily calculated derivative
– Largest change at intermediate values
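The three transfer functions above, plus the sigmoid's easily calculated derivative, in a short NumPy sketch (the cutoff c is a free parameter):

```python
import numpy as np

# The three transfer functions listed above; c is the threshold cutoff.
def threshold(v, c=0.0):
    return np.where(v > c, 1.0, -1.0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))           # range (0, 1)

def tanh(v):
    return np.tanh(v)                          # range (-1, 1)

# Easily calculated derivative, largest at intermediate values:
def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)                       # peaks at v = 0 (value 0.25)

print(sigmoid(0.0), tanh(0.0), threshold(2.0), sigmoid_prime(0.0))
# 0.5 0.0 1.0 0.25
```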
Transfer functions
• Must scale inputs so weighted sums will fall in transition region, not saturation region (upper/lower bounds).
• Must scale outputs so range falls within range of transfer function
[Plots of the sigmoid, hyperbolic tangent, and threshold transfer functions]
• How many hidden layers and PEs in each layer?
• Adding nodes and layers adds complexity to model
– Beware overfitting
– Beware of extra computational demands
• A 3-layer network with a nonlinear transfer function can approximate any function mapping arbitrarily well
Hidden layers
Back Propagation
Minimizing the squared error function
R_i = Σ_{k=1}^{K} ( y_ik − f_k(x_i) )^2
With R = Σ_{i=1}^{epochSize} R_i and residual r_ki = y_ik − g(T_k^T z_i), the gradient for a hidden-to-output weight T_km is:

∂R/∂T_km = Σ_{i=1}^{epochSize} ∂R_i/∂T_km
 = Σ_{i=1}^{epochSize} ∂(r_ki)^2 / ∂T_km
 = Σ_{i=1}^{epochSize} ( 2 r_ki · (−1) · g'(T_k^T z_i) ) · z_mi
 = −2 Σ_{i=1}^{epochSize} ( y_ik − g(T_k^T z_i) ) · g'(T_k^T z_i) · z_mi
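The output-layer gradient can be checked numerically against finite differences. The toy dimensions and random data below are assumptions; g is taken as the sigmoid, and a single training case is used:

```python
import numpy as np

# Numerical check of the output-layer gradient, for a single training
# case with sigmoid g, K outputs and M hidden-layer values z.
rng = np.random.default_rng(2)
K, M = 3, 4
T = rng.normal(size=(K, M))        # hidden-to-output weights T_km
z = rng.normal(size=M)             # hidden-layer outputs z_m
y = rng.normal(size=K)

g = lambda v: 1.0 / (1.0 + np.exp(-v))

def R(T):
    return np.sum((y - g(T @ z)) ** 2)

# Analytic gradient: dR/dT_km = -2 (y_k - g(T_k^T z)) g'(T_k^T z) z_m
v = T @ z
analytic = -2.0 * ((y - g(v)) * g(v) * (1.0 - g(v)))[:, None] * z[None, :]

# Finite-difference gradient for comparison.
eps, numeric = 1e-6, np.zeros_like(T)
for k in range(K):
    for m in range(M):
        Tp = T.copy(); Tp[k, m] += eps
        numeric[k, m] = (R(Tp) - R(T)) / eps
print(np.max(np.abs(analytic - numeric)))   # ~1e-6 or smaller
```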
Back Propagation
For an input-to-hidden weight w_ml, with s_mi = w_m^T x_i and z_mi = σ(s_mi):

∂R/∂w_ml = Σ_{i=1}^{epochSize} ∂R_i/∂w_ml
 = Σ_{i=1}^{epochSize} Σ_{k=1}^{K} ∂(r_ki)^2 / ∂w_ml
 = Σ_{i=1}^{epochSize} Σ_{k=1}^{K} ( 2 r_ki · (−1) · g'(T_k^T z_i) · T_km ) · σ'(s_mi) · x_il
 = −2 Σ_{i=1}^{epochSize} Σ_{k=1}^{K} r_ki · g'(T_k^T z_i) · T_km · σ'(s_mi) · x_il
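The hidden-layer gradient can be checked the same way. Again the dimensions and data are toy assumptions, with σ and g both sigmoid and a single training case:

```python
import numpy as np

# Numerical check of the hidden-layer gradient: z_m = sigma(w_m^T x),
# outputs g(T_k^T z), squared error summed over k, one training case.
rng = np.random.default_rng(3)
K, M, p = 2, 3, 4
T = rng.normal(size=(K, M))        # hidden-to-output weights
W = rng.normal(size=(M, p))        # input-to-hidden weights w_ml
x = rng.normal(size=p)
y = rng.normal(size=K)

sig = lambda v: 1.0 / (1.0 + np.exp(-v))

def R(W):
    z = sig(W @ x)
    return np.sum((y - sig(T @ z)) ** 2)

# Analytic: dR/dw_ml = -2 sum_k r_k g'(T_k^T z) T_km sigma'(s_m) x_l
z = sig(W @ x)
u = T @ z
delta_out = (y - sig(u)) * sig(u) * (1.0 - sig(u))     # one term per output k
delta_hid = (T.T @ delta_out) * z * (1.0 - z)          # one term per hidden m
analytic = -2.0 * delta_hid[:, None] * x[None, :]

eps, numeric = 1e-6, np.zeros_like(W)
for m in range(M):
    for l in range(p):
        Wp = W.copy(); Wp[m, l] += eps
        numeric[m, l] = (R(Wp) - R(W)) / eps
print(np.max(np.abs(analytic - numeric)))   # ~1e-6 or smaller
```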
Learning parameters
• Error function contains many local minima
• If the learning rate is too large, search might jump over local minima (results in a spiky error curve)
• Learning rates, Separate rates for each layer?
• Momentum
• Epoch size, # of epochs
• Initial weights
• Final solution depends on starting weights
• Want weighted sums of inputs to fall in transition region
• Small, random weights centered around zero work well
Δw_ij(t) = −η ∂R/∂w_ij(t) + α Δw_ij(t−1),  α ∈ (0, 1)
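A minimal sketch of a momentum update, where the new step mixes the current gradient with the previous step. The learning rate, momentum value, and the 1-D quadratic objective are illustrative assumptions:

```python
import numpy as np

# Momentum: the new step combines the current gradient with the
# previous step, scaled by alpha in (0, 1).
def momentum_step(w, grad, prev_step, eta=0.1, alpha=0.5):
    step = -eta * grad + alpha * prev_step
    return w + step, step

# On the quadratic f(w) = w^2 (gradient 2w), momentum accumulates
# speed along a consistently signed gradient and converges to 0.
w, prev = np.array([5.0]), np.array([0.0])
for _ in range(50):
    w, prev = momentum_step(w, grad=2.0 * w, prev_step=prev)
print(abs(float(w[0])) < 1e-3)   # True: converged near the minimum
```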
Overfitting and weight decay
• If the network is too complex, overfitting is likely (very large weights)
• Could stop training before training error minimized
• Could use a validation set to determine when to stop
• Weight decay more explicit, analogous to ridge regression
• Penalizes large weights
Minimize R(θ) + λ J(θ), where J(θ) = Σ_{km} T_km^2 + Σ_{ml} w_ml^2 and λ ≥ 0 controls the strength of the penalty
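Since the penalty λ Σ w^2 differentiates to 2λw per weight, weight decay simply adds a shrink-toward-zero term to each gradient. A tiny sketch (λ and the weights are toy values):

```python
import numpy as np

# Weight decay adds lambda * sum(w^2) to the error, so each weight's
# gradient gains a 2 * lambda * w term that pulls it toward zero.
def decayed_grad(grad, w, lam=0.01):
    return grad + 2.0 * lam * w

w = np.array([3.0, -2.0])
print(decayed_grad(np.zeros(2), w))   # shrink term alone: pulls toward zero
```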
Other issues
• Neural network training is O(NpML)
– N observations, p predictors, M hidden units, L training epochs
– Epoch sizes also have a linear effect on computation time
– Can take minutes or hours to train
• Long list of parameters to adjust
• Training is a random walk in a vast space
• Unclear when to stop training, can have large impact on performance on test set
• Can avoid guesswork in estimating # of hidden nodes needed with cascade correlation (analogous to PPR)
Cascade Correlation
• Automatically finds optimal network structure
• Start with a network with no hidden PEs
• Grow it by one hidden PE at a time
• PPR adds ridge functions that model the current error in the system
• Cascade correlation adds PEs that model the current error in the system
• Train the initial network using any algorithm until error converges or falls under a specified bound
• Take one hidden PE and connect all inputs and all currently existing PEs to it
Cascade Correlation
Cascade Correlation
• Train hidden PE to maximally correlate with current network error
• Gradient ascent rather than descent
S = Σ_{o=1}^{O} | Σ_{p=1}^{P} (z_p − <z>)(E_po − <E_o>) |

O = # of output units, P = # of training patterns
z_p is the new hidden PE's current output for pattern p
J = # inputs and hidden PEs connected to new hidden PE
E_po is the residual error on pattern p observed at output unit o
<E_o>, <z> are the means of E_o and z over all training patterns
σ_o is the sign of the term inside the abs. val. brackets
z_p = f( Σ_{j=1}^{J} w_j x_pj )
∂S/∂w_j = Σ_{o=1}^{O} Σ_{p=1}^{P} σ_o (E_po − <E_o>) f'_p x_pj
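Candidate training can be sketched as gradient ascent on S using the formula above. The toy inputs, residual errors, step size, and iteration count below are illustrative assumptions, with f taken to be the sigmoid:

```python
import numpy as np

# Gradient ascent on S for one candidate hidden PE (toy data, sigmoid f).
rng = np.random.default_rng(4)
P, J, O = 50, 3, 2
X = rng.normal(size=(P, J))            # inputs (and frozen-PE outputs) per pattern
E = rng.normal(size=(P, O))            # residual errors E_po
f = lambda v: 1.0 / (1.0 + np.exp(-v))

def S(w):
    z = f(X @ w)
    return np.sum(np.abs((z - z.mean()) @ (E - E.mean(axis=0))))

def grad_S(w):
    z = f(X @ w)
    corr = (z - z.mean()) @ (E - E.mean(axis=0))   # correlation per output o
    sigma = np.sign(corr)                          # sign inside abs. val.
    fp = z * (1.0 - z)                             # f'_p for sigmoid
    # dS/dw_j = sum_{o,p} sigma_o (E_po - <E_o>) f'_p x_pj
    return (((E - E.mean(axis=0)) @ sigma) * fp) @ X

w0 = 0.1 * rng.normal(size=J)          # small random starting weights
w = w0.copy()
for _ in range(100):
    w = w + 0.01 * grad_S(w)           # ascent, not descent
print(S(w) > S(w0))                    # trained candidate correlates more
```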
• Freeze the weights of the inputs and pre-existing hidden PEs to the new hidden PE
• Weights between inputs/hidden PEs and output PEs still live
• Repeat cycle of training with modified network until error converges or falls below specified bound
Cascade Correlation
[Figure legend: box = weight frozen, X = weight live]
• No need to guess architecture
• Each hidden PE sees distinct problem and learns solution quickly
• Hidden PEs in backprop networks “engage in complex dance” (Fahlman)
• Trains fewer PEs at each epoch
• Can cache output of hidden PEs once weights frozen
“The Cascade-Correlation Learning Architecture”
Scott E. Fahlman and Christian Lebiere