Today’s Topics
• Read: Chapters 7, 8, and 9 on Logical Representation and Reasoning
• HW3 due at 11:55pm THURS (ditto for your Nannon Tourney Entry)
• Recipe for Using Backprop to Train an ANN
• Adjusting the Learning Rate (η)
• The Momentum Term (β)
• Reducing Overfitting in ANNs
  – Early Stopping
  – Weight Decay
• Understanding Hidden Units
• Choosing the Number of Hidden Units
• ANNs as Universal Approximators
• Learning What an ANN has Learned
Using BP to Train ANNs

1. Initialize weights & biases to small random values (eg, in [-0.3, 0.3])

2. Randomize the order of the training examples; for each one do:

   a) Propagate activity forward to the output units:

      out_i = F( Σ_j w_{i,j} × out_j )

      (layer labels, from input to output: k → j → i)
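A minimal sketch of step 2a in Python (the sigmoid choice of F, the layer sizes, and the weight values are illustrative assumptions, not from the slides):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(weights, biases, inputs):
    """One layer of the forward pass: out_i = F(bias_i + sum_j w[i][j] * out_j)."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

# Tiny 2-input, 2-hidden, 1-output net with made-up weights
hidden = forward([[0.1, -0.2], [0.25, 0.3]], [0.05, -0.1], [1.0, 0.0])
output = forward([[0.2, -0.15]], [0.0], hidden)
```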
Using BP to Train ANNs (continued)

   b) Compute the ‘deviation’ for the output units:

      δ_i = F′( net_i ) × ( Teacher_i − out_i )

   c) Compute the ‘deviation’ for the hidden units:

      δ_j = F′( net_j ) × Σ_i ( w_{i,j} × δ_i )

   d) Update the weights:

      Δw_{i,j} = η × δ_i × out_j
      Δw_{j,k} = η × δ_j × out_k
Aside: the book (Fig 18.24) uses Δ instead of δ, g instead of F, and a instead of out
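Putting steps (b)–(d) together, a minimal Python sketch (sigmoid units assumed, so F′(net) = out × (1 − out); η and the list-of-lists weight layout are illustrative):

```python
def backprop_step(w_out, w_hid, inputs, hidden, outputs, teacher, eta=0.1):
    # b) deviation for output units: delta_i = F'(net_i) * (teacher_i - out_i)
    delta_out = [o * (1 - o) * (t - o) for o, t in zip(outputs, teacher)]
    # c) deviation for hidden units: delta_j = F'(net_j) * sum_i(w[i][j] * delta_i)
    delta_hid = [h * (1 - h) * sum(w_out[i][j] * delta_out[i]
                                   for i in range(len(delta_out)))
                 for j, h in enumerate(hidden)]
    # d) weight updates: Delta w = eta * delta * out
    for i, d_i in enumerate(delta_out):
        for j, h in enumerate(hidden):
            w_out[i][j] += eta * d_i * h
    for j, d_j in enumerate(delta_hid):
        for k, x in enumerate(inputs):
            w_hid[j][k] += eta * d_j * x
```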
Using BP to Train ANNs (concluded)

3. Repeat until the training-set error rate is small enough

   Actually, one should use early stopping (ie, minimize error on the tuning set; more details later)

   Some jargon: each cycle through all the training examples is called an epoch

4. Measure accuracy on the test set to estimate generalization (future accuracy)
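The whole recipe as a sketch (backprop_example, error_rate, and the data-set objects are assumed helpers standing in for the pieces above):

```python
import random

def train(net, train_set, test_set, backprop_example, error_rate,
          target_err=0.05, max_epochs=500):
    for epoch in range(max_epochs):           # each full pass is one epoch
        random.shuffle(train_set)             # step 2: randomize the order
        for example in train_set:
            backprop_example(net, example)    # steps 2a-2d
        if error_rate(net, train_set) < target_err:
            break                             # step 3 (early stopping is better)
    return error_rate(net, test_set)          # step 4: estimate generalization
```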
The Need for Symmetry Breaking (if hidden units)

Assume all weights are initially the same (the drawing below is a bit more general)

[Figure: a network with two hidden units whose corresponding weights are labeled in mirror-image pairs a, a / b, b / d, d]

Can the corresponding (mirror-image) weights ever differ?

NO, by symmetry (the two HUs are in identical environments)

Solution: randomize the initial weights (in, say, [-0.3, 0.3])
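A small initialization sketch (the [-0.3, 0.3] range follows the slide; the layer sizes and the randomize flag are illustrative):

```python
import random

# With randomize=False, every hidden unit starts in an identical environment,
# receives identical weight updates, and so can never differ from its twin.
def init_weights(n_units, n_inputs, randomize=True):
    if randomize:
        return [[random.uniform(-0.3, 0.3) for _ in range(n_inputs)]
                for _ in range(n_units)]
    return [[0.1] * n_inputs for _ in range(n_units)]
```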
Choosing η (‘the learning rate’)

[Figure: an error surface plotted over weight space; gradient descent takes steps of size −η ∂E/∂w_{i,j}. If η is too large, the error oscillates or diverges; if η is too small, the error decreases very slowly]
Adjusting η On-the-Fly

0. Let η = 0.25

1. Measure the average error over k examples – call this E_before

2. Adjust the weights according to the learning algorithm being used

3. Measure the average error on the same k examples – call this E_after
Adjusting η (cont)

4. If E_after > E_before
      then η ← η × 0.99
      else η ← η × 1.01

5. Go to 1

Note: k can be all the training examples, but it could also be a subset
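Steps 0–5 as a sketch (avg_error and adjust_weights are assumed helpers standing in for whatever learner is being used):

```python
def train_with_adaptive_eta(examples, avg_error, adjust_weights, n_rounds=1000):
    eta = 0.25                             # step 0
    for _ in range(n_rounds):
        e_before = avg_error(examples)     # step 1
        adjust_weights(examples, eta)      # step 2
        e_after = avg_error(examples)      # step 3
        if e_after > e_before:             # step 4
            eta *= 0.99                    # overshot: shrink the step size
        else:
            eta *= 1.01                    # still improving: grow it slightly
    return eta
```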
Including a ‘Momentum’ Term in Backprop

To speed up convergence, often another term is added to the weight-update rule:

   ΔW_{i,j}(t) = −η × ∂E/∂W_{i,j} + β × ΔW_{i,j}(t−1)

where ΔW_{i,j}(t−1) is the previous change in weight. Typically 0 < β < 1, with 0.9 a common choice.
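A minimal momentum-update sketch (the flat weight lists and the grad values standing in for ∂E/∂W are illustrative assumptions):

```python
def momentum_update(weights, grads, prev_deltas, eta=0.1, beta=0.9):
    """delta_w(t) = -eta * dE/dw + beta * delta_w(t-1)"""
    new_deltas = [-eta * g + beta * d for g, d in zip(grads, prev_deltas)]
    for idx, d in enumerate(new_deltas):
        weights[idx] += d
    return new_deltas   # feed back in as prev_deltas on the next call
```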
Overfitting Reduction
Approach #1: Using Tuning Sets (Known as ‘Early Stopping’)

[Figure: error rate vs training epochs for the train, tune, and test sets; train error keeps falling while tune and test error eventually rise again. The ideal ANN to choose is the one at the epoch where tune-set error bottoms out – that is the chosen ANN]
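A sketch of early stopping (train_one_epoch and error are assumed helpers; the network is copied so the best-so-far version survives later epochs):

```python
import copy

def early_stopping(net, train_set, tune_set, train_one_epoch, error,
                   max_epochs=200):
    best_net, best_tune_err = copy.deepcopy(net), float("inf")
    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)
        tune_err = error(net, tune_set)
        if tune_err < best_tune_err:        # remember the best-so-far net
            best_net, best_tune_err = copy.deepcopy(net), tune_err
    return best_net   # the ANN from the epoch with minimum tune-set error
```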
Overfitting Reduction
Approach #2: Minimizing a Cost Function
– Cost = Error Rate + Network Complexity
– Essentially what SVMs do (later)

   Cost = TrainSetErrorRate + (λ/2) × Σ_{w ∈ weights} w²

Need to tune the parameter λ (so still use a tuning set)
Overfitting Reduction: Weight Decay (Hinton ’86)

Weights decay toward zero; empirically this improves generalization

Differentiating the cost gives the same gradient as before, plus a decay term:

   ∂Cost/∂w_{i,j} = ∂E/∂w_{i,j} + λ × w_{i,j}

So (for a perceptron-style unit) …

   Δw_{i,j} = η × ( teacher − output ) × out_j − η × λ × w_{i,j}
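A sketch of the decayed update for a single unit (the η and λ values are illustrative assumptions):

```python
def weight_decay_update(w, inputs, teacher, output, eta=0.1, lam=0.001):
    # the usual error-correction term, minus a pull of each weight toward zero
    for j, out_j in enumerate(inputs):
        w[j] += eta * (teacher - output) * out_j - eta * lam * w[j]
```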
Four Views of What Hidden Units Do (Not Necessarily Disjoint)

1. Transform the input space into a new space where perceptrons suffice (relates to SVMs)

2. Probabilistically represent ‘hidden features’ – constructive induction, predicate invention, learning representations, etc (construct new features out of those given; ‘kernels’ do this in SVMs)

3. Divide feature space into many subregions

4. Provide a set of basis functions that can be linearly combined (relates to SVMs)
[Figure: a 2-D feature space scattered with + and − examples, illustrating how hidden units can carve the space into many subregions]
How Many Hidden Units?

Traditional approach – the ‘conventional view’, but no longer recommended:
– Historically one hidden layer is used; how many units should it contain?
  • Too few: can’t learn
  • Too many: poor generalization
– Use a tuning set or cross-validation to select the number of hidden units (see the sketch below)
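A sketch of tuning-set model selection (train_ann, error, and the candidate sizes are assumed helpers/values):

```python
def choose_n_hidden(train_set, tune_set, train_ann, error,
                    candidates=(1, 2, 5, 10, 25, 50)):
    best_n, best_err = None, float("inf")
    for n in candidates:
        net = train_ann(n, train_set)
        err = error(net, tune_set)   # judge on the tuning set, not the train set
        if err < best_err:
            best_n, best_err = n, err
    return best_n
```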
Can One Ever Have Too Many Hidden Units?

Evidence (Weigend, Caruana) suggests that if ‘early stopping’ is used:
– Generalization does not degrade as the number of hidden units → ∞
– Ie, use a tuning set to detect overfitting (recall the ‘early stopping’ slide)
– Weigend gives an explanation in terms of the ‘effective number’ of HUs (an analysis based on principal components and eigenvectors)
ANNs as ‘Universal Approximators’

• Boolean Functions
  – Need one layer of hidden units to represent exactly

• Continuous Functions
  – Approximation to arbitrarily small error with one (possibly quite ‘wide’) layer of hidden units

• Arbitrary Functions
  – Any function can be approximated to arbitrary precision with two layers of hidden units

But note: what can be REPRESENTED is different from what can be ‘easily’ LEARNED
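A toy illustration of the continuous-function case (this bump construction and the sin target are my assumptions, not from the slides): pairs of step-function hidden units make a ‘bump’ over one interval, and a weighted sum of bumps approximates the target.

```python
import math

def step(x):
    return 1.0 if x >= 0 else 0.0

def approx_sin(x, n_bumps=100, lo=0.0, hi=2 * math.pi):
    width = (hi - lo) / n_bumps
    total = 0.0
    for b in range(n_bumps):
        left = lo + b * width
        # step(x-left) - step(x-(left+width)) is 1 exactly on [left, left+width)
        bump = step(x - left) - step(x - (left + width))
        total += math.sin(left + width / 2) * bump  # height = f at the midpoint
    return total

print(approx_sin(1.0), math.sin(1.0))  # a wider layer (more bumps) shrinks the error
```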
Looking for Specific Boolean Inputs (eg, memorize the POS examples)

[Figure: each hidden unit ‘looks for’ one specific positive example (eg, "1101" or "1001"), using weights such as 1/3 or 1/2 on its 1-bits and −∞ on its 0-bits, with bias = 0.99 for all nodes (assume ‘step functions’). The output unit, with a weight of 1 from each hidden unit, becomes an ‘OR’ of all the positive examples looked for]

Hence with enough hidden units one can ‘memorize’ the training data (sketched below)

But what about generalization?
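A sketch of that construction (a large negative constant stands in for the −∞ weights; the example patterns are from the figure):

```python
BIG_NEG = -1e9   # stands in for the slide's -infinity weights

def make_detector(pattern):                 # eg pattern = "1101"
    n_ones = pattern.count("1")
    return [1.0 / n_ones if bit == "1" else BIG_NEG for bit in pattern]

def fires(weights, bits, threshold=0.99):   # step-function unit, bias 0.99
    return 1 if sum(w * b for w, b in zip(weights, bits)) > threshold else 0

detectors = [make_detector(p) for p in ["1101", "1001"]]

def net_output(bits):                       # output unit: an OR of the detectors
    hidden = [fires(d, bits) for d in detectors]
    return fires([1.0] * len(hidden), hidden)

print(net_output([1, 1, 0, 1]))   # 1: a memorized positive example
print(net_output([1, 1, 1, 1]))   # 0: not one of the stored positives
```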
Understanding What a Trained ANN has Learned – Human ‘Readability’ of Trained ANNs is Challenging

Rule Extraction (Craven & Shavlik, 1996)

[Figure: training examples feed a trained ANN – which could be an ENSEMBLE of models – and an extraction algorithm (TREPAN) queries it to produce a decision tree]

Roughly speaking, train ID3 to learn the I/O behavior of the neural network – note we can generate as many labeled training ex’s as desired by forward prop’ing through the trained ANN!
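A rough sketch of that idea (ann_predict, sample_input, and fit_tree are assumed helpers; the real TREPAN algorithm is considerably more involved):

```python
def extract_tree(ann_predict, sample_input, fit_tree, n_queries=10000):
    X = [sample_input() for _ in range(n_queries)]   # as many ex's as desired
    y = [ann_predict(x) for x in X]                  # label by forward prop'ing
    return fit_tree(X, y)                            # eg, an ID3-style learner
```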