Today’s Topics
• Read: Chapters 7, 8, and 9 on Logical Representation and Reasoning
• HW3 due at 11:55pm THURS (ditto for your Nannon Tourney Entry)
• Recipe for Using Backprop to Train an ANN
• Adjusting the Learning Rate (η)
• The Momentum Term (β)
• Reducing Overfitting in ANNs
  – Early Stopping
  – Weight Decay
• Understanding Hidden Units
• Choosing the Number of Hidden Units
• ANNs as Universal Approximators
• Learning What an ANN has Learned
Using BP to Train ANNs

1. Initialize weights & biases to small random values (eg, in [-0.3, 0.3])

2. Randomize the order of the training examples; for each one do:

   a) Propagate activity forward to the output units:

      out_i = F( Σ_j w_{i,j} × out_j )

      (layer labels, from input to output: k → j → i)
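A minimal sketch of step 2a in Python (the sigmoid choice of F, the layer sizes, and the weight values are illustrative assumptions, not from the slides):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(weights, biases, inputs):
    """One layer of the forward pass: out_i = F(bias_i + sum_j w[i][j] * out_j)."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

# Tiny 2-input, 2-hidden, 1-output net with made-up weights
hidden = forward([[0.1, -0.2], [0.25, 0.3]], [0.05, -0.1], [1.0, 0.0])
output = forward([[0.2, -0.15]], [0.0], hidden)
```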
Using BP to Train ANNs (continued)

   b) Compute the ‘deviation’ for the output units:

      δ_i = F′( net_i ) × ( Teacher_i − out_i )

   c) Compute the ‘deviation’ for the hidden units:

      δ_j = F′( net_j ) × Σ_i ( w_{i,j} × δ_i )

   d) Update the weights:

      Δw_{i,j} = η × δ_i × out_j
      Δw_{j,k} = η × δ_j × out_k
Aside: the book (Fig 18.24) uses Δ instead of δ, g instead of F, and a instead of out
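Putting steps (b)–(d) together, a minimal Python sketch (sigmoid units assumed, so F′(net) = out × (1 − out); η and the list-of-lists weight layout are illustrative):

```python
def backprop_step(w_out, w_hid, inputs, hidden, outputs, teacher, eta=0.1):
    # b) deviation for output units: delta_i = F'(net_i) * (teacher_i - out_i)
    delta_out = [o * (1 - o) * (t - o) for o, t in zip(outputs, teacher)]
    # c) deviation for hidden units: delta_j = F'(net_j) * sum_i(w[i][j] * delta_i)
    delta_hid = [h * (1 - h) * sum(w_out[i][j] * delta_out[i]
                                   for i in range(len(delta_out)))
                 for j, h in enumerate(hidden)]
    # d) weight updates: Delta w = eta * delta * out
    for i, d_i in enumerate(delta_out):
        for j, h in enumerate(hidden):
            w_out[i][j] += eta * d_i * h
    for j, d_j in enumerate(delta_hid):
        for k, x in enumerate(inputs):
            w_hid[j][k] += eta * d_j * x
```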
Using BP to Train ANNs (concluded)

3. Repeat until the training-set error rate is small enough

   Actually, one should use early stopping (ie, minimize error on the tuning set; more details later)

   Some jargon: each cycle through all the training examples is called an epoch

4. Measure accuracy on the test set to estimate generalization (future accuracy)
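The whole recipe as a sketch (backprop_example, error_rate, and the data-set objects are assumed helpers standing in for the pieces above):

```python
import random

def train(net, train_set, test_set, backprop_example, error_rate,
          target_err=0.05, max_epochs=500):
    for epoch in range(max_epochs):           # each full pass is one epoch
        random.shuffle(train_set)             # step 2: randomize the order
        for example in train_set:
            backprop_example(net, example)    # steps 2a-2d
        if error_rate(net, train_set) < target_err:
            break                             # step 3 (early stopping is better)
    return error_rate(net, test_set)          # step 4: estimate generalization
```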
The Need for Symmetry Breaking (if hidden units)

Assume all weights are initially the same (the drawing below is a bit more general)

[Figure: a network with two hidden units whose corresponding weights are labeled in mirror-image pairs a, a / b, b / d, d]

Can the corresponding (mirror-image) weights ever differ?

NO, by symmetry (the two HUs are in identical environments)

Solution: randomize the initial weights (in, say, [-0.3, 0.3])
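A small initialization sketch (the [-0.3, 0.3] range follows the slide; the layer sizes and the randomize flag are illustrative):

```python
import random

# With randomize=False, every hidden unit starts in an identical environment,
# receives identical weight updates, and so can never differ from its twin.
def init_weights(n_units, n_inputs, randomize=True):
    if randomize:
        return [[random.uniform(-0.3, 0.3) for _ in range(n_inputs)]
                for _ in range(n_units)]
    return [[0.1] * n_inputs for _ in range(n_units)]
```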
Choosing η (‘the learning rate’)

[Figure: an error surface plotted over weight space; gradient descent takes steps of size −η ∂E/∂w_{i,j}. If η is too large, the error oscillates or diverges; if η is too small, the error decreases very slowly]
Adjusting η On-the-Fly

0. Let η = 0.25

1. Measure the average error over k examples – call this E_before

2. Adjust the weights according to the learning algorithm being used

3. Measure the average error on the same k examples – call this E_after
Adjusting η (cont)

4. If E_after > E_before
      then η ← η × 0.99
      else η ← η × 1.01

5. Go to 1

Note: k can be all the training examples, but it could also be a subset
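Steps 0–5 as a sketch (avg_error and adjust_weights are assumed helpers standing in for whatever learner is being used):

```python
def train_with_adaptive_eta(examples, avg_error, adjust_weights, n_rounds=1000):
    eta = 0.25                             # step 0
    for _ in range(n_rounds):
        e_before = avg_error(examples)     # step 1
        adjust_weights(examples, eta)      # step 2
        e_after = avg_error(examples)      # step 3
        if e_after > e_before:             # step 4
            eta *= 0.99                    # overshot: shrink the step size
        else:
            eta *= 1.01                    # still improving: grow it slightly
    return eta
```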
Including a ‘Momentum’ Term in Backprop

To speed up convergence, often another term is added to the weight-update rule:

   ΔW_{i,j}(t) = −η × ∂E/∂W_{i,j} + β × ΔW_{i,j}(t−1)

where ΔW_{i,j}(t−1) is the previous change in weight. Typically 0 < β < 1, with 0.9 a common choice.
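A minimal momentum-update sketch (the flat weight lists and the grad values standing in for ∂E/∂W are illustrative assumptions):

```python
def momentum_update(weights, grads, prev_deltas, eta=0.1, beta=0.9):
    """delta_w(t) = -eta * dE/dw + beta * delta_w(t-1)"""
    new_deltas = [-eta * g + beta * d for g, d in zip(grads, prev_deltas)]
    for idx, d in enumerate(new_deltas):
        weights[idx] += d
    return new_deltas   # feed back in as prev_deltas on the next call
```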
Overfitting Reduction
Approach #1: Using Tuning Sets (Known as ‘Early Stopping’)

[Figure: error rate vs training epochs for the train, tune, and test sets; train error keeps falling while tune and test error eventually rise again. The ideal ANN to choose is the one at the epoch where tune-set error bottoms out – that is the chosen ANN]
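A sketch of early stopping (train_one_epoch and error are assumed helpers; the network is copied so the best-so-far version survives later epochs):

```python
import copy

def early_stopping(net, train_set, tune_set, train_one_epoch, error,
                   max_epochs=200):
    best_net, best_tune_err = copy.deepcopy(net), float("inf")
    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)
        tune_err = error(net, tune_set)
        if tune_err < best_tune_err:        # remember the best-so-far net
            best_net, best_tune_err = copy.deepcopy(net), tune_err
    return best_net   # the ANN from the epoch with minimum tune-set error
```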
Overfitting Reduction
Approach #2: Minimizing a Cost Function
– Cost = Error Rate + Network Complexity
– Essentially what SVMs do (later)

   Cost = TrainSetErrorRate + (λ/2) × Σ_{w ∈ weights} w²

Need to tune the parameter λ (so still use a tuning set)
Overfitting Reduction: Weight Decay (Hinton ’86)

Weights decay toward zero; empirically this improves generalization

Differentiating the cost gives the same gradient as before, plus a decay term:

   ∂Cost/∂w_{i,j} = ∂E/∂w_{i,j} + λ × w_{i,j}

So (for a perceptron-style unit) …

   Δw_{i,j} = η × ( teacher − output ) × out_j − η × λ × w_{i,j}
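A sketch of the decayed update for a single unit (the η and λ values are illustrative assumptions):

```python
def weight_decay_update(w, inputs, teacher, output, eta=0.1, lam=0.001):
    # the usual error-correction term, minus a pull of each weight toward zero
    for j, out_j in enumerate(inputs):
        w[j] += eta * (teacher - output) * out_j - eta * lam * w[j]
```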
Four Views of What Hidden Units Do (Not Necessarily Disjoint)

1. Transform the input space into a new space where perceptrons suffice (relates to SVMs)

2. Probabilistically represent ‘hidden features’ – constructive induction, predicate invention, learning representations, etc (construct new features out of those given; ‘kernels’ do this in SVMs)

3. Divide feature space into many subregions

4. Provide a set of basis functions that can be linearly combined (relates to SVMs)
[Figure: a 2-D feature space scattered with + and − examples, illustrating how hidden units can carve the space into many subregions]
How Many Hidden Units?

Traditional approach – the ‘conventional view’, but no longer recommended:
– Historically one hidden layer is used; how many units should it contain?
  • Too few: can’t learn
  • Too many: poor generalization
– Use a tuning set or cross-validation to select the number of hidden units (see the sketch below)
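A sketch of tuning-set model selection (train_ann, error, and the candidate sizes are assumed helpers/values):

```python
def choose_n_hidden(train_set, tune_set, train_ann, error,
                    candidates=(1, 2, 5, 10, 25, 50)):
    best_n, best_err = None, float("inf")
    for n in candidates:
        net = train_ann(n, train_set)
        err = error(net, tune_set)   # judge on the tuning set, not the train set
        if err < best_err:
            best_n, best_err = n, err
    return best_n
```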
Can One Ever Have Too Many Hidden Units?

Evidence (Weigend, Caruana) suggests that if ‘early stopping’ is used:
– Generalization does not degrade as the number of hidden units → ∞
– Ie, use a tuning set to detect overfitting (recall the ‘early stopping’ slide)
– Weigend gives an explanation in terms of the ‘effective number’ of HUs (an analysis based on principal components and eigenvectors)
ANNs as ‘Universal Approximators’

• Boolean Functions
  – Need one layer of hidden units to represent exactly

• Continuous Functions
  – Approximation to arbitrarily small error with one (possibly quite ‘wide’) layer of hidden units

• Arbitrary Functions
  – Any function can be approximated to arbitrary precision with two layers of hidden units

But note: what can be REPRESENTED is different from what can be ‘easily’ LEARNED
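A toy illustration of the continuous-function case (this bump construction and the sin target are my assumptions, not from the slides): pairs of step-function hidden units make a ‘bump’ over one interval, and a weighted sum of bumps approximates the target.

```python
import math

def step(x):
    return 1.0 if x >= 0 else 0.0

def approx_sin(x, n_bumps=100, lo=0.0, hi=2 * math.pi):
    width = (hi - lo) / n_bumps
    total = 0.0
    for b in range(n_bumps):
        left = lo + b * width
        # step(x-left) - step(x-(left+width)) is 1 exactly on [left, left+width)
        bump = step(x - left) - step(x - (left + width))
        total += math.sin(left + width / 2) * bump  # height = f at the midpoint
    return total

print(approx_sin(1.0), math.sin(1.0))  # a wider layer (more bumps) shrinks the error
```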
Looking for Specific Boolean Inputs (eg, memorize the POS examples)

[Figure: each hidden unit ‘looks for’ one specific positive example (eg, "1101" or "1001"), using weights such as 1/3 or 1/2 on its 1-bits and −∞ on its 0-bits, with bias = 0.99 for all nodes (assume ‘step functions’). The output unit, with a weight of 1 from each hidden unit, becomes an ‘OR’ of all the positive examples looked for]

Hence with enough hidden units one can ‘memorize’ the training data (sketched below)

But what about generalization?
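A sketch of that construction (a large negative constant stands in for the −∞ weights; the example patterns are from the figure):

```python
BIG_NEG = -1e9   # stands in for the slide's -infinity weights

def make_detector(pattern):                 # eg pattern = "1101"
    n_ones = pattern.count("1")
    return [1.0 / n_ones if bit == "1" else BIG_NEG for bit in pattern]

def fires(weights, bits, threshold=0.99):   # step-function unit, bias 0.99
    return 1 if sum(w * b for w, b in zip(weights, bits)) > threshold else 0

detectors = [make_detector(p) for p in ["1101", "1001"]]

def net_output(bits):                       # output unit: an OR of the detectors
    hidden = [fires(d, bits) for d in detectors]
    return fires([1.0] * len(hidden), hidden)

print(net_output([1, 1, 0, 1]))   # 1: a memorized positive example
print(net_output([1, 1, 1, 1]))   # 0: not one of the stored positives
```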
Understanding What a Trained ANN has Learned – Human ‘Readability’ of Trained ANNs is Challenging

Rule Extraction (Craven & Shavlik, 1996)

[Figure: training examples feed a trained ANN – which could be an ENSEMBLE of models – and an extraction algorithm (TREPAN) queries it to produce a decision tree]

Roughly speaking, train ID3 to learn the I/O behavior of the neural network – note we can generate as many labeled training ex’s as desired by forward prop’ing through the trained ANN!
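A rough sketch of that idea (ann_predict, sample_input, and fit_tree are assumed helpers; the real TREPAN algorithm is considerably more involved):

```python
def extract_tree(ann_predict, sample_input, fit_tree, n_queries=10000):
    X = [sample_input() for _ in range(n_queries)]   # as many ex's as desired
    y = [ann_predict(x) for x in X]                  # label by forward prop'ing
    return fit_tree(X, y)                            # eg, an ID3-style learner
```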