CS 540 - Fall 2015 (Shavlik©), Lecture 20, Week 10 (11/10/15)


Today’s Topics
• Read: Chapters 7, 8, and 9 on Logical Representation and Reasoning
• HW3 due at 11:55pm THURS (ditto for your Nannon Tourney Entry)
• Recipe for using Backprop to Train an ANN
• Adjusting the Learning Rate (η)
• The Momentum Term (β)
• Reducing Overfitting in ANNs
  – Early Stopping
  – Weight Decay
• Understanding Hidden Units
• Choosing the Number of Hidden Units
• ANNs as Universal Approximators
• Learning what an ANN has Learned


Using BP to Train ANNs

1. Initialize weights & bias to small random values (eg, in [-0.3, 0.3])

2. Randomize order of training examples. For each example, do:
   a) Propagate activity forward to output units

[Network diagram: input units k → hidden units j → output units i]

out_i = F( Σ_j w_i,j × out_j )


Using BP to Train ANNs (continued)

b) Compute ‘deviation’ for output units:   δ_i = F'(net_i) × (Teacher_i - out_i)

c) Compute ‘deviation’ for hidden units:   δ_j = F'(net_j) × Σ_i ( w_i,j × δ_i )

d) Update weights:
   Δw_i,j = η × δ_i × out_j
   Δw_j,k = η × δ_j × out_k

[Figure: the activation function F(net_i) and its derivative F'(net_i) plotted against net_i]


Aside: the book (Fig 18.24) uses Δ instead of δ, g instead of F, and α instead of η


Using BP to Train ANNs (concluded)

3. Repeat until training-set error rate small enough

Actually, should use early stopping (ie, minimize error on the tuning set; more details later)

Some jargon: Each cycle through all training examples is called an epoch

4. Measure accuracy on test set to estimate generalization (future accuracy)
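A minimal NumPy sketch of the full recipe above, for a single-hidden-layer network of sigmoid units using the slide’s k → j → i layer labels. The data, layer sizes, learning rate, and stopping threshold are all illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# Hypothetical training data: X is examples x inputs, T is examples x targets.
X = rng.integers(0, 2, size=(20, 4)).astype(float)
T = (X.sum(axis=1, keepdims=True) > 2).astype(float)

n_in, n_hid, n_out = X.shape[1], 3, T.shape[1]
eta = 0.25

# Step 1: initialize weights and biases to small random values in [-0.3, 0.3].
W_ji = rng.uniform(-0.3, 0.3, size=(n_hid, n_in))    # input (k) -> hidden (j)
b_j  = rng.uniform(-0.3, 0.3, size=n_hid)
W_ij = rng.uniform(-0.3, 0.3, size=(n_out, n_hid))   # hidden (j) -> output (i)
b_i  = rng.uniform(-0.3, 0.3, size=n_out)

for epoch in range(1000):                  # Step 3: repeat over many epochs
    order = rng.permutation(len(X))        # Step 2: randomize example order
    for n in order:
        x, t = X[n], T[n]
        # (a) propagate activity forward to the output units
        out_j = sigmoid(W_ji @ x + b_j)
        out_i = sigmoid(W_ij @ out_j + b_i)
        # (b) deviation for output units: delta_i = F'(net_i) (Teacher_i - out_i)
        delta_i = out_i * (1 - out_i) * (t - out_i)
        # (c) deviation for hidden units: delta_j = F'(net_j) sum_i w_i,j delta_i
        delta_j = out_j * (1 - out_j) * (W_ij.T @ delta_i)
        # (d) update weights: Delta w = eta * delta * out
        W_ij += eta * np.outer(delta_i, out_j)
        b_i  += eta * delta_i
        W_ji += eta * np.outer(delta_j, x)
        b_j  += eta * delta_j
    # Step 3: stop once the training-set error is small enough
    preds = sigmoid(W_ij @ sigmoid(X @ W_ji.T + b_j).T + b_i[:, None]).T
    if np.mean((preds - T) ** 2) < 0.01:
        break
# Step 4: measure accuracy on a held-out test set to estimate generalization.
```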


The Need for Symmetry Breaking (if using hidden units)

Assume all weights are initially the same (the drawing is a bit more general)

[Figure: a small network drawn with mirror-image weights a, a, b, b, d, d feeding two hidden units]

Can the corresponding (mirror-image) weights ever differ?

NO, by symmetry (the two HUs are in identical environments) – think about WHY.

Solution: randomize the initial weights (in, say, [-0.3, 0.3])
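A tiny numerical illustration of the point above (the network, data, and learning rate are assumed): two hidden units that start with identical weights receive identical backprop gradients on every step, so their weights never diverge.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# One input pattern and target, purely for illustration.
x, t, eta = np.array([1.0, 0.0]), np.array([1.0]), 0.25

# Two hidden units with IDENTICAL initial weights (no symmetry breaking).
W_hid = np.full((2, 2), 0.1)      # both hidden units' weight vectors identical
W_out = np.full((1, 2), 0.1)      # both hidden->output weights identical

for _ in range(100):
    out_h = sigmoid(W_hid @ x)
    out_o = sigmoid(W_out @ out_h)
    delta_o = out_o * (1 - out_o) * (t - out_o)
    delta_h = out_h * (1 - out_h) * (W_out.T @ delta_o).ravel()
    W_out += eta * np.outer(delta_o, out_h)
    W_hid += eta * np.outer(delta_h, x)

# The two hidden units' weight vectors are still identical after training:
print(W_hid[0], W_hid[1], np.allclose(W_hid[0], W_hid[1]))  # ... True
```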


Choosing η (‘the learning rate’)

[Figure: error plotted over weight space; gradient descent moves along -η ∂E/∂W_i,j]

• η too large: the error can oscillate or overshoot the minimum
• η too small: the error decreases very slowly


Adjusting η On-the-Fly

0. Let η = 0.25

1. Measure ave. error over k examples – call this E_before

2. Adjust wgts according to learning algorithm being used

3. Measure ave. error on same k examples – call this E_after


Adjusting η (cont)

4. If E_after > E_before
       then η ← η × 0.99
       else η ← η × 1.01

5. Go to step 1

Note: k can be all training examples but could be a subset
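A sketch of this adjust-on-the-fly loop. `train_one_pass` and `average_error` are hypothetical hooks standing in for whatever learner and error measure are being used.

```python
def adapt_learning_rate(train_one_pass, average_error, examples, n_steps=100):
    """Shrink or grow eta depending on whether the last update helped on the
    same k examples."""
    eta = 0.25                                  # step 0
    for _ in range(n_steps):
        e_before = average_error(examples)      # step 1
        train_one_pass(eta, examples)           # step 2: adjust the weights
        e_after = average_error(examples)       # step 3
        if e_after > e_before:                  # step 4: got worse -> smaller steps
            eta *= 0.99
        else:                                   # got better -> slightly bolder steps
            eta *= 1.01
    return eta                                  # step 5: loop
```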


Including a ‘Momentum’ Term in Backprop

To speed up convergence, often another term is added to the weight-update rule

Typically, 0 < β < 1, 0.9 common choice

ΔW_i,j(t) = -η × ∂E/∂W_i,j + β × ΔW_i,j(t-1)

(the second term is the previous change in this weight)
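A minimal sketch of the momentum update above; `grad_E` stands for ∂E/∂W_i,j computed by backprop and is a hypothetical helper.

```python
import numpy as np

def momentum_step(W, grad_E, prev_delta, eta=0.25, beta=0.9):
    """Delta W(t) = -eta * dE/dW + beta * Delta W(t-1)."""
    delta = -eta * grad_E + beta * prev_delta   # current weight change
    return W + delta, delta                     # new weights and Delta W(t)

# Usage: keep prev_delta between updates (initialize it to np.zeros_like(W)).
# W, prev_delta = momentum_step(W, grad_E, prev_delta)
```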


Overfitting Reduction

Approach #1: Using Tuning Sets (Known as ‘Early Stopping’)

[Figure: error (up to ~50%) vs. number of training epochs for the Train, Tune, and Test sets; training error keeps falling while tune and test error eventually rise again. The chosen ANN is the one saved at the epoch with minimum tuning-set error; the ideal ANN to choose is the one at the epoch with minimum test-set error.]
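A sketch of early stopping with a tuning set, keeping the weights from the epoch with the lowest tuning-set error. The `train_one_epoch`, `error`, `get_weights`, and `set_weights` hooks are hypothetical placeholders for the learner being trained.

```python
def train_with_early_stopping(train_one_epoch, error, get_weights, set_weights,
                              tune_set, max_epochs=1000, patience=20):
    """Keep the weights from the epoch with the lowest tuning-set error."""
    best_err, best_weights, epochs_since_best = float("inf"), get_weights(), 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one pass over the training set
        tune_err = error(tune_set)             # monitor held-out tuning error
        if tune_err < best_err:
            best_err, best_weights, epochs_since_best = tune_err, get_weights(), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # tuning error stopped improving
                break
    set_weights(best_weights)                  # the 'chosen ANN'
    return best_err
```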


Overfitting Reduction

Approach #2: Minimizing a Cost Function
– Cost = Error Rate + Network Complexity
– Essentially what SVMs do (later)

Cost = TrainSetErrorRate + (λ/2) × Σ_weights w²


Need to tune the parameter λ (so we still use a tuning set)


Overfitting Reduction: Weight Decay (Hinton ’86)

Weights decay toward zero. Empirically this improves generalization.

∂Cost/∂W_i,j is the same as ∂E/∂W_i,j before, plus the decay term λ × w_i,j

So …   Δw_i,j = η × (teacher_i - output_i) × out_j - λ × w_i,j
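A one-line sketch of the resulting update, written as a gradient step on Cost = Error + (λ/2) Σ w²; `grad_E` is the ordinary error gradient and is a hypothetical helper.

```python
import numpy as np

def weight_decay_step(W, grad_E, eta=0.25, lam=0.001):
    """Gradient step on Cost = Error + (lam/2) * sum(w^2):
    dCost/dW = dE/dW + lam * W, so every weight decays toward zero."""
    return W - eta * (grad_E + lam * W)

# lam (lambda) is tuned on a tuning set; larger values penalize complexity more.
```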


Four Views of What Hidden Units Do (Not Necessarily Disjoint)

1. Transform the input space into a new space where perceptrons suffice (relates to SVMs)   [Figure: a perceptron]

2. Probabilistically represent ‘hidden features’ – constructive induction, predicate invention, learning representations, etc (construct new features out of those given; ‘kernels’ do this in SVMs)


Four Views of What Hidden Units Do (Not Necessarily Disjoint)

3. Divide feature space into many subregions

4. Provide a set of basis functions which can be linearly combined (relates to SVMs)


[Figure: a 2-D feature space scattered with + and - examples; the subregions carved out by the hidden units separate the positive clusters from the negative ones]


How Many Hidden Units?

Traditional approach (but no longer recommended) – the ‘conventional view’:
• Historically one hidden layer is used
  – How many units should it contain?
    • Too few: can’t learn
    • Too many: poor generalization
  – Use a tuning set or cross-validation to select the number of hidden units (see the sketch below)
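One way to carry out that tuning-set / cross-validation selection, sketched with scikit-learn (the candidate layer sizes and the synthetic data are placeholders, not values from the lecture).

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Placeholder data; substitute the real training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Candidate numbers of hidden units for a single hidden layer.
param_grid = {"hidden_layer_sizes": [(2,), (5,), (10,), (25,), (50,)]}

search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print("Selected hidden-layer size:", search.best_params_)
```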


Can One Ever Have Too Many Hidden Units?

Evidence (Weigend, Caruana) suggests that if ‘early stopping’ is used

– Generalization does not degrade as the number of hidden units → ∞

– Ie, use the tuning set to detect overfitting (recall the ‘early stopping’ slide)

– Weigend gives an explanation in terms of the ‘effective number’ of HUs (analysis based on principal components and eigenvectors)


ANNs as ‘Universal Approximators’

• Boolean Functions
  – Need one layer of hidden units to represent exactly

• Continuous Functions
  – Approximation to arbitrarily small error with one (possibly quite ‘wide’) layer of hidden units

• Arbitrary Functions
  – Any function can be approximated to arbitrary precision with two layers of hidden units


But note, what can be REPRESENTED is different from what can be ‘easily’ LEARNED


Looking for Specific Boolean Inputs (eg, memorize the POS examples)

[Figure: a network of ‘step function’ units, all with bias/threshold 0.99, in which each hidden unit looks for one positive example, e.g., "1101" or "1001": inputs that must be 1 get small positive weights (1/3, 1/2, …) summing to 1, and inputs that must be 0 get weight -∞, so a hidden unit fires only on exactly its own pattern. The output unit becomes an ‘OR’ of all the positive examples looked for.]

Hence with enough hidden units we can ‘memorize’ the training data

But what about generalization?
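A small sketch of the construction in the figure above: one step-function hidden unit per positive example and an output unit that ORs them. The 1/(#ones) input weights and the 0.99 threshold follow the slide; using a large negative constant for -∞ and weight 1 on the hidden-to-output links are implementation assumptions.

```python
import numpy as np

NEG_INF = -1e9  # stands in for the slide's -infinity weights

def step(net, threshold=0.99):
    """Step activation: output 1 only if the net input exceeds 0.99."""
    return (np.asarray(net) > threshold).astype(float)

def build_memorizer(positive_examples):
    """One hidden unit per positive Boolean example; the output unit ORs them."""
    pos = np.asarray(positive_examples, dtype=float)
    ones = pos.sum(axis=1, keepdims=True)          # number of 1s in each pattern
    # Weight 1/(#ones) on inputs that must be 1, -inf on inputs that must be 0,
    # so each hidden unit fires only on exactly its own pattern.
    W_hid = np.where(pos == 1, 1.0 / ones, NEG_INF)
    W_out = np.ones(len(pos))                      # OR: any firing hidden unit suffices
    def classify(x):
        hidden = step(W_hid @ np.asarray(x, dtype=float))
        return step(W_out @ hidden)
    return classify

net = build_memorizer([[1, 1, 0, 1], [1, 0, 0, 1]])   # the slide's "1101" and "1001"
print(net([1, 1, 0, 1]), net([1, 0, 0, 1]), net([0, 1, 1, 0]))  # 1.0 1.0 0.0
```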


Understanding What a Trained ANN has Learned - Human ‘Readability’ of Trained ANNs is Challenging

Rule Extraction (Craven & Shavlik, 1996)

[Figure: the training examples and the trained ANN feed into an extraction algorithm (TREPAN), which outputs a human-readable model]

Roughly speaking, train ID3 to learn the I/O behavior of the neural network – note we can generate as many labeled training examples as desired by forward-prop’ing them through the trained ANN!

(The model whose I/O behavior is extracted could be an ENSEMBLE of models.)
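TREPAN itself uses its own querying and tree-growing strategy, but the core idea can be sketched simply: generate as many inputs as desired, label them by forward-propagating through the trained network (or ensemble), and fit a decision tree to that I/O behavior. The helper names and the tree learner below are illustrative stand-ins, not the actual TREPAN algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def extract_tree(ann_predict, sample_inputs, n_samples=10000, max_depth=4):
    """Fit a tree that mimics the I/O behavior of a trained network."""
    X = sample_inputs(n_samples)     # as many unlabeled inputs as we like
    y = ann_predict(X)               # labels come from the trained ANN (the 'oracle')
    return DecisionTreeClassifier(max_depth=max_depth).fit(X, y)

# Hypothetical usage, where `ann` is any trained model (or ensemble):
# rng = np.random.default_rng(0)
# tree = extract_tree(ann.predict, lambda n: rng.random((n, num_features)))
# print(export_text(tree))           # a human-readable approximation of the ANN
```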
