CS 621 Artificial Intelligence
Lecture 25 – 14/10/05
Prof. Pushpak Bhattacharyya
Training the Feedforward Network; Backpropagation Algorithm
Multilayer Feedforward Network
- Needed for solving problems which are not linearly separable.
- Hidden layer neurons: assist computation.
[Figure: a multilayer feedforward network with an input layer, a hidden layer, and an output layer; connections run forward only, with no feedback connections.]
Gradient Descent Rule
ΔW_ji ∝ - ∂E/∂W_ji

where W_ji is the weight on the connection from the feeding neuron i to the fed neuron j.

E = error = ½ Σ_{p=1..P} Σ_{m=1..M} (t_m - o_m)^2

This is the TOTAL SUM-SQUARED ERROR (TSS), taken over all P training patterns and all M output neurons.
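As a minimal sketch (the array names and shapes here are assumptions, not from the slides), the TSS can be computed as:

import numpy as np

def tss_error(targets, outputs):
    # targets, outputs: shape (P, M) -- P patterns, M output neurons
    # E = 1/2 * sum over p, m of (t_m - o_m)^2
    return 0.5 * np.sum((targets - outputs) ** 2)

# Example: P = 2 patterns, M = 1 output neuron
t = np.array([[1.0], [0.0]])
o = np.array([[0.8], [0.3]])
print(tss_error(t, o))  # 0.5 * (0.2^2 + 0.3^2) = 0.065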
Gradient Descent For a Single Neuron
[Figure: a single neuron with inputs X_0, X_1, ..., X_n weighted by W_0, W_1, ..., W_n, producing output y; the threshold is represented by the input X_0 = -1 with weight W_0 (= 0 initially).]

Net input: net = Σ_{i=0..n} W_i X_i
Characteristic function: f = sigmoid, i.e. y = f(net) = 1 / (1 + e^(-net))

df/dnet = f(1 - f)

[Figure: plot of y = f(net) against net.]
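A small Python sketch (the function names are my own) of the net input, the sigmoid, and its derivative f(1 - f):

import numpy as np

def net_input(w, x):
    # net = sum over i of W_i * X_i
    return np.dot(w, x)

def sigmoid(net):
    # f(net) = 1 / (1 + e^(-net))
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_derivative(net):
    # df/dnet = f(1 - f)
    f = sigmoid(net)
    return f * (1.0 - f)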
ΔW_i ∝ - ∂E/∂W_i

E = ½ (t - o)^2, where t is the target output and o the observed output.

[Figure: the single neuron with inputs X_0 ... X_n, weights W_0 ... W_n, and output y (= o).]
W = <W_n, ..., W_0>, randomly initialized

ΔW_i ∝ - ∂E/∂W_i
     = - η ∂E/∂W_i, where η is the learning rate, 0 ≤ η ≤ 1
ΔW_i = - η ∂E/∂W_i

∂E/∂W_i = ∂(½(t - o)^2) / ∂W_i
        = (∂E/∂o) * (∂o/∂W_i)        ; chain rule
        = - (t - o) * (∂o/∂net) * (∂net/∂W_i)
∂o/∂net = ∂f(net)/∂net
        = f'(net)
        = f(1 - f)
        = o(1 - o)

[Figure: plot of o against net.]
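As a quick sanity check (my addition, not from the slides), the identity ∂o/∂net = o(1 - o) can be verified against a numerical finite difference:

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

net = 0.7
o = sigmoid(net)
h = 1e-6
finite_diff = (sigmoid(net + h) - sigmoid(net - h)) / (2 * h)
print(o * (1 - o), finite_diff)  # both ~0.2217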
∂net/∂W_i = X_i, since net = Σ_{i=0..n} W_i X_i

[Figure: the single neuron again, highlighting the weight W_i on input X_i.]
E = ½ (t - o)^2

ΔW_i = η (t - o) (1 - o) o X_i

Here (t - o) comes from ∂E/∂o, o(1 - o) from ∂f/∂net, and X_i from ∂net/∂W_i.

[Figure: the single neuron with output o.]
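Putting the three factors together, one training step for the single neuron might look like the following sketch (function and variable names are assumptions for illustration):

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def delta_rule_step(w, x, t, eta=0.5):
    # Forward pass: o = f(net), net = sum of W_i * X_i
    o = sigmoid(np.dot(w, x))
    # Update: Delta W_i = eta * (t - o) * (1 - o) * o * X_i
    return w + eta * (t - o) * (1 - o) * o * x

# Example with X_0 = -1 as the threshold input
x = np.array([-1.0, 0.6, 0.9])
w = np.array([0.0, 0.2, -0.4])
w = delta_rule_step(w, x, t=1.0)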
E = ½ (t - o)^2
ΔW_i = η (t - o) (1 - o) o X_i

Observations:
- If X_i = 0, then ΔW_i = 0.
- The larger X_i is, the larger ΔW_i is.

This is BLAME/CREDIT ASSIGNMENT: an input that contributes more to the output receives a larger share of the weight change.

[Figure: the single neuron with output o.]
- The larger the difference (t - o) is, the larger ΔW is.
- If (t - o) is positive, so is ΔW.
- If (t - o) is negative, so is ΔW.
- If o is 0 or 1, then ΔW = 0.
- o reaches 0 or 1 when net = -∞ or +∞.
- ΔW → 0 because o → 0 or 1. This is called “saturation” or “paralysis” of the network; it happens due to the sigmoid.

[Figure: the sigmoid curve of o against net, flat near 0 and 1.]
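A quick numeric illustration (my sketch) of paralysis: the factor o(1 - o) in ΔW_i collapses toward zero as |net| grows.

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

for net in [0.0, 5.0, 10.0]:
    o = sigmoid(net)
    print(net, o * (1 - o))
# net = 0  -> 0.25     (fastest learning)
# net = 5  -> ~6.6e-3
# net = 10 -> ~4.5e-5  (effectively paralysed)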
Solution to network saturation
1. Use y = k / (1 + e^(-x)), which saturates at k instead of 1.
2. Use y = tanh(x).

[Figure: plots of the two functions against x, showing their horizontal asymptotes.]
Solution to network saturation (contd.)

3. Scale the inputs, i.e. reduce their values. (Caveat: very small values run into the problem of floating/fixed-point number representation error.)
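A sketch of the three remedies in Python (the function names and the particular scaling scheme are my assumptions):

import numpy as np

def scaled_sigmoid(x, k=2.0):
    # Remedy 1: saturates at k instead of 1
    return k / (1.0 + np.exp(-x))

def tanh_activation(x):
    # Remedy 2: use tanh, which is symmetric about 0
    return np.tanh(x)

def scale_inputs(X):
    # Remedy 3: scale inputs down, here to [-1, 1] per feature
    # X: shape (patterns, features)
    return X / np.max(np.abs(X), axis=0)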
ΔW_i = η (t - o) o (1 - o) X_i

The smaller η is, the smaller ΔW is.
Start with large η, gradually decrease it.
[Figure: error E against W_i, with the operating point descending toward the global minimum.]
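One way to realize this schedule (the geometric decay form is an assumption; the slides only say to decrease η) is:

def decayed_eta(eta0, n, decay=0.99):
    # eta starts at eta0 and shrinks geometrically with iteration n
    return eta0 * (decay ** n)

print(decayed_eta(0.9, 0))    # 0.9   -- large steps early on
print(decayed_eta(0.9, 200))  # ~0.12 -- fine steps near the operating point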
Gradient descent training is typically slow. Two parameters control it:
- First parameter: η, the learning rate
- Second parameter: β, the momentum factor, 0 ≤ β ≤ 1
Use a part of the previous weight change in the current weight change:

(ΔW_i)_n = η (t - o) o (1 - o) X_i + β (ΔW_i)_{n-1}

where n indexes the iteration and β is the momentum factor.
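Extending the earlier single-neuron step with the momentum term (a sketch; names are assumptions):

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def momentum_step(w, x, t, prev_dw, eta=0.5, beta=0.05):
    o = sigmoid(np.dot(w, x))
    # (Delta W)_n = eta*(t - o)*o*(1 - o)*X + beta*(Delta W)_{n-1}
    dw = eta * (t - o) * o * (1 - o) * x + beta * prev_dw
    return w + dw, dw  # dw is carried into the next iteration

# prev_dw starts at zero for the first iteration
x = np.array([-1.0, 0.6, 0.9])
w = np.array([0.0, 0.2, -0.4])
dw = np.zeros_like(w)
w, dw = momentum_step(w, x, t=1.0, prev_dw=dw)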
Effect of β
If (ΔW_i)_n and (ΔW_i)_{n-1} are of the same sign, then (ΔW_i)_n is enhanced.
If (ΔW_i)_n and (ΔW_i)_{n-1} are of opposite sign, then the effective (ΔW_i)_n is reduced.
1) Accelerates movement at A.
2) Dampens oscillation near the global minimum.

[Figure: error E against weight W; point A lies on a long slope, and points P, Q, R, S oscillate about the operating point at the global minimum.]
(ΔW_i)_n = η (t - o) o (1 - o) X_i + β (ΔW_i)_{n-1}

The first term is the pure gradient descent term; the second is the momentum term.

Relation between η and β?
Relation between η and β

(ΔW_i)_n = η (t - o) o (1 - o) X_i + β (ΔW_i)_{n-1}

What if η >> β? What if η << β?
Relation between η and β (contd.)

If η << β:
(ΔW_i)_n ≈ β (ΔW_i)_{n-1}, a recurrence relation:

(ΔW_i)_n = β (ΔW_i)_{n-1}
         = β^2 (ΔW_i)_{n-2}
         = β^3 (ΔW_i)_{n-3}
         ...
         = β^n (ΔW_i)_0
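A tiny numeric check (my sketch) of this recurrence: with β a fraction, the update dies out geometrically as β^n (ΔW_i)_0, no matter what the error is.

beta = 0.5
dw0 = 1.0
for n in [1, 5, 10, 20]:
    print(n, (beta ** n) * dw0)
# 1  -> 0.5
# 5  -> 0.03125
# 10 -> ~0.00098
# 20 -> ~9.5e-7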
Relation between η and β (contd.)

Empirical practice: β is typically 1/10th of η.

If β is very large compared to η, the effect of the output error, the input, and the neuron characteristics is not felt. Also, ΔW keeps decreasing, since β is a fraction.