CS 621 Artificial Intelligence
Lecture 25 – 14/10/05
Prof. Pushpak Bhattacharyya
Training the Feedforward Network; Backpropagation Algorithm
Multilayer Feedforward Network
- Needed for solving problems which are not linearly separable.
- Hidden layer neurons: assist computation.
[Figure: a multilayer feedforward network with an input layer, a hidden layer, and an output layer; connections run forward only, with no feedback connections.]
Gradient Descent Rule
ΔW_ji ∝ - ∂E/∂W_ji

where W_ji is the weight on the connection from the feeding neuron i to the fed neuron j.

E = error = ½ Σ_{p=1..P} Σ_{m=1..M} (t_m - o_m)^2

This is the TOTAL SUM-SQUARED ERROR (TSS), taken over all P training patterns and all M output neurons.
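As a minimal sketch (the array names and shapes here are assumptions, not from the slides), the TSS can be computed as:

import numpy as np

def tss_error(targets, outputs):
    # targets, outputs: shape (P, M) -- P patterns, M output neurons
    # E = 1/2 * sum over p, m of (t_m - o_m)^2
    return 0.5 * np.sum((targets - outputs) ** 2)

# Example: P = 2 patterns, M = 1 output neuron
t = np.array([[1.0], [0.0]])
o = np.array([[0.8], [0.3]])
print(tss_error(t, o))  # 0.5 * (0.2^2 + 0.3^2) = 0.065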
Gradient Descent For a Single Neuron
[Figure: a single neuron with inputs X_0, X_1, ..., X_n weighted by W_0, W_1, ..., W_n, producing output y; the threshold is represented by the input X_0 = -1 with weight W_0 (= 0 initially).]

Net input: net = Σ_{i=0..n} W_i X_i
Characteristic function: f = sigmoid, i.e. y = f(net) = 1 / (1 + e^(-net))

df/dnet = f(1 - f)

[Figure: plot of y = f(net) against net.]
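A small Python sketch (the function names are my own) of the net input, the sigmoid, and its derivative f(1 - f):

import numpy as np

def net_input(w, x):
    # net = sum over i of W_i * X_i
    return np.dot(w, x)

def sigmoid(net):
    # f(net) = 1 / (1 + e^(-net))
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_derivative(net):
    # df/dnet = f(1 - f)
    f = sigmoid(net)
    return f * (1.0 - f)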
ΔW_i ∝ - ∂E/∂W_i

E = ½ (t - o)^2, where t is the target output and o the observed output.

[Figure: the single neuron with inputs X_0 ... X_n, weights W_0 ... W_n, and output y (= o).]
W = <W_n, ..., W_0>, randomly initialized

ΔW_i ∝ - ∂E/∂W_i
     = - η ∂E/∂W_i, where η is the learning rate, 0 ≤ η ≤ 1
ΔW_i = - η ∂E/∂W_i

∂E/∂W_i = ∂(½(t - o)^2) / ∂W_i
        = (∂E/∂o) * (∂o/∂W_i)        ; chain rule
        = - (t - o) * (∂o/∂net) * (∂net/∂W_i)
∂o/∂net = ∂f(net)/∂net
        = f'(net)
        = f(1 - f)
        = o(1 - o)

[Figure: plot of o against net.]
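As a quick sanity check (my addition, not from the slides), the identity ∂o/∂net = o(1 - o) can be verified against a numerical finite difference:

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

net = 0.7
o = sigmoid(net)
h = 1e-6
finite_diff = (sigmoid(net + h) - sigmoid(net - h)) / (2 * h)
print(o * (1 - o), finite_diff)  # both ~0.2217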
∂net/∂W_i = X_i, since net = Σ_{i=0..n} W_i X_i

[Figure: the single neuron again, highlighting the weight W_i on input X_i.]
E = ½ (t - o)^2

ΔW_i = η (t - o) (1 - o) o X_i

Here (t - o) comes from ∂E/∂o, o(1 - o) from ∂f/∂net, and X_i from ∂net/∂W_i.

[Figure: the single neuron with output o.]
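Putting the three factors together, one training step for the single neuron might look like the following sketch (function and variable names are assumptions for illustration):

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def delta_rule_step(w, x, t, eta=0.5):
    # Forward pass: o = f(net), net = sum of W_i * X_i
    o = sigmoid(np.dot(w, x))
    # Update: Delta W_i = eta * (t - o) * (1 - o) * o * X_i
    return w + eta * (t - o) * (1 - o) * o * x

# Example with X_0 = -1 as the threshold input
x = np.array([-1.0, 0.6, 0.9])
w = np.array([0.0, 0.2, -0.4])
w = delta_rule_step(w, x, t=1.0)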
E = ½ (t - o)^2
ΔW_i = η (t - o) (1 - o) o X_i

Observations:
- If X_i = 0, then ΔW_i = 0.
- The larger X_i is, the larger ΔW_i is.

This is BLAME/CREDIT ASSIGNMENT: an input that contributes more to the output receives a larger share of the weight change.

[Figure: the single neuron with output o.]
- The larger the difference (t - o) is, the larger ΔW is.
- If (t - o) is positive, so is ΔW.
- If (t - o) is negative, so is ΔW.
- If o is 0 or 1, then ΔW = 0.
- o reaches 0 or 1 when net = -∞ or +∞.
- ΔW → 0 because o → 0 or 1. This is called “saturation” or “paralysis” of the network; it happens due to the sigmoid.

[Figure: the sigmoid curve of o against net, flat near 0 and 1.]
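A quick numeric illustration (my sketch) of paralysis: the factor o(1 - o) in ΔW_i collapses toward zero as |net| grows.

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

for net in [0.0, 5.0, 10.0]:
    o = sigmoid(net)
    print(net, o * (1 - o))
# net = 0  -> 0.25     (fastest learning)
# net = 5  -> ~6.6e-3
# net = 10 -> ~4.5e-5  (effectively paralysed)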
Solution to network saturation
1. Use y = k / (1 + e^(-x)), which saturates at k instead of 1.
2. Use y = tanh(x).

[Figure: plots of the two functions against x, showing their horizontal asymptotes.]
Solution to network saturation (contd.)

3. Scale the inputs, i.e. reduce their values. (Caveat: very small values run into the problem of floating/fixed-point number representation error.)
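A sketch of the three remedies in Python (the function names and the particular scaling scheme are my assumptions):

import numpy as np

def scaled_sigmoid(x, k=2.0):
    # Remedy 1: saturates at k instead of 1
    return k / (1.0 + np.exp(-x))

def tanh_activation(x):
    # Remedy 2: use tanh, which is symmetric about 0
    return np.tanh(x)

def scale_inputs(X):
    # Remedy 3: scale inputs down, here to [-1, 1] per feature
    # X: shape (patterns, features)
    return X / np.max(np.abs(X), axis=0)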
ΔW_i = η (t - o) o (1 - o) X_i

The smaller η is, the smaller ΔW is.
Start with large η, gradually decrease it.
[Figure: error E against W_i, with the operating point descending toward the global minimum.]
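One way to realize this schedule (the geometric decay form is an assumption; the slides only say to decrease η) is:

def decayed_eta(eta0, n, decay=0.99):
    # eta starts at eta0 and shrinks geometrically with iteration n
    return eta0 * (decay ** n)

print(decayed_eta(0.9, 0))    # 0.9   -- large steps early on
print(decayed_eta(0.9, 200))  # ~0.12 -- fine steps near the operating point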
Gradient descent training is typically slow. Two parameters control it:
- First parameter: η, the learning rate
- Second parameter: β, the momentum factor, 0 ≤ β ≤ 1
Use a part of the previous weight change in the current weight change:

(ΔW_i)_n = η (t - o) o (1 - o) X_i + β (ΔW_i)_{n-1}

where n indexes the iteration and β is the momentum factor.
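Extending the earlier single-neuron step with the momentum term (a sketch; names are assumptions):

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def momentum_step(w, x, t, prev_dw, eta=0.5, beta=0.05):
    o = sigmoid(np.dot(w, x))
    # (Delta W)_n = eta*(t - o)*o*(1 - o)*X + beta*(Delta W)_{n-1}
    dw = eta * (t - o) * o * (1 - o) * x + beta * prev_dw
    return w + dw, dw  # dw is carried into the next iteration

# prev_dw starts at zero for the first iteration
x = np.array([-1.0, 0.6, 0.9])
w = np.array([0.0, 0.2, -0.4])
dw = np.zeros_like(w)
w, dw = momentum_step(w, x, t=1.0, prev_dw=dw)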
Effect of β
If (ΔW_i)_n and (ΔW_i)_{n-1} are of the same sign, then (ΔW_i)_n is enhanced.
If (ΔW_i)_n and (ΔW_i)_{n-1} are of opposite sign, then the effective (ΔW_i)_n is reduced.
1) Accelerates movement at A.
2) Dampens oscillation near the global minimum.

[Figure: error E against weight W; point A lies on a long slope, and points P, Q, R, S oscillate about the operating point at the global minimum.]
(ΔW_i)_n = η (t - o) o (1 - o) X_i + β (ΔW_i)_{n-1}

The first term is the pure gradient descent term; the second is the momentum term.

Relation between η and β?
Relation between η and β

(ΔW_i)_n = η (t - o) o (1 - o) X_i + β (ΔW_i)_{n-1}

What if η >> β? What if η << β?
Relation between η and β (contd.)

If η << β:
(ΔW_i)_n ≈ β (ΔW_i)_{n-1}, a recurrence relation:

(ΔW_i)_n = β (ΔW_i)_{n-1}
         = β^2 (ΔW_i)_{n-2}
         = β^3 (ΔW_i)_{n-3}
         ...
         = β^n (ΔW_i)_0
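A tiny numeric check (my sketch) of this recurrence: with β a fraction, the update dies out geometrically as β^n (ΔW_i)_0, no matter what the error is.

beta = 0.5
dw0 = 1.0
for n in [1, 5, 10, 20]:
    print(n, (beta ** n) * dw0)
# 1  -> 0.5
# 5  -> 0.03125
# 10 -> ~0.00098
# 20 -> ~9.5e-7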
Relation between η and β (contd.)

Empirical practice: β is typically 1/10th of η.

If β is very large compared to η, the effect of the output error, the input, and the neuron characteristics is not felt. Also, ΔW keeps decreasing, since β is a fraction.