Neural Networks, 2nd Edition (Simon Haykin)
柯博昌
Chap 3. Single-Layer Perceptrons
Adaptive Filtering Problem
Dynamic System: the external behavior of the system is described by the data set T: {x(i), d(i); i = 1, 2, ..., n, ...}, where x(i) = [x_1(i), x_2(i), ..., x_m(i)]^T.

x(i) can arise from:
– Spatial: x(i) is a snapshot of data.
– Temporal: x(i) is uniformly spaced in time.

[Figure: signal-flow graph of the adaptive filter.]

Filtering Process: y(i) is produced in response to x(i); e(i) = d(i) - y(i).
Adaptive Process: automatic adjustment of the synaptic weights in accordance with e(i).
$$y(i) = v(i) = \sum_{k=1}^{m} w_k(i)\,x_k(i) = \mathbf{x}^T(i)\,\mathbf{w}(i)$$

where $\mathbf{w}(i) = [w_1(i), w_2(i), \ldots, w_m(i)]^T$, and the error is

$$e(i) = d(i) - y(i)$$
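In code, one filtering step is just an inner product followed by an error computation; a minimal NumPy sketch (the function name and arguments are illustrative, not from the slides):

```python
import numpy as np

def filter_step(w, x, d):
    """One step of the adaptive filter: y(i) = w^T x(i), e(i) = d(i) - y(i)."""
    y = np.dot(w, x)  # filter output produced in response to x(i)
    e = d - y         # error signal against the desired response d(i)
    return y, e
```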
Unconstrained Optimization Techniques
Let C(w) be a continuously differentiable function of some unknown weight (parameter) vector w; C(w) maps w into real numbers.

Goal: find an optimal solution w* that satisfies C(w*) ≤ C(w) for all w, i.e., minimize C(w) with respect to w.

Necessary condition for optimality: ∇C(w*) = 0, where ∇ is the gradient operator:
$$\nabla = \left[\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_m}\right]^T, \qquad \nabla C = \left[\frac{\partial C}{\partial w_1}, \frac{\partial C}{\partial w_2}, \ldots, \frac{\partial C}{\partial w_m}\right]^T$$
A class of unconstrained optimization algorithms: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), ..., such that the cost function C(w) is reduced at each iteration of the algorithm.
Method of Steepest Descent
The successive adjustments applied to w are in the direction of steepest descent, that is, in a direction opposite to the gradient vector ∇C(w). Let g = ∇C(w). The steepest-descent algorithm is

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\mathbf{g}(n)$$

where η is a positive constant called the step size or learning-rate parameter, so the correction is Δw(n) = w(n+1) - w(n) = -η g(n).

– Small η: overdamps the transient response.
– Large η: underdamps the transient response.
– If η exceeds a certain critical value, the algorithm becomes unstable.
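As a concrete illustration, a minimal steepest-descent loop, assuming a user-supplied `grad` callable returning ∇C(w); the stopping tolerance is an added convenience, not part of the slide:

```python
import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_iters=1000, tol=1e-8):
    """w(n+1) = w(n) - eta * g(n), where g(n) = grad C(w(n))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:  # near a stationary point: grad C(w*) ~ 0
            break
        w = w - eta * g              # step opposite the gradient
    return w

# Example: C(w) = ||w||^2 / 2 has gradient w, with minimum at the origin.
w_star = steepest_descent(lambda w: w, w0=[3.0, -2.0])
```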
Newton’s Method
Apply a second-order Taylor series expansion of C(w) around w(n):

$$\Delta C(\mathbf{w}(n)) = C(\mathbf{w}(n+1)) - C(\mathbf{w}(n)) \approx \mathbf{g}^T(n)\,\Delta\mathbf{w}(n) + \tfrac{1}{2}\,\Delta\mathbf{w}^T(n)\,\mathbf{H}(n)\,\Delta\mathbf{w}(n)$$

where H is the Hessian matrix of C(w):

$$\mathbf{H} = \nabla^2 C = \begin{bmatrix} \dfrac{\partial^2 C}{\partial w_1^2} & \dfrac{\partial^2 C}{\partial w_1 \partial w_2} & \cdots & \dfrac{\partial^2 C}{\partial w_1 \partial w_m} \\ \dfrac{\partial^2 C}{\partial w_2 \partial w_1} & \dfrac{\partial^2 C}{\partial w_2^2} & \cdots & \dfrac{\partial^2 C}{\partial w_2 \partial w_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 C}{\partial w_m \partial w_1} & \dfrac{\partial^2 C}{\partial w_m \partial w_2} & \cdots & \dfrac{\partial^2 C}{\partial w_m^2} \end{bmatrix}$$
C(w) is minimized when

$$\mathbf{g}(n) + \mathbf{H}(n)\,\Delta\mathbf{w}(n) = 0 \quad\Rightarrow\quad \Delta\mathbf{w}(n) = -\mathbf{H}^{-1}(n)\,\mathbf{g}(n)$$

so the Newton update is

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n) = \mathbf{w}(n) - \mathbf{H}^{-1}(n)\,\mathbf{g}(n)$$
Generally speaking, Newton's method converges quickly: at each iteration it minimizes the quadratic approximation of the cost function C(w) around the current point w(n).
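A sketch of the update, assuming user-supplied callables `grad` and `hess` for ∇C and the Hessian H; it solves H Δw = -g rather than forming the explicit inverse:

```python
import numpy as np

def newton_method(grad, hess, w0, n_iters=50, tol=1e-10):
    """w(n+1) = w(n) - H^{-1}(n) g(n)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        dw = np.linalg.solve(hess(w), -g)  # solve H dw = -g (cheaper, more stable)
        w = w + dw
    return w
```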
Gauss-Newton Method
The Gauss-Newton method is applicable to a cost function C(w) that is the sum of error squares. Let

$$C(\mathbf{w}) = \tfrac{1}{2}\sum_{i=1}^{n} e^2(i)$$

Linearizing the error around the current operating point w(n):

$$e'(i, \mathbf{w}) = e(i) + \left[\frac{\partial e(i)}{\partial \mathbf{w}}\right]^T (\mathbf{w} - \mathbf{w}(n)), \qquad i = 1, 2, \ldots, n$$

or, in matrix form,

$$\mathbf{e}'(n, \mathbf{w}) = \mathbf{e}(n) + \mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n))$$

where e(n) = [e(1), e(2), ..., e(n)]^T and the Jacobian J(n) is [∇e(n)]^T, the n-by-m matrix

$$\mathbf{J}(n) = \begin{bmatrix} \dfrac{\partial e(1)}{\partial w_1} & \dfrac{\partial e(1)}{\partial w_2} & \cdots & \dfrac{\partial e(1)}{\partial w_m} \\ \dfrac{\partial e(2)}{\partial w_1} & \dfrac{\partial e(2)}{\partial w_2} & \cdots & \dfrac{\partial e(2)}{\partial w_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial e(n)}{\partial w_1} & \dfrac{\partial e(n)}{\partial w_2} & \cdots & \dfrac{\partial e(n)}{\partial w_m} \end{bmatrix}$$

Goal:

$$\mathbf{w}(n+1) = \arg\min_{\mathbf{w}} \left\{ \tfrac{1}{2}\,\|\mathbf{e}'(n, \mathbf{w})\|^2 \right\}$$
Gauss-Newton Method (Cont.)
Expanding the squared norm:

$$\tfrac{1}{2}\,\|\mathbf{e}'(n,\mathbf{w})\|^2 = \tfrac{1}{2}\,\|\mathbf{e}(n)\|^2 + \mathbf{e}^T(n)\,\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n)) + \tfrac{1}{2}\,(\mathbf{w} - \mathbf{w}(n))^T\,\mathbf{J}^T(n)\,\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n))$$

Differentiating this expression with respect to w and setting the result to zero:

$$\mathbf{J}^T(n)\,\mathbf{e}(n) + \mathbf{J}^T(n)\,\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n)) = 0$$

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\,\mathbf{J}(n)\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)$$

To guard against the possibility that J(n) is rank deficient, add a small diagonal loading term δI:

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\,\mathbf{J}(n) + \delta\mathbf{I}\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)$$
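A sketch of the regularized update, assuming user-supplied callables `residual` and `jacobian` returning e(n) and J(n); the value of δ is illustrative:

```python
import numpy as np

def gauss_newton(residual, jacobian, w0, n_iters=50, delta=1e-6):
    """w(n+1) = w(n) - (J^T J + delta*I)^{-1} J^T e."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        e = residual(w)   # e(n): length-n vector of errors
        J = jacobian(w)   # J(n): n-by-m Jacobian of e(n)
        A = J.T @ J + delta * np.eye(w.size)  # diagonal loading vs. rank deficiency
        w = w - np.linalg.solve(A, J.T @ e)
    return w
```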
Linear Least-Squares Filter
Characteristics of the linear least-squares filter:
– The single neuron around which it is built is linear.
– The cost function C(w) consists of the sum of error squares.
The error vector is

$$\mathbf{e}(n) = \mathbf{d}(n) - [\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(n)]^T\,\mathbf{w}(n) = \mathbf{d}(n) - \mathbf{X}(n)\,\mathbf{w}(n)$$

where d(n) = [d(1), d(2), ..., d(n)]^T and X(n) = [x(1), x(2), ..., x(n)]^T. Hence

$$\nabla \mathbf{e}(n) = -\mathbf{X}^T(n) \quad\Rightarrow\quad \mathbf{J}(n) = -\mathbf{X}(n)$$

Substituting into the equation derived from the Gauss-Newton method:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\left(\mathbf{d}(n) - \mathbf{X}(n)\,\mathbf{w}(n)\right) = \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\mathbf{d}(n)$$

Let $\mathbf{X}^+(n) = \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)$, the pseudoinverse of X(n); then

$$\mathbf{w}(n+1) = \mathbf{X}^+(n)\,\mathbf{d}(n)$$
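In code the whole batch solution reduces to one pseudoinverse (NumPy's `np.linalg.pinv` computes the Moore-Penrose pseudoinverse, which coincides with X^+ above when X^T X is invertible); the synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))                 # rows are input vectors x(i)^T
w_true = np.array([1.0, -2.0, 0.5, 3.0])          # illustrative "true" weights
d = X @ w_true + 0.01 * rng.standard_normal(200)  # desired responses

w = np.linalg.pinv(X) @ d                         # w = X^+ d
```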
Wiener Filter

Limiting form of the linear least-squares filter for an ergodic environment. Let w_0 denote the Wiener solution to the linear optimum filtering problem.
Let R_x denote the correlation matrix of the input vector x(i). For an ergodic environment,

$$\mathbf{R}_x = E[\mathbf{x}(i)\,\mathbf{x}^T(i)] = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}(i)\,\mathbf{x}^T(i) = \lim_{n\to\infty} \frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{X}(n)$$

Let r_xd denote the cross-correlation vector of x(i) and d(i):

$$\mathbf{r}_{xd} = E[\mathbf{x}(i)\,d(i)] = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}(i)\,d(i) = \lim_{n\to\infty} \frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{d}(n)$$

Hence

$$\mathbf{w}_0 = \lim_{n\to\infty} \mathbf{w}(n+1) = \lim_{n\to\infty} \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\mathbf{d}(n) = \mathbf{R}_x^{-1}\,\mathbf{r}_{xd}$$
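A self-contained sketch: for a long record from a stationary source, the time averages below approximate R_x and r_xd, and solving R_x w = r_xd gives the Wiener solution (all data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5000, 4                   # long record approximates the ergodic limit
X = rng.standard_normal((n, m))  # rows are x(i)^T
d = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.01 * rng.standard_normal(n)

R_x = (X.T @ X) / n              # time average of x(i) x^T(i) ~ E[x x^T]
r_xd = (X.T @ d) / n             # time average of x(i) d(i)   ~ E[x d]
w0 = np.linalg.solve(R_x, r_xd)  # Wiener solution: w0 = R_x^{-1} r_xd
```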
Least-Mean-Square (LMS) Algorithm

LMS is based on instantaneous values for the cost function:

$$C(\mathbf{w}) = \tfrac{1}{2}\,e^2(n)$$

where e(n) is the error signal measured at time n. Differentiating with respect to w,

$$\frac{\partial C(\mathbf{w})}{\partial \mathbf{w}} = e(n)\,\frac{\partial e(n)}{\partial \mathbf{w}}$$

Because e(n) = d(n) - x^T(n)w(n), we have ∂e(n)/∂w = -x(n), so the instantaneous gradient estimate and weight update are

$$\hat{\mathbf{g}}(n) = -\mathbf{x}(n)\,e(n), \qquad \hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,e(n)$$

ŵ(n) is used in place of w(n) to emphasize that LMS produces an estimate of the weight vector that would result from the method of steepest descent.

Summary of the LMS Algorithm
– Training sample: input signal vector x(n); desired response d(n)
– User-selected parameter: η
– Initialization: set ŵ(0) = 0
– Computation: for n = 1, 2, ..., compute
  e(n) = d(n) - ŵ^T(n)x(n)
  ŵ(n+1) = ŵ(n) + η x(n) e(n)
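The summary transcribes directly into a short loop; η and the data layout are illustrative:

```python
import numpy as np

def lms(X, d, eta=0.01):
    """LMS: w_hat(n+1) = w_hat(n) + eta * x(n) * e(n), starting from w_hat(0) = 0."""
    w = np.zeros(X.shape[1])    # initialization: w_hat(0) = 0
    for x_n, d_n in zip(X, d):  # one pass over the training samples
        e = d_n - w @ x_n       # e(n) = d(n) - w_hat^T(n) x(n)
        w = w + eta * x_n * e   # weight update
    return w
```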
Virtues and Limitations of LMS

Virtues:
– Simplicity

Limitations:
– Slow rate of convergence
– Sensitivity to variations in the eigenstructure of the input
Learning Curve
Learning Rate Annealing
Normal approach: η(n) = η_0 for all n.

Stochastic approximation: η(n) = c/n, where c is a constant. There is a danger of parameter blowup for small n when c is large.

Search-then-converge schedule:

$$\eta(n) = \frac{\eta_0}{1 + n/\tau}$$

where η_0 and τ are constants.
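The three schedules side by side; the default values of η_0, c, and τ are illustrative:

```python
def eta_constant(n, eta0=0.1):
    return eta0                    # eta(n) = eta0 for all n

def eta_stochastic(n, c=1.0):
    return c / n                   # eta(n) = c/n; large c risks blowup at small n

def eta_search_then_converge(n, eta0=0.1, tau=100.0):
    return eta0 / (1.0 + n / tau)  # ~eta0 early (search), ~eta0*tau/n late (converge)
```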
Perceptron
[Figure: signal-flow graph of the perceptron. Inputs x_1, x_2, ..., x_m, weighted by w_1, w_2, ..., w_m, plus a bias b, are summed to form the induced local field v, which passes through a hard limiter to produce the output y.]

$$v = \sum_{i=1}^{m} w_i\,x_i + b$$

Let x_0 = 1 and b = w_0; then

$$v(n) = \sum_{i=0}^{m} w_i(n)\,x_i(n) = \mathbf{w}^T(n)\,\mathbf{x}(n)$$
The perceptron is the simplest form used for the classification of patterns said to be linearly separable.

Goal: classify the set {x(1), x(2), ..., x(n)} into one of two classes, C1 or C2.

Decision rule: assign x(i) to class C1 if y = +1 and to class C2 if y = -1.
– w^T x > 0 for every input vector x belonging to class C1
– w^T x ≤ 0 for every input vector x belonging to class C2
Perceptron (Cont.)
Algorithms:

1. No correction when x(n) is classified correctly:
   w(n+1) = w(n) if w^T x(n) > 0 and x(n) belongs to class C1
   w(n+1) = w(n) if w^T x(n) ≤ 0 and x(n) belongs to class C2

2. Correction otherwise:
   w(n+1) = w(n) - η(n)x(n) if w^T x(n) > 0 and x(n) belongs to class C2
   w(n+1) = w(n) + η(n)x(n) if w^T x(n) ≤ 0 and x(n) belongs to class C1

Let

$$d(n) = \begin{cases} +1 & \text{if } \mathbf{x}(n) \text{ belongs to class } C_1 \\ -1 & \text{if } \mathbf{x}(n) \text{ belongs to class } C_2 \end{cases}$$

Then both cases combine into the error-correction learning rule form (see the sketch below):

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\,[d(n) - y(n)]\,\mathbf{x}(n)$$

A smaller η provides stable weight estimates; a larger η provides fast adaptation.
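A sketch of the error-correction form with a sign hard limiter; the data layout (bias input x_0 = 1 folded into each row of X) and η are illustrative:

```python
import numpy as np

def perceptron_train(X, d, eta=1.0, n_epochs=100):
    """w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n), with y the hard-limited w^T x.

    Rows of X include the fixed bias input x_0 = 1; entries of d are +1 or -1.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for x_n, d_n in zip(X, d):
            y = 1.0 if w @ x_n > 0 else -1.0   # hard limiter output
            if y != d_n:
                w = w + eta * (d_n - y) * x_n  # error-correction update
                mistakes += 1
        if mistakes == 0:  # all patterns classified correctly: converged
            break
    return w
```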