Neural Networks, 2nd Edition (Simon Haykin)
柯博昌
Chap 3. Single-Layer Perceptrons
Adaptive Filtering Problem
Dynamic System: the external behavior of the system is described by the data set T: {x(i), d(i); i = 1, 2, ..., n, ...}, where x(i) = [x_1(i), x_2(i), ..., x_m(i)]^T.

x(i) can arise from:
– Spatial: x(i) is a snapshot of data.
– Temporal: x(i) is uniformly spaced in time.

[Figure: signal-flow graph of the adaptive filter.]

Filtering Process: y(i) is produced in response to x(i); e(i) = d(i) - y(i).
Adaptive Process: automatic adjustment of the synaptic weights in accordance with e(i).
$$y(i) = v(i) = \sum_{k=1}^{m} w_k(i)\,x_k(i) = \mathbf{x}^T(i)\,\mathbf{w}(i)$$

where $\mathbf{w}(i) = [w_1(i), w_2(i), \ldots, w_m(i)]^T$, and the error is

$$e(i) = d(i) - y(i)$$
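In code, one filtering step is just an inner product followed by an error computation; a minimal NumPy sketch (the function name and arguments are illustrative, not from the slides):

```python
import numpy as np

def filter_step(w, x, d):
    """One step of the adaptive filter: y(i) = w^T x(i), e(i) = d(i) - y(i)."""
    y = np.dot(w, x)  # filter output produced in response to x(i)
    e = d - y         # error signal against the desired response d(i)
    return y, e
```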
Unconstrained Optimization Techniques
Let C(w) be a continuously differentiable function of some unknown weight (parameter) vector w; C(w) maps w into real numbers.

Goal: find an optimal solution w* that satisfies C(w*) ≤ C(w) for all w, i.e., minimize C(w) with respect to w.

Necessary condition for optimality: ∇C(w*) = 0, where ∇ is the gradient operator:
$$\nabla = \left[\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_m}\right]^T, \qquad \nabla C = \left[\frac{\partial C}{\partial w_1}, \frac{\partial C}{\partial w_2}, \ldots, \frac{\partial C}{\partial w_m}\right]^T$$
A class of unconstrained optimization algorithms: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), ..., such that the cost function C(w) is reduced at each iteration of the algorithm.
Method of Steepest Descent
The successive adjustments applied to w are in the direction of steepest descent, that is, in a direction opposite to the gradient vector ∇C(w). Let g = ∇C(w). The steepest-descent algorithm is

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\mathbf{g}(n)$$

where η is a positive constant called the step size or learning-rate parameter, so the correction is Δw(n) = w(n+1) - w(n) = -η g(n).

– Small η: overdamps the transient response.
– Large η: underdamps the transient response.
– If η exceeds a certain critical value, the algorithm becomes unstable.
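As a concrete illustration, a minimal steepest-descent loop, assuming a user-supplied `grad` callable returning ∇C(w); the stopping tolerance is an added convenience, not part of the slide:

```python
import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_iters=1000, tol=1e-8):
    """w(n+1) = w(n) - eta * g(n), where g(n) = grad C(w(n))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:  # near a stationary point: grad C(w*) ~ 0
            break
        w = w - eta * g              # step opposite the gradient
    return w

# Example: C(w) = ||w||^2 / 2 has gradient w, with minimum at the origin.
w_star = steepest_descent(lambda w: w, w0=[3.0, -2.0])
```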
Newton’s Method
Apply a second-order Taylor series expansion of C(w) around w(n):

$$\Delta C(\mathbf{w}(n)) = C(\mathbf{w}(n+1)) - C(\mathbf{w}(n)) \approx \mathbf{g}^T(n)\,\Delta\mathbf{w}(n) + \tfrac{1}{2}\,\Delta\mathbf{w}^T(n)\,\mathbf{H}(n)\,\Delta\mathbf{w}(n)$$

where H is the Hessian matrix of C(w):

$$\mathbf{H} = \nabla^2 C = \begin{bmatrix} \dfrac{\partial^2 C}{\partial w_1^2} & \dfrac{\partial^2 C}{\partial w_1 \partial w_2} & \cdots & \dfrac{\partial^2 C}{\partial w_1 \partial w_m} \\ \dfrac{\partial^2 C}{\partial w_2 \partial w_1} & \dfrac{\partial^2 C}{\partial w_2^2} & \cdots & \dfrac{\partial^2 C}{\partial w_2 \partial w_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 C}{\partial w_m \partial w_1} & \dfrac{\partial^2 C}{\partial w_m \partial w_2} & \cdots & \dfrac{\partial^2 C}{\partial w_m^2} \end{bmatrix}$$
C(w) is minimized when

$$\mathbf{g}(n) + \mathbf{H}(n)\,\Delta\mathbf{w}(n) = 0 \quad\Rightarrow\quad \Delta\mathbf{w}(n) = -\mathbf{H}^{-1}(n)\,\mathbf{g}(n)$$

so the Newton update is

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n) = \mathbf{w}(n) - \mathbf{H}^{-1}(n)\,\mathbf{g}(n)$$
Generally speaking, Newton's method converges quickly: at each iteration it minimizes the quadratic approximation of the cost function C(w) around the current point w(n).
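A sketch of the update, assuming user-supplied callables `grad` and `hess` for ∇C and the Hessian H; it solves H Δw = -g rather than forming the explicit inverse:

```python
import numpy as np

def newton_method(grad, hess, w0, n_iters=50, tol=1e-10):
    """w(n+1) = w(n) - H^{-1}(n) g(n)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        dw = np.linalg.solve(hess(w), -g)  # solve H dw = -g (cheaper, more stable)
        w = w + dw
    return w
```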
Gauss-Newton Method
The Gauss-Newton method is applicable to a cost function C(w) that is the sum of error squares. Let

$$C(\mathbf{w}) = \tfrac{1}{2}\sum_{i=1}^{n} e^2(i)$$

Linearizing the error around the current operating point w(n):

$$e'(i, \mathbf{w}) = e(i) + \left[\frac{\partial e(i)}{\partial \mathbf{w}}\right]^T (\mathbf{w} - \mathbf{w}(n)), \qquad i = 1, 2, \ldots, n$$

or, in matrix form,

$$\mathbf{e}'(n, \mathbf{w}) = \mathbf{e}(n) + \mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n))$$

where e(n) = [e(1), e(2), ..., e(n)]^T and the Jacobian J(n) is [∇e(n)]^T, the n-by-m matrix

$$\mathbf{J}(n) = \begin{bmatrix} \dfrac{\partial e(1)}{\partial w_1} & \dfrac{\partial e(1)}{\partial w_2} & \cdots & \dfrac{\partial e(1)}{\partial w_m} \\ \dfrac{\partial e(2)}{\partial w_1} & \dfrac{\partial e(2)}{\partial w_2} & \cdots & \dfrac{\partial e(2)}{\partial w_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial e(n)}{\partial w_1} & \dfrac{\partial e(n)}{\partial w_2} & \cdots & \dfrac{\partial e(n)}{\partial w_m} \end{bmatrix}$$

Goal:

$$\mathbf{w}(n+1) = \arg\min_{\mathbf{w}} \left\{ \tfrac{1}{2}\,\|\mathbf{e}'(n, \mathbf{w})\|^2 \right\}$$
Gauss-Newton Method (Cont.)
Expanding the squared norm:

$$\tfrac{1}{2}\,\|\mathbf{e}'(n,\mathbf{w})\|^2 = \tfrac{1}{2}\,\|\mathbf{e}(n)\|^2 + \mathbf{e}^T(n)\,\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n)) + \tfrac{1}{2}\,(\mathbf{w} - \mathbf{w}(n))^T\,\mathbf{J}^T(n)\,\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n))$$

Differentiating this expression with respect to w and setting the result to zero:

$$\mathbf{J}^T(n)\,\mathbf{e}(n) + \mathbf{J}^T(n)\,\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n)) = 0$$

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\,\mathbf{J}(n)\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)$$

To guard against the possibility that J(n) is rank deficient, add a small diagonal loading term δI:

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\,\mathbf{J}(n) + \delta\mathbf{I}\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)$$
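A sketch of the regularized update, assuming user-supplied callables `residual` and `jacobian` returning e(n) and J(n); the value of δ is illustrative:

```python
import numpy as np

def gauss_newton(residual, jacobian, w0, n_iters=50, delta=1e-6):
    """w(n+1) = w(n) - (J^T J + delta*I)^{-1} J^T e."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        e = residual(w)   # e(n): length-n vector of errors
        J = jacobian(w)   # J(n): n-by-m Jacobian of e(n)
        A = J.T @ J + delta * np.eye(w.size)  # diagonal loading vs. rank deficiency
        w = w - np.linalg.solve(A, J.T @ e)
    return w
```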
Linear Least-Squares Filter
Characteristics of the linear least-squares filter:
– The single neuron around which it is built is linear.
– The cost function C(w) consists of the sum of error squares.
The error vector is

$$\mathbf{e}(n) = \mathbf{d}(n) - [\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(n)]^T\,\mathbf{w}(n) = \mathbf{d}(n) - \mathbf{X}(n)\,\mathbf{w}(n)$$

where d(n) = [d(1), d(2), ..., d(n)]^T and X(n) = [x(1), x(2), ..., x(n)]^T. Hence

$$\nabla \mathbf{e}(n) = -\mathbf{X}^T(n) \quad\Rightarrow\quad \mathbf{J}(n) = -\mathbf{X}(n)$$

Substituting into the equation derived from the Gauss-Newton method:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\left(\mathbf{d}(n) - \mathbf{X}(n)\,\mathbf{w}(n)\right) = \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\mathbf{d}(n)$$

Let $\mathbf{X}^+(n) = \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)$, the pseudoinverse of X(n); then

$$\mathbf{w}(n+1) = \mathbf{X}^+(n)\,\mathbf{d}(n)$$
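In code the whole batch solution reduces to one pseudoinverse (NumPy's `np.linalg.pinv` computes the Moore-Penrose pseudoinverse, which coincides with X^+ above when X^T X is invertible); the synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))                 # rows are input vectors x(i)^T
w_true = np.array([1.0, -2.0, 0.5, 3.0])          # illustrative "true" weights
d = X @ w_true + 0.01 * rng.standard_normal(200)  # desired responses

w = np.linalg.pinv(X) @ d                         # w = X^+ d
```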
Wiener Filter

Limiting form of the linear least-squares filter for an ergodic environment. Let w_0 denote the Wiener solution to the linear optimum filtering problem.
Let R_x denote the correlation matrix of the input vector x(i). For an ergodic environment,

$$\mathbf{R}_x = E[\mathbf{x}(i)\,\mathbf{x}^T(i)] = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}(i)\,\mathbf{x}^T(i) = \lim_{n\to\infty} \frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{X}(n)$$

Let r_xd denote the cross-correlation vector of x(i) and d(i):

$$\mathbf{r}_{xd} = E[\mathbf{x}(i)\,d(i)] = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}(i)\,d(i) = \lim_{n\to\infty} \frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{d}(n)$$

Hence

$$\mathbf{w}_0 = \lim_{n\to\infty} \mathbf{w}(n+1) = \lim_{n\to\infty} \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\mathbf{d}(n) = \mathbf{R}_x^{-1}\,\mathbf{r}_{xd}$$
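A self-contained sketch: for a long record from a stationary source, the time averages below approximate R_x and r_xd, and solving R_x w = r_xd gives the Wiener solution (all data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5000, 4                   # long record approximates the ergodic limit
X = rng.standard_normal((n, m))  # rows are x(i)^T
d = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.01 * rng.standard_normal(n)

R_x = (X.T @ X) / n              # time average of x(i) x^T(i) ~ E[x x^T]
r_xd = (X.T @ d) / n             # time average of x(i) d(i)   ~ E[x d]
w0 = np.linalg.solve(R_x, r_xd)  # Wiener solution: w0 = R_x^{-1} r_xd
```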
Least-Mean-Square (LMS) Algorithm

LMS is based on instantaneous values for the cost function:

$$C(\mathbf{w}) = \tfrac{1}{2}\,e^2(n)$$

where e(n) is the error signal measured at time n. Differentiating with respect to w,

$$\frac{\partial C(\mathbf{w})}{\partial \mathbf{w}} = e(n)\,\frac{\partial e(n)}{\partial \mathbf{w}}$$

Because e(n) = d(n) - x^T(n)w(n), we have ∂e(n)/∂w = -x(n), so the instantaneous gradient estimate and weight update are

$$\hat{\mathbf{g}}(n) = -\mathbf{x}(n)\,e(n), \qquad \hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,e(n)$$

ŵ(n) is used in place of w(n) to emphasize that LMS produces an estimate of the weight vector that would result from the method of steepest descent.

Summary of the LMS Algorithm
– Training sample: input signal vector x(n); desired response d(n)
– User-selected parameter: η
– Initialization: set ŵ(0) = 0
– Computation: for n = 1, 2, ..., compute
  e(n) = d(n) - ŵ^T(n)x(n)
  ŵ(n+1) = ŵ(n) + η x(n) e(n)
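The summary transcribes directly into a short loop; η and the data layout are illustrative:

```python
import numpy as np

def lms(X, d, eta=0.01):
    """LMS: w_hat(n+1) = w_hat(n) + eta * x(n) * e(n), starting from w_hat(0) = 0."""
    w = np.zeros(X.shape[1])    # initialization: w_hat(0) = 0
    for x_n, d_n in zip(X, d):  # one pass over the training samples
        e = d_n - w @ x_n       # e(n) = d(n) - w_hat^T(n) x(n)
        w = w + eta * x_n * e   # weight update
    return w
```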
Virtues and Limitations of LMS

Virtues:
– Simplicity

Limitations:
– Slow rate of convergence
– Sensitivity to variations in the eigenstructure of the input
Learning Curve
Learning Rate Annealing
Normal approach: η(n) = η_0 for all n.

Stochastic approximation: η(n) = c/n, where c is a constant. There is a danger of parameter blowup for small n when c is large.

Search-then-converge schedule:

$$\eta(n) = \frac{\eta_0}{1 + n/\tau}$$

where η_0 and τ are constants.
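The three schedules side by side; the default values of η_0, c, and τ are illustrative:

```python
def eta_constant(n, eta0=0.1):
    return eta0                    # eta(n) = eta0 for all n

def eta_stochastic(n, c=1.0):
    return c / n                   # eta(n) = c/n; large c risks blowup at small n

def eta_search_then_converge(n, eta0=0.1, tau=100.0):
    return eta0 / (1.0 + n / tau)  # ~eta0 early (search), ~eta0*tau/n late (converge)
```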
Perceptron
[Figure: signal-flow graph of the perceptron. Inputs x_1, x_2, ..., x_m, weighted by w_1, w_2, ..., w_m, plus a bias b, are summed to form the induced local field v, which passes through a hard limiter to produce the output y.]

$$v = \sum_{i=1}^{m} w_i\,x_i + b$$

Let x_0 = 1 and b = w_0; then

$$v(n) = \sum_{i=0}^{m} w_i(n)\,x_i(n) = \mathbf{w}^T(n)\,\mathbf{x}(n)$$
The perceptron is the simplest form used for the classification of patterns said to be linearly separable.

Goal: classify the set {x(1), x(2), ..., x(n)} into one of two classes, C1 or C2.

Decision rule: assign x(i) to class C1 if y = +1 and to class C2 if y = -1.
– w^T x > 0 for every input vector x belonging to class C1
– w^T x ≤ 0 for every input vector x belonging to class C2
Perceptron (Cont.)
Algorithms:

1. No correction when x(n) is classified correctly:
   w(n+1) = w(n) if w^T x(n) > 0 and x(n) belongs to class C1
   w(n+1) = w(n) if w^T x(n) ≤ 0 and x(n) belongs to class C2

2. Correction otherwise:
   w(n+1) = w(n) - η(n)x(n) if w^T x(n) > 0 and x(n) belongs to class C2
   w(n+1) = w(n) + η(n)x(n) if w^T x(n) ≤ 0 and x(n) belongs to class C1

Let

$$d(n) = \begin{cases} +1 & \text{if } \mathbf{x}(n) \text{ belongs to class } C_1 \\ -1 & \text{if } \mathbf{x}(n) \text{ belongs to class } C_2 \end{cases}$$

Then both cases combine into the error-correction learning rule form (see the sketch below):

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\,[d(n) - y(n)]\,\mathbf{x}(n)$$

A smaller η provides stable weight estimates; a larger η provides fast adaptation.
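A sketch of the error-correction form with a sign hard limiter; the data layout (bias input x_0 = 1 folded into each row of X) and η are illustrative:

```python
import numpy as np

def perceptron_train(X, d, eta=1.0, n_epochs=100):
    """w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n), with y the hard-limited w^T x.

    Rows of X include the fixed bias input x_0 = 1; entries of d are +1 or -1.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for x_n, d_n in zip(X, d):
            y = 1.0 if w @ x_n > 0 else -1.0   # hard limiter output
            if y != d_n:
                w = w + eta * (d_n - y) * x_n  # error-correction update
                mistakes += 1
        if mistakes == 0:  # all patterns classified correctly: converged
            break
    return w
```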