

Page 1: Section5 Rbf

Section 5: Radial Basis Function (RBF) Networks

Course: Introduction to Neural Networks
Instructor: Jeen-Shing Wang

Department of Electrical Engineering, National Cheng Kung University

Fall, 2005

Page 2: Section5 Rbf


Outline

- Origin: Cover's theorem
- Interpolation problem
- Regularization theory
- Generalized RBFN
- Universal approximation
- Comparison with MLP
- RBFN = kernel regression
- Learning: centers, widths, and weights
- Simulations

Page 3: Section5 Rbf


Origin: Cover’s Theorem

A complex pattern-classification problem cast in a high dimensional space nonlinearly is more likely to be linearly separable than in a low dimensional space (Cover, 1965).

Cover, T. M., 1965. "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition." IEEE Transactions on Electronic Computers, EC-14, 326-334.

Page 4: Section5 Rbf


Cover’s Theorem

Cover's theorem on the separability of patterns (1965): N patterns x_1, x_2, ..., x_N, with x_i \in R^p, are assigned to two classes C_1 and C_2.

\varphi-separability: there exists a weight vector w such that

w^T \varphi(x) \ge 0 for x \in C_1,
w^T \varphi(x) < 0 for x \in C_2,

where \varphi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_M(x)]^T.

Page 5: Section5 Rbf


Cover’s Theorem (cont’d)

Two basic ingredients of Cover's theorem: (i) nonlinear functions \varphi(x); (ii) dimension of the hidden space (M) greater than the dimension of the input space (p), which pushes the probability of separability closer to 1.

[Figure: three dichotomies of points in the plane: (a) linearly separable, (b) spherically separable, (c) quadrically separable]
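To make Cover's idea concrete, here is a minimal sketch (not from the slides): the XOR points are not linearly separable in the input plane, but a simple nonlinear map phi(x) = [x1, x2, x1*x2] makes them linearly separable; the particular map and the separating weights are illustrative choices.

# A quick check of Cover's idea: XOR is not linearly separable in the input
# space, but phi(x) = [x1, x2, x1*x2] makes it linearly separable.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])                                  # XOR class labels

phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])     # nonlinear hidden mapping
w, b = np.array([1.0, 1.0, -2.0]), -0.5                          # one separating hyperplane

print((phi @ w + b > 0).astype(int))   # prints [0 1 1 0], matching the XOR labels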

Page 6: Section5 Rbf


Interpolation Problem

Given N points (x_i, d_i), x_i \in R^p, d_i \in R, 1 \le i \le N: find a function F such that F(x_i) = d_i for 1 \le i \le N.

Radial basis function (RBF) technique (Powell, 1988):

F(x) = \sum_{i=1}^{N} w_i \varphi(\|x - x_i\|)

- \varphi(\|x - x_i\|) are arbitrary nonlinear functions
- The number of functions is the same as the number of data points
- The centers are fixed at the known points x_i

Page 7: Section5 Rbf


Interpolation Problem (cont’d)

Matrix form: with \varphi_{ji} = \varphi(\|x_j - x_i\|) and F(x_i) = d_i, the interpolation conditions

F(x_j) = \sum_{i=1}^{N} w_i \varphi(\|x_j - x_i\|) = d_j

become

\begin{bmatrix} \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1N} \\ \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2N} \\ \vdots & \vdots & & \vdots \\ \varphi_{N1} & \varphi_{N2} & \cdots & \varphi_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix},

i.e., \Phi w = d, so w = \Phi^{-1} d.

Vital question: is \Phi non-singular?
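To make the matrix form concrete, here is a minimal sketch (not from the slides; the function names and the width sigma = 0.2 are illustrative) that builds \Phi with a Gaussian basis and solves \Phi w = d:

# Exact RBF interpolation with a Gaussian basis: build the N x N matrix Phi
# from pairwise distances and solve Phi w = d.
import numpy as np

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2.0 * sigma**2))

def fit_exact_rbf(X, d, sigma=1.0):
    # X: (N, p) training inputs, d: (N,) targets
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # ||x_j - x_i||
    Phi = gaussian(dist, sigma)
    return np.linalg.solve(Phi, d)                                  # w = Phi^{-1} d

def predict(Xq, X, w, sigma=1.0):
    dist = np.linalg.norm(Xq[:, None, :] - X[None, :, :], axis=-1)
    return gaussian(dist, sigma) @ w

# usage: interpolate a 1-D function exactly at the training points
X = np.linspace(0, 1, 10).reshape(-1, 1)
d = np.sin(2 * np.pi * X).ravel()
w = fit_exact_rbf(X, d, sigma=0.2)
print(np.allclose(predict(X, X, w, sigma=0.2), d))   # True: F(x_i) = d_i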

Page 8: Section5 Rbf


Micchelli's Theorem

If the points x_i are distinct, \Phi is non-singular (regardless of the dimension of the input space).

Valid for a large class of RBF functions, for example:

1. Inverse multiquadrics: \varphi(r) = \frac{1}{(r^2 + c^2)^{1/2}} for some c > 0 and r \ge 0

2. Gaussian functions: \varphi(r) = \exp\left(-\frac{r^2}{2\sigma^2}\right) for \sigma > 0 and r \ge 0

Page 9: Section5 Rbf


Learning: Ill and Well Posed Problems

Given a set of data points, learning is viewed as a hypersurface reconstruction or approximation problem, i.e., an inverse problem.

Well-posed problem:
- A mapping from input to output exists for all input values
- The mapping is unique
- The mapping function is continuous

Ill-posed problem:
- Noisy or imprecise data adds uncertainty, so the mapping cannot be reconstructed uniquely
- Not enough training data to reconstruct the mapping uniquely
- Degraded generalization performance; regularization is needed

Page 10: Section5 Rbf


Regularization Theory

The basic idea of regularization is to stabilize the solution by means of an auxiliary functional that embeds prior information, e.g., smoothness constraints on the input-output mapping (i.e., on the solution to the approximation problem), and thereby turn an ill-posed problem into a well-posed one (Poggio and Girosi, 1990).

Page 11: Section5 Rbf


Solution to the Regularization Problem

Minimize the cost function E(F) with respect to F:

E(F) = E_s(F) + \lambda E_c(F) = \frac{1}{2} \sum_{i=1}^{N} [d_i - F(x_i)]^2 + \frac{\lambda}{2} \|C F\|^2

where the first term is the standard error term, the second is the regularizing term, and \lambda is the regularization parameter.

Page 12: Section5 Rbf


Solution to the Regularization Problem

Poggio & Girosi (1990): if C is a (problem-dependent) linear differential operator, the solution to

\min_F E(F) = \frac{1}{2} \sum_{i=1}^{N} [d_i - F(x_i)]^2 + \frac{\lambda}{2} \|C F\|^2

is of the following form:

F(x) = \sum_{i=1}^{N} w_i G(x, x_i),

where G(\cdot, \cdot) is a Green's function and

w = (G + \lambda I)^{-1} d, with G_{ki} = G(x_k, x_i).
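A minimal sketch (not from the slides; the function names are illustrative) of this regularized solution with a Gaussian Green's function; setting lam = 0 recovers the exact interpolation weights:

# Regularization-network weights: w = (G + lambda*I)^{-1} d.
import numpy as np

def green_matrix(X, sigma=1.0):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-dist**2 / (2.0 * sigma**2))          # G_ki = G(x_k, x_i)

def regularized_weights(X, d, lam=1e-2, sigma=1.0):
    G = green_matrix(X, sigma)
    return np.linalg.solve(G + lam * np.eye(len(X)), d)  # w = (G + lam*I)^{-1} d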

Page 13: Section5 Rbf


Interpolation vs Regularization

Interpolation:
- Exact interpolator: F(x) = \sum_{i=1}^{N} w_i \varphi(\|x - x_i\|), with w = \Phi^{-1} d
- Possible RBF: \varphi(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma_i^2}\right)

Regularization:
- Exact interpolator: F(x) = \sum_{i=1}^{N} w_i G(x, x_i), with w = (G + \lambda I)^{-1} d
- Equal to the "interpolation" solution if \lambda = 0
- Example of a Green's function: G(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma_i^2}\right)

Page 14: Section5 Rbf


Generalized RBF Network (GRBFN)

As many radial basis functions as training patterns:
- Computationally intensive
- Ill-conditioned matrix
- Regularization is not easy (C is problem-dependent)

Possible solution → generalized RBFN approach, with K \ll N:

F(x) = \sum_{i=1}^{K} w_i \varphi(x, c_i)

Typically:

\varphi(x, c_i) = \exp\left(-\frac{\|x - c_i\|^2}{2\sigma_i^2}\right)

Adjustable parameters: w_i, c_i, and \sigma_i (see the sketch below).
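A minimal sketch (not from the slides; the names are illustrative) of the generalized RBFN forward pass with K adjustable Gaussian units:

# GRBFN forward pass: K Gaussian units with centers c_i and widths sigma_i, K << N.
import numpy as np

def grbfn_forward(X, centers, sigmas, w):
    # X: (N, p) inputs, centers: (K, p), sigmas: (K,), w: (K,)
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)  # ||x - c_i||
    Phi = np.exp(-dist**2 / (2.0 * sigmas[None, :]**2))                  # phi(x, c_i)
    return Phi @ w                                                       # F(x) = sum_i w_i phi(x, c_i)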

Page 15: Section5 Rbf


D-Dimensional Gaussian Distribution

Let X = [x_1, x_2, \ldots, x_d]^T with mean \mu = [\mu_1, \mu_2, \ldots, \mu_d]^T and covariance matrix \Sigma. The general form of the d-dimensional Gaussian density is

p(X) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \right\}

If x_1, x_2, \ldots, x_d are independent of each other, then \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2) and

p(X) = \frac{1}{(2\pi)^{d/2} \sigma_1 \sigma_2 \cdots \sigma_d} \exp\left\{ -\frac{(x_1 - \mu_1)^2}{2\sigma_1^2} - \frac{(x_2 - \mu_2)^2}{2\sigma_2^2} - \cdots - \frac{(x_d - \mu_d)^2}{2\sigma_d^2} \right\}

If, in addition, \sigma_1 = \sigma_2 = \cdots = \sigma_d = \sigma, the density reduces to

p(X) = \frac{1}{(2\pi)^{d/2} \sigma^d} \exp\left\{ -\frac{\|X - \mu\|^2}{2\sigma^2} \right\}
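For reference, here is a minimal sketch (not from the slides; the names are illustrative) that evaluates the general and isotropic forms of the density and checks that they agree when \Sigma = \sigma^2 I:

# d-dimensional Gaussian density: general (full covariance) and isotropic forms.
import numpy as np

def gauss_general(X, mu, Sigma):
    d = len(mu)
    diff = X - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def gauss_isotropic(X, mu, sigma):
    d = len(mu)
    norm = (2 * np.pi) ** (d / 2) * sigma ** d
    return np.exp(-np.sum((X - mu) ** 2) / (2 * sigma ** 2)) / norm

# with Sigma = sigma^2 * I the two forms agree
x, mu = np.array([0.3, -0.1]), np.zeros(2)
print(np.isclose(gauss_general(x, mu, 0.25 * np.eye(2)), gauss_isotropic(x, mu, 0.5)))  # True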

Page 16: Section5 Rbf


D-Dimensional Gaussian Distribution

2-Dimensional Gaussian Distribution

X = [x_1, x_2]^T, \quad \mu = [0, 0]^T, \quad \Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

[Figure: samples of the 2-D standard Gaussian plotted over the (x1, x2) plane]

Page 17: Section5 Rbf


Radial Basis Function Networks

\varphi(x, c_i) = \exp\left(-\frac{\|x - c_i\|^2}{2\sigma_i^2}\right)

F(x) = \sum_{i=1}^{K} w_i \varphi(x, x_i)

[Figure: RBF network architecture with inputs x_1, ..., x_4, hidden units j = 1, 2, 3, ..., J, and outputs O_1, ..., O_K]

Page 18: Section5 Rbf


RBFN: Universal Approximation

Park & Sandberg (1991): for any continuous input-output mapping function f(x), there exists an RBFN

F(x) = \sum_{i=1}^{K} w_i \varphi(x, x_i)

such that \|f(x) - F(x)\|_{L_p} < \epsilon, with p \in [1, \infty).

- The theorem is actually stronger (radial symmetry is not needed)
- K is not specified
- Provides a theoretical basis for practical RBFNs!

Page 19: Section5 Rbf


Kernel Regression

Consider the nonlinear regression model:

y_i = f(x_i) + \epsilon_i, \quad i = 1, 2, \ldots, N

Recall:

f(x) = E[y \mid x] = \int_{-\infty}^{\infty} y\, f_Y(y \mid x)\, dy    (1)

From probability theory,

f_Y(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}    (2)

By using (2) in (1),

f(x) = \frac{\int_{-\infty}^{\infty} y\, f_{X,Y}(x, y)\, dy}{f_X(x)}    (3)

Page 20: Section5 Rbf


Kernel Regression

We do not know f_{X,Y}(x, y), but it can be estimated by the Parzen-Rosenblatt density estimator:

\hat{f}_X(x) = \frac{1}{N h^{m_0}} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) \quad \text{for } x \in R^{m_0}    (4)

\hat{f}_{X,Y}(x, y) = \frac{1}{N h^{m_0 + 1}} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) K\left(\frac{y - y_i}{h}\right) \quad \text{for } x \in R^{m_0},\ y \in R

where h is a positive number (the bandwidth) and K(\cdot) is a kernel symmetric about the origin with \int_{R^{m_0}} K(x)\, dx = 1.
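A minimal sketch (not from the slides; the names are illustrative) of the estimator in (4) with a Gaussian kernel K, which is symmetric about the origin and integrates to 1:

# Parzen-Rosenblatt density estimate at one query point.
import numpy as np

def parzen_density(x_query, X, h):
    # x_query: (m0,), X: (N, m0) samples, h: bandwidth
    N, m0 = X.shape
    u = (x_query - X) / h                                              # (x - x_i) / h
    K = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (m0 / 2)  # Gaussian kernel
    return K.sum() / (N * h**m0)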

Page 21: Section5 Rbf


Kernel Regression

Multiplying \hat{f}_{X,Y}(x, y) by y and integrating over y, using the change of variable z = (y - y_i)/h and the symmetry of K, we get:

\int_{-\infty}^{\infty} y\, \hat{f}_{X,Y}(x, y)\, dy = \frac{1}{N h^{m_0}} \sum_{i=1}^{N} y_i\, K\left(\frac{x - x_i}{h}\right)    (5)

Page 22: Section5 Rbf


Kernel Regression

Using (4) and (5) as estimates of the denominator and numerator of (3):

F(x) = \hat{f}(x) = \frac{\sum_{i=1}^{N} y_i\, K\left(\frac{x - x_i}{h}\right)}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)}    (6)

Page 23: Section5 Rbf


Nadaraya-Watson Regression Estimator

By defining the normalized weighting function

W_{N,i}(x) = \frac{K\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{N} K\left(\frac{x - x_j}{h}\right)}, \quad i = 1, 2, \ldots, N,

with \sum_{i=1}^{N} W_{N,i}(x) = 1 for all x, we can rewrite (6) as

F(x) = \sum_{i=1}^{N} W_{N,i}(x)\, y_i.

F(x) is a weighted average of the y-observables.
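The estimator in (6) is straightforward to implement; the following is a minimal sketch (not from the slides; the names are illustrative) using a Gaussian kernel, whose normalization constant cancels in the weights W_{N,i}(x):

# Nadaraya-Watson estimator: a kernel-weighted average of the observed y values.
import numpy as np

def nadaraya_watson(x_query, X, y, h):
    # x_query: (m0,), X: (N, m0), y: (N,), h: bandwidth
    u = (x_query - X) / h
    K = np.exp(-0.5 * np.sum(u**2, axis=1))   # Gaussian kernel (normalization cancels)
    W = K / K.sum()                           # normalized weights W_{N,i}(x)
    return W @ y                              # F(x) = sum_i W_{N,i}(x) y_i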

Page 24: Section5 Rbf


Normalized RBF Network

Assume spherical symmetry of K(x); then

K\left(\frac{x - x_i}{h}\right) = K\left(\frac{\|x - x_i\|}{h}\right) \quad \text{for all } i.

The normalized radial basis function is defined as

\psi_N(x, x_i) = \frac{K\left(\frac{\|x - x_i\|}{h}\right)}{\sum_{j=1}^{N} K\left(\frac{\|x - x_j\|}{h}\right)}, \quad i = 1, 2, \ldots, N,

with \sum_{i=1}^{N} \psi_N(x, x_i) = 1 for all x.

Page 25: Section5 Rbf


Normalized RBF Network

Letting w_i = y_i for all i, we may rewrite (6) as

F(x) = \sum_{i=1}^{N} w_i\, \psi_N(x, x_i).

\psi_N(x, x_i) may be interpreted as the probability of an event x conditional on x_i.

Page 26: Section5 Rbf


Multivariate Gaussian Distribution

If we take the kernel function to be the multivariate Gaussian distribution,

K(x) = \frac{1}{(2\pi)^{m_0/2}} \exp\left(-\frac{\|x\|^2}{2}\right),

then we can write

K\left(\frac{x - x_i}{h}\right) = \frac{1}{(2\pi)^{m_0/2}} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right), \quad i = 1, 2, \ldots, N,

where the bandwidth h plays the role of the width \sigma.

Page 27: Section5 Rbf


Multivariate Gaussian Distribution

And the normalized RBF network is:

F(x) = \frac{\sum_{i=1}^{N} w_i \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)}{\sum_{i=1}^{N} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)}    (7)

The centers of the RBFs coincide with the data points \{x_i\}_{i=1}^{N}.

Page 28: Section5 Rbf


RBFN vs MLP

RBFN:
- Single hidden layer
- Nonlinear hidden layer and linear output layer
- Argument of hidden units: Euclidean norm
- Universal approximation property
- Local approximators

MLP:
- Single or multiple hidden layers
- Nonlinear hidden layers and linear or nonlinear output layer
- Argument of hidden units: scalar product
- Universal approximation property
- Global approximators

Page 29: Section5 Rbf


Learning Strategies

Parameters to be determined: w_i, c_i, and \sigma_i in

F(x) = \sum_{i=1}^{K} w_i \varphi(x, c_i), \quad \varphi(x, c_i) = \exp\left(-\frac{\|x - c_i\|^2}{2\sigma_i^2}\right)

Traditional learning strategy: split the computation into
- Centers, c_i
- Widths, \sigma_i
- Weights, w_i

Page 30: Section5 Rbf


Computation of Centers

Vector quantization: the centers c_i should capture the density of the training patterns x_i. Common choices:
- Random selection from the training set
- Competitive learning
- Frequency-sensitive learning
- Kohonen learning

This phase uses only the input (x_i) information, not the output (d_i).

Page 31: Section5 Rbf


K-Means Clustering

k(x) = index of the best-matching (winning) center:

k(x) = \arg\min_{k} \|x(n) - t_k(n)\|, \quad k = 1, 2, \ldots, M

where M is the number of clusters and t_k(n) is the location of the k-th center at iteration n. The centers are updated as

t_k(n+1) = t_k(n) + \eta\,[x(n) - t_k(n)] \quad \text{if } k = k(x(n)),
t_k(n+1) = t_k(n) \quad \text{otherwise},

where \eta is the learning rate. A code sketch follows.
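A minimal sketch (not from the slides; the names, learning rate, and epoch count are illustrative) of the online k-means update above for selecting the M centers:

# Online k-means: move only the winning center toward the current pattern.
import numpy as np

def kmeans_centers(X, M, eta=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    t = X[rng.choice(len(X), size=M, replace=False)].astype(float)  # init from random samples
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            k = np.argmin(np.linalg.norm(x - t, axis=1))  # winning center k(x)
            t[k] += eta * (x - t[k])                      # move the winner toward x(n)
    return t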

Page 32: Section5 Rbf


Computation of Widths

- Universal approximation property: valid even with identical widths
- In practice (limited training patterns), variable widths \sigma_i are used
- One approach: use local clusters and select \sigma_i according to the standard deviation of each cluster (a sketch follows)
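One way to realize this heuristic is sketched below (not from the slides; the root-mean-square spread of each cluster is used as its "standard deviation", and the fallback value for empty clusters is an arbitrary choice):

# Per-cluster widths from the spread of the patterns assigned to each center.
import numpy as np

def cluster_widths(X, centers):
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)  # (N, M)
    assign = np.argmin(dist, axis=1)                                     # nearest center per pattern
    return np.array([np.sqrt(np.mean(dist[assign == k, k] ** 2)) if np.any(assign == k) else 1.0
                     for k in range(len(centers))])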

Page 33: Section5 Rbf


Computation of Widths (cont'd)

[Figure: estimated distribution (red dotted line) vs. actual distribution (blue solid line)]

Page 34: Section5 Rbf


Computation of Weights (SVD)

Keeping \varphi(x, c_i) = \exp\left(-\frac{\|x - c_i\|^2}{2\sigma_i^2}\right) constant in

F(x) = \sum_{i=1}^{K} w_i \varphi(x, c_i),

the problem becomes linear! The least-squares criterion

E(F) = \frac{1}{2} \sum_{i=1}^{N} [d_i - F(x_i)]^2

leads to

w = (\Phi^T \Phi)^{-1} \Phi^T d.

In practice, use the SVD (pseudoinverse), as in the sketch below.

Page 35: Section5 Rbf


Computation of Weights (Gradient Descent)

Linear weights (output layer):

\frac{\partial E(n)}{\partial w_i(n)} = \sum_{j=1}^{N} e_j(n)\, \varphi(\|x_j - c_i(n)\|)

w_i(n+1) = w_i(n) - \eta_1 \frac{\partial E(n)}{\partial w_i(n)}, \quad i = 1, 2, \ldots, M

Positions of centers (hidden layer):

\frac{\partial E(n)}{\partial c_i(n)} = 2 w_i(n) \sum_{j=1}^{N} e_j(n)\, \varphi'(\|x_j - c_i(n)\|)\, \Sigma_i^{-1} [x_j - c_i(n)]

c_i(n+1) = c_i(n) - \eta_2 \frac{\partial E(n)}{\partial c_i(n)}, \quad i = 1, 2, \ldots, M

Widths of centers (hidden layer):

\frac{\partial E(n)}{\partial \Sigma_i^{-1}(n)} = -w_i(n) \sum_{j=1}^{N} e_j(n)\, \varphi'(\|x_j - c_i(n)\|)\, Q_{ji}(n), \quad \text{where } Q_{ji}(n) = [x_j - c_i(n)][x_j - c_i(n)]^T

\Sigma_i^{-1}(n+1) = \Sigma_i^{-1}(n) - \eta_3 \frac{\partial E(n)}{\partial \Sigma_i^{-1}(n)}
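The following is a minimal sketch (not from the slides) of these update rules for the simpler isotropic case: a scalar width sigma_i is adapted per unit instead of the full inverse covariance \Sigma_i^{-1}; the gradients follow from E = \frac{1}{2}\sum_j e_j^2 with Gaussian units, and eta1, eta2, eta3 play the roles of the three learning rates above.

# Joint gradient descent on weights w_i, centers c_i, and (isotropic) widths sigma_i.
import numpy as np

def train_rbfn_gd(X, d, centers, sigmas, w, eta1=0.01, eta2=0.01, eta3=0.01, epochs=100):
    # X: (N, p) inputs, d: (N,) targets; centers: (K, p), sigmas: (K,), w: (K,)
    for _ in range(epochs):
        diff = X[:, None, :] - centers[None, :, :]            # x_j - c_i, shape (N, K, p)
        sq = np.sum(diff ** 2, axis=-1)                       # ||x_j - c_i||^2, shape (N, K)
        Phi = np.exp(-sq / (2.0 * sigmas ** 2))               # phi(x_j, c_i)
        e = Phi @ w - d                                       # errors e_j; E = 0.5 * sum_j e_j^2
        grad_w = Phi.T @ e                                    # dE/dw_i = sum_j e_j phi_ji
        ephi = e[:, None] * Phi                               # e_j * phi_ji, shape (N, K)
        grad_c = w[:, None] * np.einsum('jk,jkp->kp', ephi, diff) / sigmas[:, None] ** 2
        grad_s = w * np.sum(ephi * sq, axis=0) / sigmas ** 3
        w = w - eta1 * grad_w
        centers = centers - eta2 * grad_c
        sigmas = sigmas - eta3 * grad_s
    return centers, sigmas, w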

Page 36: Section5 Rbf


Summary

- Learning is finding the surface in multidimensional space that best fits the training data.
- Approximate the function with a linear combination of radial basis functions:

F(x) = \sum_{i=1}^{N} w_i G(\|x - x_i\|)

- G(\|x - x_i\|) is called a Green's function; it can be a uniform or Gaussian function.
- When the number of basis functions equals the number of samples, we call it a regularization network.
- When the number of basis functions is smaller than the number of samples, it is a (generalized) radial basis function network.