

Radial Basis Functions: An Introduction

Prof. Sarat K. Patra, Senior Member, IEEE

National Institute of Technology, Rourkela

Odisha, India

Email: [email protected]


Presentation Outline


Books and reference materials:

• S. Haykin; Neural Networks: A Comprehensive Foundation; Pearson Education

• C. M. Bishop; Neural Networks for Pattern Recognition; Oxford University Press

• B. Mulgrew; "Applying radial basis functions"; IEEE Signal Processing Magazine; vol. 13, no. 2; 1996


What are we going to cover

Introduction

Soft computing Techniques

NN Architectures

Linearly and non-linearly separable problems

Basis Functions

Regularized RBF; Generalized RBF

RBF Training and Examples

Comparison with MLP

Conclusion


Different NN Architectures

• Perceptron (only one neuron)

– Linear decision boundary

– Limited functionality

• MLP

• RBF

• Recurrent networks

• Self-organizing maps

• Many more


Linearly and Non-linearly Separable Problems

• Take a 2-input, single-output network.

• Plot each category of output in the input space using different symbols.

• Take the inputs in the “x-y” plane.

• Can you have a line separating the points into 2 categories?

– Yes – linearly separable (OR, AND gates)

– No – non-linearly separable (EX-OR gate; see the sketch below)
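A minimal sketch (not from the slides) of the separability point: a perceptron, trained with the standard learning rule, fits the linearly separable AND gate but can never fit XOR.

```python
# A perceptron finds a separating line for AND but not for XOR.
import numpy as np

def perceptron_predictions(X, t, epochs=100, lr=0.1):
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, t):
            y = 1 if w @ x > 0 else 0
            w += lr * (target - y) * x         # perceptron learning rule
    return (Xb @ w > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(perceptron_predictions(X, np.array([0, 0, 0, 1])))  # AND: [0 0 0 1]
print(perceptron_predictions(X, np.array([0, 1, 1, 0])))  # XOR: errors remain
```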


Why network models beyond the MLP?

• The MLP is already a universal approximator, but…

• The MLP can have many local minima.

• It is often too slow to train an MLP.

• Sometimes it is extremely difficult to optimize the structure of an MLP.

• There may exist other network architectures, in terms of the number of elements in each layer, whose performance could be superior to the one used.


Radial Basis Function (RBF) Networks

RBFNs are artificial neural networks for application to problems of supervised learning:

Regression

Classification


Parametric Regression

• Parametric regression: the form of the function is known but not the parameter values.

• Typically, the parameters (both the dependent and independent) have physical meaning.

• E.g., fitting a straight line to a set of points.


Non-Parametric Regression

• No prior knowledge of the true form of the function.

• Using many free parameters which have no physical meaning.

• The model should be able to represent a very broad class of functions.


Classification

• Purpose: assign previously unseen patterns to their respective classes.

• Training: previous examples of each class.

• Output: a class out of a discrete set of classes.

• Classification problems can be made to look like nonparametric regression.


Time Series Prediction

• Estimate the next value and future values of a sequence.

• The problem is that usually the sequence is not an explicit function of time. Normally time series are modeled as auto-regressive in nature, i.e., the outputs, suitably delayed, are also the inputs (see the sketch below).

• To create the training set from the available historical sequence first requires choosing how many and which delayed outputs affect the next output.
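In symbols, the slide's omitted formula can be reconstructed along these lines (the function $f$ and the number of delays $p$ are notation assumed here, not from the slide):

$$\hat{x}(n+1) = f\big(x(n),\; x(n-1),\; \dots,\; x(n-p+1)\big)$$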


Supervised Learning in RBFN

• Neural networks, including radial basis function networks, are nonparametric models, and their weights (and other parameters) have no particular meaning in relation to the problems to which they are applied.

• Estimating values for the weights of a neural network (or the parameters of any nonparametric model) is never the primary goal in supervised learning.

• The primary goal is to estimate the underlying function (or at least to estimate its output at certain desired values of the input).


The idea of RBFNN

The MLP is one way to get non-linearity. The other is to use the generalized linear discriminant function:


$$y = \sum_{j} w_j\, \phi_j(\mathbf{x})$$


The idea of RBFNN

For a Radial Basis Function (RBF) network, the basis function is radially symmetric with respect to the input: its value is determined by the distance from the data point to the RBF center.


The Gaussian kernel:

$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{c}_j\|^2}{2\sigma_j^2}\right)$$

where $\mathbf{c}_j$ represents the center, $\sigma_j$ the width, and $\|\cdot\|$ is the distance measure. For the Euclidean distance,

$$\|\mathbf{x}-\mathbf{c}_j\|^2 = \sum_{m=1}^{M}\left(x_m - c_{jm}\right)^2.$$
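A minimal NumPy sketch of the Gaussian basis function above (the function name and the example values are illustrative, not from the slides):

```python
# Evaluate phi_j(x) = exp(-||x - c_j||^2 / (2 sigma^2)) for all centers.
import numpy as np

def gaussian_rbf(X, centers, sigma):
    """Return the N-by-M matrix Phi with Phi[n, j] = phi_j(x_n)."""
    # squared Euclidean distances between every input and every center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0]])        # two 2-D inputs
centers = np.array([[0.0, 0.0], [1.0, 0.0]])  # two RBF centers
print(gaussian_rbf(X, centers, sigma=1.0))
```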


Cover’s Theorem

“A complex pattern-classification problem cast in high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space” (Cover, 1965).


Radial Basis Function Networks

• In its most basic form, a Radial Basis Function network (RBF) involves three layers with entirely different roles.

• The input layer is made up of source nodes that connect the network to its environment.

• The second layer, the only hidden layer, applies a nonlinear transformation from the input space to the hidden space.

• The output layer is linear, supplying the response of the network to the activation pattern applied to the input layer.


The idea of RBFNN

• For an RBFNN, we expect that the function to be learnt can be expressed as a linear superposition of a number of RBFs.


(Figure: a function described as the linear superposition of three basis functions.)


RBF Structure

RBFNN: a two-layer network

Free parameters:

-- the network weights w in the 2nd layer

-- the form of the basis functions

-- the number of basis functions

-- the locations of the basis functions

E.g., for a Gaussian RBFNN these are the number, the centers, and the widths of the basis functions.

(Figure: a two-layer RBF network mapping input x through weights w to output y.)


Some Theory

Given a set of $N$ distinct points $\{\mathbf{x}_i \in \mathbb{R}^{m_0},\ i=1,2,\dots,N\}$ and a corresponding set of $N$ real numbers $\{d_i \in \mathbb{R},\ i=1,2,\dots,N\}$, find a function $F:\mathbb{R}^{m_0} \to \mathbb{R}$ that satisfies the interpolation condition

$$F(\mathbf{x}_i) = d_i, \qquad i=1,2,\dots,N.$$

The radial-basis-function technique consists of choosing a function $F$ of the form

$$F(\mathbf{x}) = \sum_{i=1}^{N} w_i\, \varphi(\|\mathbf{x}-\mathbf{x}_i\|).$$


Some Theory

Micchelli’s Theorem

Let $\{\mathbf{x}_i\}_{i=1}^{N}$ be a set of distinct points in $\mathbb{R}^{m_0}$. Then the $N$-by-$N$ interpolation matrix $\boldsymbol{\Phi}$, whose $ji$-th element is $\varphi_{ji} = \varphi(\|\mathbf{x}_j-\mathbf{x}_i\|)$, is non-singular.
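A quick numerical illustration of the theorem (a sketch, not a proof; a Gaussian $\varphi$ with $\sigma = 1$ is assumed):

```python
# For distinct random points, the Gaussian interpolation matrix Phi with
# Phi[j, i] = phi(||x_j - x_i||) comes out non-singular (full rank).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))            # 10 distinct points in R^3
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
Phi = np.exp(-d2 / 2.0)                     # Gaussian kernel, sigma = 1
print(np.linalg.matrix_rank(Phi))           # prints 10: full rank
```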


Regularization Networks

The regularization network is a universal approximator

The regularization network has the best approximation property

The solution computed by the regularization network is optimal.


Generalized RBF Networks

• When N is large, the one-to-one correspondence between the training input data and the Green's function produces a regularization network that may be considered expensive.

• The generalized RBF network is an approximation of the regularized network.


Generalized RBF Networks

• The approach taken involves searching for a suboptimal solution in a lower-dimensional space that approximates the regularized solution (Galerkin's method):

$$F^*(\mathbf{x}) = \sum_{i=1}^{m_1} w_i\, \varphi_i(\mathbf{x}),$$

where $\{\varphi_i(\mathbf{x}) \mid i=1,2,\dots,m_1 \le N\}$ is a new set of linearly independent basis functions and the $w_i$ constitute a new set of weights.

• We set $\varphi_i(\mathbf{x}) = G(\|\mathbf{x}-\mathbf{t}_i\|)$, $i=1,2,\dots,m_1$, where the set of centers $\{\mathbf{t}_i \mid i=1,2,\dots,m_1\}$ is to be determined.

Note that this particular choice of basis functions is the only one that guarantees that, in the case of $m_1 = N$ and $\mathbf{x}_i = \mathbf{t}_i$, $i=1,2,\dots,N$, the correct solution is consistently recovered.


RBF Structure (2)

• Universal approximation: a Gaussian RBFNN is capable of approximating any function.

(Figure: localized vs. non-localized basis functions.)


Exact Interpolation

• The idea of RBFNN is that we ‘interpolate’ the target function by using the sum of a number of basis functions.

• To illustrate this idea, we consider the special case of exact interpolation, in which the number of basis functions M equals the number of data points N (M = N) and all the basis functions are centered at the data points.

• We want the target values to be exactly interpolated by the summation of basis functions.


Exact Interpolation


We require

$$y(\mathbf{x}^n) = \sum_{j=1}^{M} w_j\, \phi_j(\mathbf{x}^n) = t^n, \qquad n = 1,\dots,N, \qquad \phi_j(\mathbf{x}^n) = \phi(\|\mathbf{x}^n - \mathbf{c}_j\|),$$

or, in matrix form, $\boldsymbol{\Phi}\mathbf{w} = \mathbf{t}$.

Since M = N, $\boldsymbol{\Phi}$ is a square matrix and is non-singular in general cases, so the result is

$$\mathbf{w} = \boldsymbol{\Phi}^{-1}\mathbf{t}.$$


RBF Output with 3 centers

1-Dimensional problem

Center location (-1, 0, 1)


RBF Output with 4 Centers (EX-OR)


(Figure: network outputs for σ² = 0.1 and σ² = 1.0.)


RBF Output with 4 Centers


(Figure: network outputs for σ² = 0.1 and σ² = 1.0.)


An example of exact interpolation

For Gaussian RBF (1D input)

21 data points are generated from y = sin(πx) plus noise (strength 0.2).


The target data points are indeed exactly interpolated, but the generalization performance is not good.
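A minimal sketch of this experiment (assumptions: σ = 0.1 and a fixed random seed; the slide's exact settings may differ):

```python
# Exact interpolation with one Gaussian basis per data point: the fit is
# perfect at the training points but oscillates between them.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 21)                          # 21 input points
t = np.sin(np.pi * x) + 0.2 * rng.standard_normal(21)  # noisy targets

sigma = 0.1
Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
w = np.linalg.solve(Phi, t)                            # w = Phi^{-1} t

print(np.allclose(Phi @ w, t))   # True: training points fit exactly
```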


The hybrid training procedure

• The number of basis functions need not equal the number of data points. Actually, in a typical situation, M should be much less than N.

• The centers of the basis functions are no longer constrained to be at the input data points. Instead, the determination of the centers becomes part of the training process.

• Instead of having a common width parameter σ, each basis function can have its own width, which is also determined by learning.


An example of RBFNN


(Figure: exact interpolation with σ = 0.1 vs. an RBFNN with 4 basis functions, σ = 0.4.)


The hybrid training procedure

• Unsupervised learning in the first layer. This fixes the basis functions using only the knowledge of the input data. For Gaussian RBF, it often includes deciding the number, locations, and widths of the RBFs.

• Supervised learning in the second layer. This determines the network weights in the second layer. If we choose the sum-of-squares error, it becomes a quadratic optimization problem, which is easy to solve.

• In summary, hybrid training avoids using supervised learning simultaneously in two layers, and greatly reduces the computational cost.


Basis function optimization

The form of the basis function is predefined, and is often chosen to be Gaussian.

The number of basis functions often has to be determined by trial, e.g., through monitoring the generalization performance.

The key issue in unsupervised learning is to determine the locations and the widths of the basis functions.


Algorithms for basis function optimization

Subsets of data points:

• Randomly select a number of input data points as basis function centers.

• The width can be chosen to be equal for all and given by some multiple of the average distance between the basis function centers (see the sketch below).
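A minimal sketch of this width heuristic (the multiplying factor 2 and the function name are assumptions for illustration):

```python
# Common width = factor * average distance between distinct center pairs.
import numpy as np

def common_width(centers, factor=2.0):
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    n = len(centers)
    avg = dist.sum() / (n * (n - 1))   # mean over distinct ordered pairs
    return factor * avg

centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(common_width(centers))
```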


Algorithms for basis function optimization

Gaussian mixture models.

• The basis functions are chosen essentially to model the density distribution of the input data (intuitively we want the centers of the basis functions to lie in high-density regions). We may assume the input data is generated by a mixture of Gaussian distributions; optimizing the probability density model then returns the basis function centers and widths.


Algorithms for basis function optimization

Clustering algorithms.

• In this approach the input data is assumed to consist of a number of clusters. Each cluster corresponds to one basis function, with the cluster center being the basis function center. The width can be set equal to some multiple of the average distance between all centers.


K-means clustering algorithm (1)

• The algorithm partitions the data points into K disjoint subsets (K is predefined).

• The clustering criteria are:

– The cluster centers are set in the high-density regions of the data.

– A data point is assigned to the cluster whose center is nearest to it.

• Mathematically, this is equivalent to minimizing the sum-of-squares clustering function shown below.


K-means clustering algorithm (2)


$$J = \sum_{j=1}^{K} \sum_{n \in S_j} \|\mathbf{x}^n - \mathbf{c}_j\|^2$$

where $S_j$ is the $j$-th cluster containing $N_j$ data points, and

$$\mathbf{c}_j = \frac{1}{N_j}\sum_{n \in S_j} \mathbf{x}^n$$

is the mean of the data points in cluster $j$.


K-means clustering algorithm (3)


• Step 1: Randomly assign data points to one of the K clusters. Each data point then has a cluster label.

• Step 2: Calculate the mean of each cluster.

• Step 3: Check whether each data point has the right cluster label. For each data point, calculate its distances to all K centers. If the minimum is not the distance to its own cluster center, reassign the point to the cluster that gives the minimum distance.

• Step 4: After each epoch of checking (one pass over all data points), if no update occurred, i.e., J has reached its minimum, stop. Otherwise, go back to Step 2. (A minimal sketch follows.)
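The four steps above map directly onto a few lines of NumPy. A minimal sketch (not from the slides; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))    # Step 1: random labels
    while True:
        # Step 2: mean of each cluster (empty clusters not handled here)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        # Step 3: reassign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # Step 4: no change, stop
            return centers, labels
        labels = new_labels

data = np.random.default_rng(1).standard_normal((100, 2))
centers, labels = kmeans(data, K=4)
print(centers)
```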


An example of data clustering


(Figure: the data before and after clustering.)


The network training

• The network output after clustering is:


$$y(\mathbf{x}) = \sum_{j=0}^{K} w_j\, \phi_j(\mathbf{x})$$

where $\phi_j(\mathbf{x}) = \exp\!\left(-\|\mathbf{x}-\mathbf{c}_j\|^2 / (2\sigma_j^2)\right)$ for $j > 0$ is the Gaussian RBF, the centers $\mathbf{c}_j$ are those obtained by clustering, and $\phi_0(\mathbf{x}) = 1$ is the bias term.

The output error is

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(\sum_{j=0}^{K} w_j\, \phi_j(\mathbf{x}^n) - t^n\right)^2.$$
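With the basis functions fixed by clustering, minimizing E(w) is linear least squares in w. A minimal NumPy sketch (the function names and the common width sigma are assumptions for illustration):

```python
import numpy as np

def design_matrix(X, centers, sigma):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigma ** 2))           # Gaussian basis functions
    return np.hstack([np.ones((len(X), 1)), Phi])  # prepend bias column phi_0

def train_output_weights(X, t, centers, sigma):
    Phi = design_matrix(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # minimizes E(w)
    return w
```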


RBF in Time Series Prediction

• We will show an example of using an RBFNN for time series prediction.

• Time series prediction: predict the system behavior based on its history.

• Suppose the time course of a system is denoted {S(1), S(2), …, S(n)}, where S(n) is the system state at time step n. The task is to predict the system behavior at step n+1 based on the knowledge of its history, i.e., {S(n), S(n−1), S(n−2), …}. This is possible for many problems in which system states are correlated over time.


RBF in Time Series Prediction

• Consider a simple example, the logistic map, in which the system state x is updated iteratively according to

• Our task is to predict the value of x at any step based on its values in the previous two steps, i.e., to estimate $x_n$ based on $x_{n-2}$ and $x_{n-1}$.


$$x_{n+1} = r\, x_n (1 - x_n)$$


Generating training data from the logistic map

• The logistic map, though simple, shows many interesting behaviors. (More detail can be found at http://mathworld.wolfram.com/LogisticMap.html.)

• The data collection process:

– Choose r = 4 and the initial value of x to be 0.3.

– Iterate the logistic map for 500 steps, and collect 100 examples from the last 100 iterations (chopping the data into triplets; each triplet gives one input-output pair). A sketch follows.
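A minimal sketch of this recipe (one assumption: 102 retained values are needed to form 100 overlapping triplets):

```python
# Iterate the logistic map with r = 4 from x_0 = 0.3, keep the tail of the
# series, and chop it into triplets (x_{n-2}, x_{n-1}) -> x_n.
import numpy as np

r, x = 4.0, 0.3
series = []
for _ in range(500):
    x = r * x * (1.0 - x)         # logistic map update
    series.append(x)
series = np.array(series[-102:])  # 102 values give 100 triplets

inputs = np.stack([series[:-2], series[1:-1]], axis=1)  # (x_{n-2}, x_{n-1})
targets = series[2:]                                    # x_n
print(inputs.shape, targets.shape)                      # (100, 2) (100,)
```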


Generating training data from the logistic map


(Figures: the input data space; the time course of the system state.)


Clustering the input data

• We cluster the input data by using the K-means clustering algorithm.

• We choose K=4. The clustering result returns the centers of basis functions and the scale of width.


The training result of RBFNN


(Figure: the relationship between $x_n$ and $x_{n-2}$.)


The training result of RBFNN


(Figure: the relationship between $x_n$ and $x_{n-1}$.)


Time series predicted data


Comparison with MLP

RBF:

• Simple structure: one hidden layer, with a linear combination at the output layer.

• Simple training: the hybrid procedure, clustering plus a quadratic error function.

• Localized representation: the input space is covered by a number of localized basis functions. A given input typically activates significantly only a limited number of hidden units (those within a close distance).

MLP:

• Complicated structure: often many layers and many hidden units.

• Complicated training: optimizing multiple layers together, with local minima and slow convergence.

• Distributed representation: for a given input, typically many hidden units will be activated.


Comparison with MLP (2)

• Different ways of interpolating data


MLP: data are classified by hyper-planes. RBF: data are classified according to clusters


Shortcomings of RBFNN

• Unsupervised learning implies that an RBFNN may only achieve a sub-optimal solution, since the training of the basis functions does not consider the information of the output distribution.


Example: a basis function is chosen based only on the density of the input data, which gives p(x); it does not match the real output function h(x).


Shortcomings of RBFNN


Example: the output function is determined by only one input component; the other component is irrelevant. Because its basis functions are set by unsupervised learning, an RBFNN is unable to detect this irrelevant component, whereas an MLP may do so (the network weights connected to irrelevant components will tend to have smaller values).


Some Theory

The XOR problem: (x1 OR x2) AND NOT (x1 AND x2). (See the sketch below.)

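A minimal sketch (assumptions: Gaussian basis functions centered at the four input patterns with σ = 1, following the exact-interpolation recipe above) showing that the XOR targets are recovered exactly:

```python
# Gaussian features centered on the input patterns make XOR solvable by a
# linear output layer: exact interpolation recovers the targets.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 0.0])             # XOR targets

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
Phi = np.exp(-d2 / 2.0)                        # centers at the data points
w = np.linalg.solve(Phi, t)                    # second-layer weights
print(np.round(Phi @ w, 6))                    # [0. 1. 1. 0.]
```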


Summary

• The structure of an RBF network is unusual in that the constitution of its hidden units is entirely different from that of its output units.

• Tikhonov’s regularization theory provides a sound mathematical basis for the formulation of RBF networks.

• The Green’s function G(x, x_i) plays a central role in the theory.


Queries?