
Neural Networks Chapter


The sole purpose of this paper is to identify which neural network could provide high storage efficiency, quality, robustness, pattern completion, and content-addressable memory for image (object) recognition in traffic signal systems.

Most pattern-mapping neural networks suffer from the drawback that, during the learning of weights, the weight matrix tends to encode the presently active pattern, thus weakening the trace of patterns it has already learnt. The other problem that common types of neural networks face is the forced categorization of a new pattern into one of the already learnt classes. On occasion such categorization is unreasonable, as the nearest class may be significantly different from the current pattern with respect to the center of the class. The problems of the instability of the weight matrix and of the forced categorization of a new pattern into one of the existing classes have led to the proposal of a new architecture for pattern classification.

Neural nets are of interest to researchers in many areas for different reasons. Electronic engineers find numerous applications in signal processing and control theory. Computer engineers are intrigued by the potential for hardware to implement neural nets efficiently and by applications of neural nets to robotics. Computer scientists find that neural nets show promise for difficult problems in areas such as artificial intelligence and pattern recognition. For applied mathematicians, neural nets are a powerful tool for modelling problems for which the explicit form of the relationships among certain variables is not known.

There are various points of view as to the nature of a neural net. For example, is it a specialized piece of computer hardware (say, a VLSI chip) or a computer program? We shall take the view that neural nets are basically mathematical models of information processing. They provide a method of representing relationships that is quite different from Turing machines or computers with stored programs. As with other numerical methods, the availability of computer resources, either software or hardware, greatly enhances the usefulness of the approach, especially for large problems.

The characteristics of biological neural networks serve as the inspiration for artificial neural networks, or neurocomputing. Artificial neural networks have been developed as generalizations of mathematical models of human cognition or neural biology, based on the assumptions that:

1. Information processing occurs at many simple elements called neurons.
2. Signals are passed between neurons over connection links.
3. Each connection link has an associated weight, which, in a typical neural net, multiplies the signal transmitted.
4. Each neuron applies an activation function (usually nonlinear) to its net input (sum of weighted input signals) to determine its output signal.

The key characteristics of a net are its architecture (the pattern of connections between the neurons), its training algorithm (the method of determining the weights on the connections), and its activation function. The weights represent the information being used by the net to solve a problem.

Each neuron has an internal state, called its activation or activity level, which is a function of the inputs it has received. Typically, a neuron sends its activation as a signal to several other neurons. It is important to note that a neuron can send only one signal at a time, although that signal is broadcast to several other neurons. For example, consider a neuron Y, illustrated in the figure, that receives inputs from neurons X1, X2, and X3. The activations (output signals) of these neurons are x1, x2, and x3, respectively. The weights on the connections from X1, X2, and X3 to neuron Y are w1, w2, and w3, respectively. The net input, y_in, to neuron Y is the sum of the weighted signals from neurons X1, X2, and X3, i.e.,

Page 2: Neural Networks Chapter

y_in = w1 x1 + w2 x2 + w3 x3

The activation y of neuron Y is given by some function of its net input, y = f(y_in), e.g., the logistic sigmoid function (an S-shaped curve)

f(x) = 1 / (1 + exp(-x))

Now suppose further that neuron Y is connected to neurons Z1 and Z2, with weights v1 and v2, respectively. Neuron Y sends its signal y to each of these units. However, in general, the values received by neurons Z1 and Z2 will be different, because each signal is scaled by the appropriate weight, v1 or v2. In a typical net, the activations z1 and z2 of neurons Z1 and Z2 would depend on inputs from several or even many neurons, not just one.
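As a concrete illustration of the computation just described, the following short Matlab sketch evaluates y_in and y for one neuron and then scales y by the outgoing weights to Z1 and Z2. The numerical values are made up for illustration only.

(Matlab code sketch)

x = [0.6; 0.1; 0.9];            % activations x1, x2, x3 of neurons X1, X2, X3 (example values)
w = [0.3; -0.2; 0.5];           % weights w1, w2, w3 on the connections from X1, X2, X3 to Y
y_in = w' * x;                  % net input: w1*x1 + w2*x2 + w3*x3
f = @(z) 1 ./ (1 + exp(-z));    % logistic sigmoid activation function
y = f(y_in);                    % activation y of neuron Y
v = [0.7; -0.4];                % weights v1, v2 on the connections from Y to Z1 and Z2
z_in = v * y;                   % the same signal y, scaled differently for Z1 and Z2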

There is a close analogy between the structure of a biological neuron (i.e., a brain or nerve cell) and the processing element (or artificial neuron) presented in the rest of this book. In fact, the structure of an individual neuron varies much less from species to species than does the organization of the system of which the neuron is an element.

A biological neuron has three types of components that are of particular interest in understanding an artificial neuron: its dendrites, soma, and axon. The many dendrites receive signals from other neurons. The signals are electric impulses that are transmitted across a synaptic gap by means of a chemical process. The action of the chemical transmitter modifies the incoming signal (typically, by scaling the frequency of the signals that are received) in a manner similar to the action of the weights in an artificial neural network. The soma, or cell body, sums the incoming signals. When sufficient input is received, the cell fires; that is, it transmits a signal over its axon to other cells. It is often supposed that a cell either fires or doesn't at any instant of time, so that transmitted signals can be treated as binary. However, the frequency of firing varies and can be viewed as a signal of either greater or lesser magnitude. This corresponds to looking at discrete time steps and summing all activity (signals received or signals sent) at a particular point in time. The transmission of the signal from a particular neuron is accomplished by an action potential resulting from differential concentrations of ions on either side of the neuron's axon sheath (the brain's "white matter"). The ions most directly involved are potassium, sodium, and chloride.

A generic biological neuron is illustrated in Figure 1.3, together with axons from two other neurons (from which the illustrated neuron could receive signals) and dendrites for two other neurons (to which the original neuron would send signals). Several key features of the processing elements of artificial networks are suggested by the properties of biological neurons, viz., that:


1. The processing element receives many signals.
2. Signals may be modified by a weight at the receiving synapse.
3. The processing element sums the weighted inputs.
4. Under appropriate circumstances (sufficient input), the neuron transmits a single output.
5. The output from a particular neuron may go to many other neurons (the axon branches).

Other features of artificial neural networks that are suggested by biological neurons are:

6. Information processing is local (although other means of transmission, such as the action of hormones, may suggest means of overall process control).
7. Memory is distributed:
   a. Long-term memory resides in the neurons' synapses or weights.
   b. Short-term memory corresponds to the signals sent by the neurons.
8. A synapse's strength may be modified by experience.
9. Neurotransmitters for synapses may be excitatory or inhibitory.

Yet another important characteristic that artificial neural networks share with biological neural systems is fault tolerance. Biological neural systems are fault tolerant in two respects. First, we are able to recognize many input signals that are somewhat different from any signal we have seen before. An example of this is our ability to recognize a person in a picture we have not seen before or to recognize a person after a long period of time. Second, we are able to tolerate damage to the neural system itself. Humans are born with as many as 100 billion neurons. Most of these are in the brain, and most are not replaced when they die [Johnson & Brown, 1988]. In spite of our continuous loss of neurons, we continue to learn. Even in cases of traumatic neural loss, other neurons can sometimes be trained to take over the functions of the damaged cells. In a similar manner, artificial neural networks can be designed to be insensitive to small damage to the network, and the network can be retrained in cases of significant damage (e.g., loss of data and some connections).

Even for uses of artificial neural networks that are not intended primarily to model biological neural systems, attempts to achieve biological plausibility may lead to improved computational features. One example is the use of a planar array of neurons, as is found in the visual cortex, for Kohonen's self-organizing maps. The topological nature of these maps has computational advantages, even in applications where the structure of the output units is not itself significant. Other researchers have found that computationally optimal groupings of artificial neurons correspond to biological bundles of neurons [Rogers & Kabrisky, 1989]. Separating the action of a backpropagation net into smaller pieces to make it more local (and therefore, perhaps, more biologically plausible) also allows improvement in computational power (cf. Section 6.2.3) [D. Fausett, 1990].

A unified probabilistic model for independent and principal component analysis (Aapo Hyvarinen)

Principal component analysis (PCA) and independent component analysis (ICA) are both based on a linear model of multivariate data. They are often seen as complementary tools, PCA providing dimension reduction and ICA separating underlying components or sources. In practice, a two-stage approach is often followed, where first PCA and then ICA is applied. Here, we show how PCA and ICA can be seen as special cases of the same probabilistic generative model. In contrast to conventional ICA theory, we model the variances of the components as further parameters. Such variance parameters can be integrated out in a Bayesian framework, or estimated in a more classic framework. In both cases, we find a simple objective function whose maximization enables estimation of PCA and ICA. Specifically, maximization of the objective under a Gaussian assumption performs PCA, while its maximization for whitened data, under an assumption of non-Gaussianity, performs ICA.

The main purposes of a principal component analysis are to analyze the data to identify patterns, and to reduce the dimensionality of the dataset with minimal loss of information. Here, our desired outcome of the principal component analysis is to project a feature space (our dataset consisting of n d-dimensional samples) onto a smaller subspace that represents our data "well". A possible application would be a pattern classification task, where we want to reduce the computational costs and the error of parameter estimation by reducing the number of dimensions of our feature space to a subspace that describes our data "best".

Principal Component Analysis (PCA) Vs. Multiple Discriminant Analysis (MDA)

Both Multiple Discriminant Analysis (MDA) and Principal Component Analysis (PCA) are linear transformation methods and closely related to each other. In PCA, we are interested in finding the directions (components) that maximize the variance in our dataset, whereas in MDA we are additionally interested in finding the directions that maximize the separation (or discrimination) between different classes (for example, in pattern classification problems where our dataset consists of multiple classes, in contrast to PCA, which ignores the class labels).

In other words, via PCA we are projecting the entire set of data (without class labels) onto a different subspace, while in MDA we are trying to determine a suitable subspace to distinguish between patterns that belong to different classes. Roughly speaking, in PCA we are trying to find the axes with maximum variance where the data is most spread out (within a class, since PCA treats the whole dataset as one class), and in MDA we are additionally maximizing the spread between classes.

In typical pattern recognition problems, a PCA is often followed by an MDA.


What is a "good" subspace?

Let's assume that our goal is to reduce the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d). So, how do we know what size we should choose for k, and how do we know whether we have a feature space that represents our data "well"?

Later, we will compute eigenvectors (the components) from our dataset and collect them in a so-called scatter matrix (or alternatively calculate them from the covariance matrix). Each of those eigenvectors is associated with an eigenvalue, which tells us about the "length" or "magnitude" of the eigenvector. If we observe that all the eigenvalues are of very similar magnitude, this is a good indicator that our data is already in a "good" subspace. If some of the eigenvalues are much higher than others, we might be interested in keeping only those eigenvectors with the much larger eigenvalues, since they contain more information about our data distribution. Vice versa, eigenvalues that are close to 0 are less informative, and we might consider dropping those when we construct the new feature subspace.
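The procedure sketched above (covariance or scatter matrix, eigen-decomposition, ranking by eigenvalue, projection onto the top k eigenvectors) can be written down compactly. The following Matlab sketch uses made-up data and variable names of our own choosing; it is only meant to make the steps concrete.

(Matlab code sketch)

X  = randn(200, 5) * randn(5, 5);          % 200 samples of 5 correlated features (toy data)
Xc = X - repmat(mean(X, 1), 200, 1);       % center each feature
C  = cov(Xc);                              % covariance matrix (d x d)
[V, D] = eig(C);                           % eigenvectors (columns of V) and eigenvalues (diagonal of D)
[evals, order] = sort(diag(D), 'descend'); % largest eigenvalues first
V  = V(:, order);
k  = 2;                                    % keep only the k most informative eigenvectors
Y  = Xc * V(:, 1:k);                       % project onto the k-dimensional subspace
retained = cumsum(evals) / sum(evals);     % fraction of total variance kept for each choice of k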

2D example

First, consider a dataset in only two dimensions, like (height, weight). This dataset can be plotted as points in a plane. But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x, y) value. The axes don't actually mean anything physical; they're combinations of height and weight called "principal components" that are chosen to give one axis lots of variation.

Eating in the UK (a 17D example)

With many dimensions, PCA is more useful, because it's hard to see through a cloud of data. What if our data have way more than 3 dimensions? Like, 17 dimensions?! In the table is the average consumption of 17 types of food, in grams per person per week, for every country in the UK.

The table shows some interesting variations across different food types, but overall the differences aren't so notable. Let's see if PCA can eliminate dimensions to emphasize how the countries differ.


Here's the plot of the data along the first principal component. Already we can see something is different about Northern Ireland.

Now, looking at the first and second principal components, we see that Northern Ireland is a major outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish eat way more grams of fresh potatoes and way fewer of fresh fruit, cheese, fish and alcoholic drinks. It's a good sign that the structure we've visualized reflects a big fact of real-world geography: Northern Ireland is the only one of the four countries not on the island of Great Britain. (If the differences among England, the UK and Great Britain are confusing, you're not alone.)


Independent component analysis (ICA) is a quite powerful technique and is able (in principle) to separate independent sources linearly mixed in several sensors. For instance, when recording electroencephalograms (EEG) on the scalp, ICA can separate out artifacts embedded in the data (since they are usually independent of each other).

ICA is a technique to separate linearly mixed sources. For instance, let's try to mix and then separate two sources. Let's define the time courses of two independent sources, A (top) and B (bottom).

We then mix these two sources linearly. The top curve is equal to A minus twice B, and the bottom curve is the linear combination 1.73*A + 3.41*B.


We then input these two signals into the ICA algorithm (in this case, fastICA) which is able to uncover the original activation of A and B.

Note that the algorithm cannot recover the exact amplitude of the source activities. Note also that, in theory, ICA can only extract sources that are combined linearly.

(Matlab Code)

A = sin(linspace(0,50, 1000));   % A
B = sin(linspace(0,37, 1000)+5); % B
figure;
subplot(2,1,1); plot(A);         % plot A
subplot(2,1,2); plot(B, 'r');    % plot B

M1 = A - 2*B;                    % mixing 1
M2 = 1.73*A + 3.41*B;            % mixing 2
figure;
subplot(2,1,1); plot(M1);        % plot mixing 1
subplot(2,1,2); plot(M2, 'r');   % plot mixing 2

figure;
c = fastica([M1;M2]);            % compute the unmixing using fastICA
subplot(1,2,1); plot(c(1,:));    % plot recovered source 1
subplot(1,2,2); plot(c(2,:));    % plot recovered source 2

Whitening the data


We will now explain the preprocessing performed by most ICA algorithms before actually applying ICA. A first step in many ICA algorithms is to whiten (or sphere) the data. This means that we remove any correlations in the data; i.e., the different channels (matrix Q) are forced to be uncorrelated.

Why do that? A geometrical interpretation is that it restores the initial "shape" of the data, so that ICA must then only rotate the resulting matrix (see below). Once more, let's mix two random sources A and B. At each time point, in the following graph, the value of A is the abscissa of the data point and the value of B is its ordinate.

Let's take two linear mixtures of A and B and plot these two new variables.

(Matlab Code)

POINTS = 1000;                   % number of points to plot

% define the two random variables
% -------------------------------
for i=1:POINTS
    A(i) = round(rand*99)-50;    % A
    B(i) = round(rand*99)-50;    % B
end;
figure; plot(A,B, '.');                        % plot the variables
set(gca, 'xlim', [-80 80], 'ylim', [-80 80]);  % redefine the limits of the graph

% mix linearly these two variables
% --------------------------------
M1 = 0.54*A - 0.84*B;            % mixing 1
M2 = 0.42*A + 0.27*B;            % mixing 2
figure; plot(M1,M2, '.');                      % plot the mixtures
set(gca, 'ylim', get(gca, 'xlim'));            % redefine the limits of the graph

% whiten the data
% ---------------
x = [M1;M2];


c = cov(x')                      % covariance
sq = inv(sqrtm(c));              % inverse of square root
mx = mean(x');                   % mean
xx = x - mx'*ones(1,POINTS);     % subtract the mean
xx = 2*sq*xx;
cov(xx')                         % the covariance is now a diagonal matrix
figure; plot(xx(1,:), xx(2,:), '.');

% show projections
% ----------------
figure;
axes('position', [0.2 0.2 0.8 0.8]); plot(xx(1,:), xx(2,:), '.'); hold on;
axes('position', [0 0.2 0.2 0.8]); hist(xx(1,:)); set(gca, 'view', [90 90]);
axes('position', [0.2 0 0.8 0.2]); hist(xx(2,:));

% show projections
% ----------------
figure;
axes('position', [0.2 0.2 0.8 0.8]); plot(A,B, '.'); hold on;
axes('position', [0 0.2 0.2 0.8]); hist(A); set(gca, 'view', [90 90]);
axes('position', [0.2 0 0.8 0.2]); hist(B);

Then if we whiten the two linear mixtures, we get the following plot


The variance on both axes is now equal, and the correlation of the projections of the data on the two axes is 0 (meaning that the covariance matrix is diagonal and that all the diagonal elements are equal). Applying ICA then only means "rotating" this representation back into the original A and B axis space.

The whitening process is simply a linear change of coordinates of the mixed data. Once the ICA solution is found in this "whitened" coordinate frame, we can easily reproject the ICA solution back into the original coordinate frame.
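To make this "change of coordinates" idea concrete, the sketch below composes a whitening matrix with a rotation and shows how the result maps back. The rotation angle is an arbitrary stand-in for whatever rotation an ICA algorithm would actually estimate, and the mixing coefficients are reused from the example above.

(Matlab code sketch)

s  = [rand(1,1000) - 0.5; rand(1,1000) - 0.5];   % two independent (uniform) sources
x  = [0.54 -0.84; 0.42 0.27] * s;                % two linear mixtures, as above
x  = x - repmat(mean(x, 2), 1, 1000);            % remove the mean
Ww = inv(sqrtm(cov(x')));                        % whitening (sphering) matrix
xw = Ww * x;                                     % whitened data: cov(xw') is close to the identity
theta = pi/6;                                    % stand-in for the rotation ICA would find
R  = [cos(theta) -sin(theta); sin(theta) cos(theta)];
W  = R * Ww;                                     % full unmixing matrix in the original coordinate frame
S  = W * x;                                      % candidate sources
Xback = W \ S;                                   % reproject the solution back to the original frame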

The ICA algorithm

Intuitively, you can imagine that ICA rotates the whitened matrix back to the original (A, B) space (first scatter plot above). It performs the rotation by minimizing the Gaussianity of the data projected on both axes (fixed-point ICA). For instance, in the example above, the projection on both axes is quite Gaussian (i.e., it looks like a bell-shaped curve). By contrast, the projection in the original A, B space is far from Gaussian.

By rotating the axes and minimizing the Gaussianity of the projections in the first scatter plot, ICA is able to recover the original sources, which are statistically independent (this property follows from the central limit theorem, which states that any linear mixture of two independent random variables is more Gaussian than the original variables). In Matlab, the kurtosis function (kurt() in the EEGLAB toolbox; kurtosis() in the Matlab Statistics Toolbox) gives an indication of the Gaussianity of a distribution (but the fixed-point ICA algorithm uses a slightly different measure called negentropy).
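A rough numerical check of this central-limit-theorem argument can be made by reusing the two sine sources mixed earlier: the mixtures have a kurtosis closer to 3, the value for a Gaussian. (kurtosis() requires the Matlab Statistics Toolbox; kurt() in EEGLAB plays the same role.)

(Matlab code sketch)

A  = sin(linspace(0, 50, 1000));        % source A, as in the first mixing example
B  = sin(linspace(0, 37, 1000) + 5);    % source B
M1 = A - 2*B;                           % first mixture
M2 = 1.73*A + 3.41*B;                   % second mixture
kurtosis([A; B; M1; M2]')               % sources: about 1.5; mixtures: noticeably closer to 3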

The Infomax ICA algorithm in the EEGLAB toolbox is not as intuitive; it involves minimizing the mutual information of the data projected on both axes.

However, even if ICA algorithms differ from a numerical point of view, they are all equivalent from a theoretical point of view.

ICA in N dimensions

So far we have dealt with only two dimensions. However, ICA can deal with an arbitrarily high number of dimensions. Let's consider 128 EEG electrodes, for instance. The signal recorded at all electrodes at each time point then constitutes a data point in a 128-dimensional space. After whitening the data, ICA will "rotate the 128 axes" in order to minimize the Gaussianity of the projection on all axes (note that, unlike PCA, the axes do not have to remain orthogonal). What we call the ICA components are obtained from the matrix that projects the data in the initial space onto the axes found by ICA. The weight matrix is the full transformation from the original space. When we write

S = W X

X is the data in the original space. For EEG

                time points
Electrode 1  [ 0.134 0.424 0.653 0.739 0.932 0.183 0.834 ... ]
Electrode 2  [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ... ]
Electrode 3  [ 0.824 0.534 0.314 0.654 0.739 0.932 0.183 ... ]

For fMRI

           voxels
Time 1  [ 0.134 0.424 0.653 0.739 0.932 0.183 0.834 ... ]
Time 2  [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ... ]
Time 3  [ 0.824 0.534 0.314 0.654 0.739 0.932 0.183 ... ]

S is the source activity.

In EEG: an artifact time course, or the time course of one compact domain of activity in the brain.

                 time points
Component 1  [ 0.824 0.534 0.314 0.654 0.739 0.932 0.183 ... ]
Component 2  [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ... ]
Component 3  [ 0.153 0.734 0.134 0.324 0.654 0.739 0.932 ... ]

In fMRI: an artifact topography, or the topography of a statistically maximally independent pattern of activation.

W is the weight matrix to go from the data space X to the source space S.

The rows of W are the vectors with which we can compute the activity of one independent component. To compute the component activity via the formula S = W X, the weight matrix W is laid out as follows (if the linear transformation between X and S is still unclear, that is, if you do not know how to perform matrix multiplication, an introductory linear algebra text is a good starting point).

                elec1  elec2  elec3  elec4  elec5
Component 1  [ 0.824  0.534  0.314  0.654  0.739 ... ]
Component 2  [ 0.314  0.154  0.732  0.932  0.183 ... ]
Component 3  [ 0.153  0.734  0.134  0.324  0.654 ... ]

For instance, to compute the activity of the second source or second independent component (as a matrix multiplication), you simply multiply the matrix X (see the beginning of this paragraph) by the row vector

                elec1  elec2  elec3  elec4  elec5
Component 2  [ 0.314  0.154  0.732  0.932  0.183 ... ]
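In Matlab this multiplication is a single line. The sketch below uses small random matrices in place of real EEG data and a real ICA result, purely to show the shapes involved.

(Matlab code sketch)

X  = rand(3, 7);        % data: 3 electrodes x 7 time points (toy numbers)
W  = rand(3, 3);        % unmixing (weight) matrix, as if found by ICA
S  = W * X;             % all component activities at once
s2 = W(2, :) * X;       % activity of component 2: row 2 of W times the data
% s2 is identical to S(2, :)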


Now you have the activity of the second component, but the activity is unitless. If you have heard of inverse modeling, the analogy with EEG/ERP sources in dipole localization software is the easiest to grasp. Each dipole has an activity (which projects linearly to all electrodes). The activity of the brain source (dipole) is unitless until it is projected to the electrodes. So each dipole creates a contribution at each electrode site. ICA components are just the same. Now we will see how to reproject one component to the electrode space. W⁻¹ is the inverse matrix to go from the source space S to the data space X.

X = W⁻¹ S

In Matlab you would just type inv(W) to obtain the inverse of a matrix.

                comp1  comp2  comp3  comp4  comp5
Electrode 1  [ 0.184  0.253  0.131  0.364  0.639 ... ]
Electrode 2  [ 0.731  0.854  0.072  0.293  0.513 ... ]
Electrode 3  [ 0.125  0.374  0.914  0.134  0.465 ... ]

If S is a row vector (for instance, the activity of component 2 computed above) and we multiply it, on the left, by the following column vector taken from the inverse matrix above

               comp2
Electrode 1  [ 0.253 ]
Electrode 2  [ 0.854 ]
Electrode 3  [ 0.374 ]

we will obtain the projected activity of component 2: the inverse weights for component 2 (column vector; bottom left below) multiplied by the activity of component 2 (row vector; top right below) give the component projection (matrix; bottom right).

(on the right, one row of the S matrix: the activity of component 2)

                 [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ... ]

  [ 0.253 ]      [ 0.824 0.534 0.314 0.654 0.739 0.932 0.183 ... ]
  [ 0.854 ]      [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ... ]
  [ 0.374 ]      [ 0.153 0.734 0.134 0.324 0.654 0.739 0.932 ... ]


(Above is the projection of one component's activity onto all the electrodes; note that the calculation is not carried out exactly and that the numbers are meaningless.) This matrix will be denoted XC2.

Now, if one wants to remove component number 2 from the data (for instance, if component number 2 proved to be an artifact), one can simply subtract the matrix above (XC2) from the original data X. Note that in the matrix computed above (XC2) all the columns are proportional, which means that the scalp activity is simply scaled over time. For this reason, we call the columns of the W⁻¹ matrix the scalp topographies of the components. Each column of this matrix is the topography of one component, which is scaled in time by the activity of that component. The scalp topography of each component can be used to estimate the equivalent dipole location for that component (assuming the component is not an artifact).
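Continuing with the same toy matrices as before, the sketch below back-projects component 2 onto the electrodes and removes it from the data, which is exactly the subtraction described above (the matrices are random placeholders, not real EEG).

(Matlab code sketch)

X    = rand(3, 7);              % data: electrodes x time points (toy numbers)
W    = rand(3, 3);              % unmixing matrix, as if found by ICA
S    = W * X;                   % component activities
Winv = inv(W);                  % columns of Winv are the scalp topographies of the components
XC2  = Winv(:, 2) * S(2, :);    % projection of component 2 onto all electrodes
Xclean = X - XC2;               % data with component 2 (e.g., an artifact) removed
S(2, :) = 0;                    % equivalently: zero out the component...
Xcheck = Winv * S;              % ...and remix; Xcheck matches Xclean up to rounding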

In conclusion, when we talk about independent components, we usually refer to two concepts:

- the rows of the S matrix, which are the time courses of the component activities
- the columns of the W⁻¹ matrix, which are the scalp projections of the components

ICA properties

From the preceding paragraphs, several properties of ICA become obvious:

- ICA can only separate linearly mixed sources.
- Since ICA is dealing with clouds of points, changing the order in which the points are plotted (the order of the time points in EEG) has virtually no effect on the outcome of the algorithm.
- Changing the channel order (for instance, swapping electrode locations in EEG) also has no effect on the outcome of the algorithm. For EEG, the algorithm has no a priori knowledge of the electrode locations, and the fact that ICA components can most of the time be resolved to a single equivalent dipole is evidence that ICA is able to isolate compact domains of cortical synchrony.
- Since ICA separates sources by maximizing their non-Gaussianity, perfectly Gaussian sources cannot be separated.
- Even when the sources are not independent, ICA finds a space where they are maximally independent.

Signal Mixtures

We know that signal mixtures tend to have Gaussian (normal) probability density functions, and that source signals have non-Gaussian pdfs. We also know that each source signal can be extracted from a set of signal mixtures by taking the inner product of a weight vector and those signal mixtures, where this inner product provides an orthogonal projection of the signal mixtures. But we do not yet know precisely how to find such a weight vector. One type of method for doing so is exploratory projection pursuit, often referred to simply as projection pursuit. Projection pursuit methods seek one projection at a time, such that the extracted signal is as non-Gaussian as possible. This contrasts with ICA, which typically extracts M signals simultaneously from M signal mixtures, which requires estimating a (possibly very large) M x M unmixing matrix. One practical advantage of projection pursuit over ICA is that fewer than M signals can be extracted if required, where each source signal is extracted from the M signal mixtures using an M-element weight vector.

The name projection pursuit derives from the fact that this method seeks a weight vector which provides an orthogonal projection of a set of signal mixtures such that each extracted signal has a pdf which is as non-gaussian as possible.
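As a toy illustration of this idea (our own construction, not taken from any particular projection pursuit implementation), the sketch below whitens two mixtures and then scans candidate weight vectors on the unit circle, keeping the one whose projection is least Gaussian as measured by kurtosis.

(Matlab code sketch)

N = 5000;
s = [rand(1, N) - 0.5; sign(randn(1, N))];        % two non-Gaussian sources
x = [0.54 -0.84; 0.42 0.27] * s;                  % two signal mixtures
x = x - repmat(mean(x, 2), 1, N);                 % remove the mean
z = inv(sqrtm(cov(x'))) * x;                      % whiten the mixtures
bestTheta = 0; bestScore = -Inf;
for theta = 0:0.01:pi                             % candidate weight vectors on the unit circle
    w = [cos(theta); sin(theta)];
    y = w' * z;                                   % orthogonal projection of the mixtures
    score = abs(mean(y.^4) / mean(y.^2)^2 - 3);   % |excess kurtosis| as the non-Gaussianity measure
    if score > bestScore, bestScore = score; bestTheta = theta; end
end
w = [cos(bestTheta); sin(bestTheta)];
extracted = w' * z;                               % the single most non-Gaussian extracted signal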

Let us consider the example of human height. Suppose that the height h_i of an individual is the outcome of many underlying factors, which include a genetic component s_i^G and a dietary component s_i^D (i.e., nature vs. nurture). Let us further suppose that the contribution of each factor to height is the same for all individuals (i.e., the nature/nurture ratio is fixed). Finally, we need to assume that the total effect of these different factors in each individual is the sum of their contributions. If we treat the contribution of each factor as a constant coefficient, then we can write

h_i = a s_i^G + b s_i^D

where a and b are non-zero coefficients. Each coefficient determines how height increases with the factors s_i^G and s_i^D. Note that s_i^G and s_i^D vary across individuals, whereas the coefficients a and b are the same for all individuals. The central limit theorem ensures that the pdf of the h_i values is approximately Gaussian, irrespective of the pdfs of the s_i^G and s_i^D values and irrespective of the constants a and b.

Of course, we should recognize the above equation for what it is: the formation of a signal mixture h by a linear combination of source signals s^G and s^D, using mixing coefficients a and b. Note that h_i could equally well be a mixture of two voice signals.
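A quick histogram check makes the point: with two flat (uniform) stand-ins for the genetic and dietary factors and arbitrary non-zero coefficients (all values below are made up), the resulting "heights" already look far more bell-shaped than either factor.

(Matlab code sketch)

N  = 10000;
sG = rand(1, N) - 0.5;          % stand-in for the genetic factor (uniform, clearly non-Gaussian)
sD = rand(1, N) - 0.5;          % stand-in for the dietary factor
a  = 1; b  = 1;                 % fixed, non-zero mixing coefficients
h  = a*sG + b*sD;               % "heights": a linear combination of the two factors
figure;
subplot(3,1,1); hist(sG, 50);   % flat histogram
subplot(3,1,2); hist(sD, 50);   % flat histogram
subplot(3,1,3); hist(h, 50);    % roughly triangular: already much closer to a bell shape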

As a further example, in signal processing it is almost always assumed that, after the signals of interest have been extracted from a noisy stream of data, the residual noise is Gaussian. As stated above, this assumption is mathematically very convenient, but it is also usually valid. If the residual noise is the result of many processes whose outputs are added together, then the central limit theorem (CLT) guarantees that this noise is indeed approximately Gaussian.

Gaussian Signals: Good News, Bad News

The bad news is that the converse of the CLT is not true in general; that is, it is not true that every Gaussian signal is a mixture of non-Gaussian signals. The good news is that, in practice, Gaussian signals often do consist of a mixture of non-Gaussian signals. This is good news because it means we can treat any Gaussian signal as if it consists of a mixture of non-Gaussian source signals. Given a set of such Gaussian mixtures, we can then proceed to find each source signal by finding the unmixing vector that extracts the most non-Gaussian signal from the set of mixtures.

We could now proceed using two different strategies. We could define a measure of the distance between the signal extracted by a given unmixing vector and a Gaussian signal, and then find the unmixing vector that maximizes this distance; one such distance is the Kullback-Leibler divergence. A simpler strategy consists of defining a measure of non-Gaussianity and then finding the unmixing vector that maximizes this measure.

The fact that there are actually two types of non-Gaussian signals will not detain us long, because we shall assume that our source signals are of one type only. The two types are known by various terms, such as super-Gaussian and sub-Gaussian, or equivalently as leptokurtotic and platykurtotic, respectively; a signal with zero excess kurtosis is mesokurtotic. A signal with a super-Gaussian pdf has most of its values clustered around zero, whereas a signal with a sub-Gaussian pdf does not. As examples, a speech signal has a super-Gaussian pdf, while a sawtooth function and white noise (with a uniform amplitude distribution) have sub-Gaussian pdfs. This implies that super-Gaussian signals have pdfs that are more peaked than that of a Gaussian signal, whereas a sub-Gaussian signal has a pdf that is less peaked than that of a Gaussian signal.
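The following sketch puts approximate numbers on these categories using simple surrogate signals of our own choosing (a Laplacian variable standing in for a speech-like super-Gaussian signal, and a sawtooth and uniform noise as sub-Gaussian examples). A Gaussian signal has kurtosis about 3; super-Gaussian signals come out above 3 and sub-Gaussian ones below.

(Matlab code sketch)

N = 100000;
kurt = @(y) mean((y - mean(y)).^4) / mean((y - mean(y)).^2)^2;     % plain kurtosis, no toolbox needed
gaussSig   = randn(1, N);                                          % mesokurtotic: kurtosis about 3
superSig   = -sign(randn(1, N)) .* log(rand(1, N));                % Laplacian stand-in for speech: about 6
sawSig     = mod(linspace(0, 500, N), 1) - 0.5;                    % sawtooth: about 1.8
uniformSig = rand(1, N) - 0.5;                                     % uniform noise: about 1.8
[kurt(gaussSig) kurt(superSig) kurt(sawSig) kurt(uniformSig)]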