Unsupervised Neural Networks Learning Rules, BCM Learning Rule

Faculty of Mathematics and Physics
Comenius University, Bratislava
Institute of Informatics

Unsupervised Neural Networks Learning Rules, BCM
Learning Rule and Its Computational Properties

Diploma Thesis

Author: Pavel Petrovič
Advisor: RNDr. Ľubica Beňušková, CSc.


I honestly declare that I have written this diploma thesis myself, using only the cited sources and literature.

signature


Acknowledgements

I would like to give my appreciation to my advisor RNDr. Ľubica Beňušková,

CSc. She provided me with all necessary information and references on BCM and a

lot of help. She created a very friendly and cooperative atmosphere, and invited me to

take part in scientific seminars held at the Department of Computer Science and

Engineering of Slovak University of Technology.

Many thanks belong also to her colleague Ing. Peter Tiňo, CSc., who was a generous source of many ideas, had a strong influence on the direction of our work in the most crucial moments, and who provided me with Information Theory references

and software utilities.

I am also grateful to my tutor PhDr. Ján Šefránek, DrSc., who organized the

diploma thesis seminar, where we regularly discussed the progress of our work on the thesis.


Table of contents:

Acknowledgements .............................................................................................1
Table of contents ...............................................................................................2
1. Introduction ....................................................................................................3
2. Unsupervised Learning Rules ........................................................................5
   2.1. Unsupervised Hebbian Learning ...................................................................6
   2.2. Oja's Rule .....................................................................................................7
   2.3. Principal component analysis (PCA).............................................................8
   2.4. One-Layer Feed-Forward Networks..............................................................8
   2.5. Self-Organizing Feature Extraction with Hebbian learning...........................9
   2.6. Unsupervised Competitive Learning............................................................10
3. BCM Learning Rule.......................................................................................12
   3.1. Introduction ................................................................................................12
   3.2. Basic concepts of the BCM Theory..............................................................12
   3.3. Experiments with the BCM Neuron and Time Sequences.............................15
      3.3.1. Introduction ..........................................................................................15
      3.3.2. Theoretical background.........................................................................17
      Figure 3.3.b: An example of the HMM automaton (automaton definition). .....21
      3.3.3. Implementation.....................................................................................21
      3.3.4. Results ..................................................................................................28
Resume ................................................................................................................38
References............................................................................................................39
Appendix A, Source code and examples of data files ...........................................41
   Program source code.........................................................................................41
Appendix B, Entropic spectra ..............................................................................41


1. Introduction

One of the first (and often cited) works regarding neural networks, "The

Organization of Behavior" by Donald Hebb [Hebb49] introduces a basic principle for

updating synaptic strengths: "When an axon of cell A ... excite(s) cell B and

repeatedly or persistently takes part in firing it, some growth process or metabolic

change takes place in one or both cells so that A's efficiency as one of the cells firing

B is increased". It is not yet completely understood which underlying processes are responsible for the described changes in real biological neural networks.

Although detailed mathematical models of neurons that used differential

equations have been created, they were not acceptable in terms of computational

complexity. In addition, they contained too many modifiable parameters. One of the

primary challenges of artificial intelligence today is to develop an efficient model of the neuron, as clearly stated by Rodney A. Brooks in [Selman96].

Many simplified models have been developed, and they provided a large area for formal research. This led to the establishment of a new independent field, artificial neural networks, which, although originally biologically motivated, moved far from real biological neural networks. Their classification and pattern recognition features approach statistics and the theories of learning and information.

The central topic in artificial neural networks is the learning process, which is

driven primarily by the learning rule. The basic learning rules that have been suggested have been analyzed in detail. Most of their limitations have been uncovered, and it seems that the simplifications inherently contained in these models imply, in most cases, too strong constraints. That is why it is reasonable to explore learning rules based on different principles, and this is the motivation of the second part of this thesis. The hope is that this laborious and iterative process will converge to a more biologically plausible and efficient rule with either a general or a highly specific purpose.

Two main approaches of the learning rules and network architectures can be

identified: unsupervised and supervised learning. Supervised learning requires a large set of training data and is usually not very adaptable to changes in the circumstances.

circumstances. Once the training stage is finished, the network learning ceases. On the

other hand, outputs of networks with unsupervised learning are of different kind. They


provide either some statistical information, or perform a useful transformation, anonymous classification, or clustering, which makes it more difficult to interpret the outputs directly. To make use of the advantages of both approaches, several mixed models with both supervised and unsupervised learning have been designed.

The thesis focuses on the unsupervised learning rules. The first part of the thesis

gives an overview of unsupervised learning rules. Their particular characteristics,

usage and limitations are summarized. The second part focuses on the BCM learning

rule suggested by Bienenstock, Cooper and Munro. We give an overview of the BCM

Theory. We design, perform and discuss several experiments with symbolic time

sequences in order to analyze computational properties of the learning rule. The

source code of the programs, which we used in the experiments, is included in

Appendix A. The entropy spectra, which we have computed for two symbolic

sequences, are in Appendix B.


2. Unsupervised Learning Rules¹

We can understand a neural network (even one consisting only of a single neuron) as a projection F: I → O of an input vector xi ∈ I from the space of input vectors I to the space of possible output vectors F(xi) ∈ O, keeping in mind that F is dynamic and may change while inputs are processed. In unsupervised learning, F is updated

without external intervention. Changes are implied only by the inputs. The network

must find the regularities, categories, correlations, or patterns in the input and organize itself to provide useful codes on the output side. Unsupervised learning is possible only thanks to the redundancy in the input data; otherwise, it would not be possible to differentiate between input carrying information and random noise. According to

[Hertz91], output of the unsupervised neural network may represent different features

of the input:

• Familiarity - how familiar the input pattern is to the typical or average patterns

seen in the past. The network gradually learns what is typical.

• Principal Component Analysis - the similarity to the previous examples is

measured along the set of axes.

• Clustering - input vectors are assigned to a certain category, and the output of the network is a binary vector with only one bit set, identifying the category. Each cluster of similar or nearby patterns would then be classified

as a single output class.

• Prototyping - the input data are classified into categories, but for each input the network outputs a prototype of the category that the input belongs to.

• Encoding - output could be an encoded version of the input, in fewer bits, keeping

as much relevant information as possible. This could be used for data

compression, assuming that an inverse decoding network could also be

constructed.

• Feature Mapping - output units may be seen as lying in a fixed geometrical arrangement, with only one active at a time, mapping input patterns to different points in this arrangement. A topographic map of the input is constructed and

¹ The formulas in this chapter are based on [Hertz91].


similar input patterns are projected to nearby output units. A global organization

of the output units evolves.

These cases are not distinct and might be combined. Encoding can be performed using principal component analysis or by clustering, also called vector quantization. Principal component analysis can also be used for dimensionality reduction before performing clustering or feature mapping, avoiding the "curse of dimensionality". Another usage of unsupervised learning is to replace supervised learning where possible, either because of the computational complexity of supervised networks or to provide a mechanism for the network to adapt to changes.

Unsupervised neural networks tend:

• to be of a simple architecture (complexities lie in the learning rules)

• not to have many layers (usually only one)

• to contain many fewer output units than inputs (with the exception of Feature Mapping)

• to be more biologically plausible than other kinds of networks.

We can separate unsupervised networks into two main categories: those based on the modified Hebb rule and those based on competitive learning. We will describe them in two separate subsections.

2.1. Unsupervised Hebbian Learning

We consider input vectors xi with components xij for j = 1..N and with the probability distribution P(xi). The single linear output unit (see Figure 2.1) is the simplest example, and the formula giving the output is:

F(xi) = Σj=1..N wj xij = w^T xi = xi^T w    (2.1)

where w is the weight vector.

Figure 2.1. Plain Hebbian learning architecture (inputs xi1, ..., xiN, connected through weights w1, ..., wN to the single output F(xi)).


Since the output F(xi) has to be a measure of similarity, a plain Hebbian learning rule may be applied:

∆wj = η F(xi) xij    (2.2)

where η is the learning rate. Frequent input patterns strengthen the weights most

and thus they produce the largest output. The problem with this rule is that the

weights keep growing forever. This can be fixed. Suppose that there exists an equilibrium point for w. At equilibrium, we want the changes to w to be 0 on average, and thus:

0 = ⟨∆wj⟩ = ⟨F(xi) xij⟩ = ⟨Σk wk xik xij⟩ = Σk Cjk wk, i.e. ⟨∆w⟩ = Cw,    (2.3)

where C is a symmetric correlation matrix defined as:

Cjk ≡ ⟨xij xik⟩    (2.4)

or

C ≡ ⟨xi xi^T⟩.    (2.5)

If there were an equilibrium w, it would be an eigenvector of C with eigenvalue 0, but C has some positive eigenvalues, and a fluctuation having a component along an eigenvector with a positive eigenvalue would grow exponentially. The direction of the eigenvector with the largest eigenvalue λmax of C becomes dominant, and w approaches this direction with an ever-growing norm.
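The runaway growth described above can be demonstrated numerically. The following sketch is an illustration, not part of the thesis; the 2-D correlated input distribution and the learning rate η = 0.01 are assumptions made for the example. It applies rule (2.2) and tracks |w|:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_input():
    # Zero-mean 2-D inputs with strongly correlated components,
    # so the correlation matrix C has a large positive eigenvalue.
    z = rng.normal(size=2)
    return np.array([z[0], 0.8 * z[0] + 0.2 * z[1]])

eta = 0.01
w = rng.normal(scale=0.1, size=2)
norms = []
for _ in range(2000):
    x = sample_input()
    y = w @ x                 # output F(x) = w . x, as in eq. (2.1)
    w += eta * y * x          # plain Hebbian update, eq. (2.2)
    norms.append(np.linalg.norm(w))

print(norms[0], norms[-1])    # |w| grows without bound
```

The component of w along the leading eigenvector of C is amplified on average by a factor (1 + η λmax) per step, so |w| diverges exponentially, exactly as the argument above predicts.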

2.2. Oja's Rule

We can restrict the growth of the weight vector w in plain Hebbian learning, for example, by normalization after each change, wj' = α wj, with α chosen so that |w| = 1. Another approach, suggested by Oja in 1982, is to use the modified learning rule:

∆wj = η F(xi) (xij − F(xi) wj).    (2.6)

It can be proven that w approaches unit length and tends to an eigenvector with the largest eigenvalue.
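Under the same illustrative input distribution as before (an assumption made for the example, not taken from the thesis), Oja's rule (2.6) stabilizes the weight norm. The sketch below compares the final w against the leading eigenvector of the exact correlation matrix of the assumed distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_input():
    # Same illustrative correlated 2-D zero-mean input distribution.
    z = rng.normal(size=2)
    return np.array([z[0], 0.8 * z[0] + 0.2 * z[1]])

eta = 0.01
w = rng.normal(scale=0.1, size=2)
for _ in range(20000):
    x = sample_input()
    y = w @ x
    w += eta * y * (x - y * w)    # Oja's rule, eq. (2.6)

# Exact correlation matrix of sample_input() and its leading eigenvector:
C = np.array([[1.0, 0.8], [0.8, 0.68]])
evals, evecs = np.linalg.eigh(C)
v = evecs[:, np.argmax(evals)]
print(np.linalg.norm(w), abs(w @ v))
```

In contrast to the plain Hebbian run, |w| settles near 1 and w aligns with the principal eigenvector, up to small fluctuations of order η.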

There exist different ways of modifying the plain Hebbian rule. One way is to simply clip the components of the weight vector w so that they remain in some interval w− ≤ wj ≤ w+ [Linsker86]. Another modification, by Yuille et al., uses the rule:

∆wj = η (F(xi) xij − wj |w|²).    (2.7)


Here w converges to the eigenvector with the largest eigenvalue, with |w|² = λmax instead of |w| = 1. An advantage of this modified rule is that there is an associated cost function:

E = −(1/2) Σjk Cjk wj wk + (1/4) (Σj wj²)².    (2.8)

2.3. Principal component analysis (PCA)

Linsker reports [Linsker86] that performing principal component analysis is equivalent to maximizing the information content of the output signal in situations where the output has a Gaussian distribution. The aim is to find, in the data space, a set of M orthogonal vectors that account for as much of the data variance as possible. The original data can then be projected from their original N-dimensional space to the M-dimensional subspace (usually M << N), performing a dimensionality reduction that retains most of the information in the data. The kth principal component is taken to be along the direction with the kth maximum variance, and it can be shown that this component corresponds to the eigenvector direction belonging to the kth largest eigenvalue of the full covariance matrix ⟨(xij − µj)(xik − µk)⟩, where µj = ⟨xij⟩. For zero-mean data this reduces to the corresponding eigenvectors of the correlation matrix C defined above, and it is always possible to center the data to achieve zero mean.
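The construction described here can be sketched directly with a standard eigendecomposition; the 3-D data with per-axis variances below are an assumption made for the illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# 5000 samples in N = 3 dimensions; variances differ per axis (illustrative).
data = rng.normal(size=(5000, 3)) * np.array([3.0, 1.0, 0.3])
data -= data.mean(axis=0)              # center the data (zero mean)

C = data.T @ data / len(data)          # sample covariance matrix
evals, evecs = np.linalg.eigh(C)       # eigenvalues in ascending order
order = np.argsort(evals)[::-1]
components = evecs[:, order]           # k-th column = k-th principal direction

M = 2
projected = data @ components[:, :M]   # reduce N = 3 -> M = 2 dimensions

retained = evals[order][:M].sum() / evals.sum()
print(projected.shape, round(retained, 3))
```

With most of the variance along the first two axes, the projection to M = 2 dimensions retains nearly all of the data variance, which is exactly the sense in which PCA "keeps most of the information".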

2.4. One-Layer Feed-Forward Networks

Oja's rule can be used to find the first principal component in zero-mean data. To find M principal components we can use a one-layer feed-forward network designed by Sanger in 1989 [Sanger89] or Oja in 1989 [Oja89]. The output Fj(xi) is given by the formula:

Fj(xi) = Σk=1..N wjk xik = wj^T xi    (2.9)

where wj is the weight vector of the jth output. Both Sanger's and Oja's learning rules take the form:

∆wjk = η Fj(xi) (xik − Σl Fl(xi) wlk)    (2.10)

but the upper limit of the sum over l is j for Sanger and M for Oja. The main difference is that Sanger's rule finds exactly the first M principal components, whereas Oja's rule



finds M vectors that span the same subspace as the first M eigenvectors, but they are not the same and depend on the initial conditions. These rules are not local, since updating the weight wjk requires information from nodes other than input k and output j. There exists a modification of the Sanger rule which is local:

∆wjk = η Fj(xi) (x(j)ik − Fj(xi) wjk), where x(j)ik = xik − Σl=1..j−1 wlk Fl(xi)    (2.11)

is the residual input from which the reconstructions of the previous outputs have already been subtracted.

Other network architectures are used for principal component analysis too. A self-supervised back-propagation network with N inputs, N outputs and one hidden layer of M < N units, trained so that the outputs are as close as possible to the inputs in the training set, produces the same result as Oja's M-unit rule.

Another architecture, designed by Rubner and Tavan in 1989, contains a one-layer network with trainable lateral connections between the M output units (lateral connections exist only "from the left to the right", or between all units in Földiák's similar approach). The ordinary weights are trained with the plain Hebbian rule with renormalization to unit length, and the lateral weights ujk are trained with anti-Hebbian learning:

∆ujk = −γ Fj(xi) Fk(xi)    (2.12)

This architecture extracts the M principal components as Sanger's rule does, and the lateral connections converge to zero.
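A sketch of Sanger's rule (2.10) extracting M = 2 components from illustrative 3-D zero-mean data; the input distribution, learning rate, and iteration count are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(3)

N, M, eta = 3, 2, 0.002
scales = np.array([3.0, 1.0, 0.3])        # zero-mean inputs, distinct variances
W = rng.normal(scale=0.1, size=(M, N))    # row j holds the weight vector w_j

for _ in range(60000):
    x = rng.normal(size=N) * scales
    y = W @ x                              # outputs F_j(x), eq. (2.9)
    for j in range(M):
        # Sanger's rule: subtract the reconstruction by outputs l = 1..j
        residual = x - W[:j + 1].T @ y[:j + 1]
        W[j] += eta * y[j] * residual      # eq. (2.10) with upper limit j

# Rows of W should approximate the two leading eigenvectors (the axes here).
print(np.round(W, 2))
```

Because the variances are distinct, the first row converges (up to sign) to the direction of largest variance and the second to the next one, in order, which is the behavior that distinguishes Sanger's rule from Oja's M-unit subspace rule.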

2.5. Self-Organizing Feature Extraction with Hebbian learning

In feature extraction the goal is to have many output units, each one most sensitive to a particular input, with different output units choosing different input patterns. The number of output units may be larger than the number of input units. This can be measured by the selectivity of a particular output Oi (as defined by Bienenstock in 1982):

Selectivityi = 1 − ⟨Oi⟩ / max Oi    (2.13)

where the average ⟨Oi⟩ and the maximum are both taken over all the possible inputs. The selectivity is near 1 if the output unit favors a single input (or a narrow range of inputs), and it is near zero if it responds equally to all inputs.
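Equation (2.13) can be illustrated with two hypothetical response profiles; the numbers below are invented for the example:

```python
import numpy as np

def selectivity(responses):
    # Selectivity = 1 - <O> / max O, eq. (2.13), over all tested inputs.
    return 1.0 - responses.mean() / responses.max()

# Hypothetical responses of two output units to the same five inputs:
sharp = np.array([0.01, 0.02, 0.95, 0.03, 0.01])   # tuned to one input
flat = np.array([0.50, 0.48, 0.52, 0.49, 0.51])    # responds to everything

print(round(selectivity(sharp), 3), round(selectivity(flat), 3))
```

The sharply tuned unit scores close to 1, while the indiscriminate unit scores close to 0, matching the interpretation given above.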

The aim is to define an architecture and learning rule in which the outputs converge to high selectivity, and to have different output units become sensitive to


different input patterns with some output unit matched to every input pattern. Another

goal is that similar input patterns should activate nearby output units arranged in a geometrical structure.

The area of feature mapping is very challenging also because of the experimental

biological evidence in the animal visual cortex.

An interesting model consisting of three layers A, B and C, where units are connected by feed-forward connections to the neighboring units in the previous layer (the receptive field), was designed by Linsker in 1986 [Linsker86]. The output O of a particular unit receiving input from K units Ij, j = 1..K, is:

O = a + Σk=1..K wk Ik    (2.14)

where a is an optional threshold. Since the units are linear, the network could be replaced by a network with just one layer, but the layers are important for the learning rule with tunable parameters b, c and d:

∆wj = η (Ij O + b Ij + c O + d)    (2.15)

Then the weights are clipped to the range w− ≤ wj ≤ w+. In addition, this rule tries to maximize the output variance under the constraint that Σj wj equals a constant (a combination of the constants a-d).

2.6. Unsupervised Competitive Learning

The main principle of competitive learning is that only one output unit, or only one per group, is activated at a time. The output units compete for activation, and this is where the name for this type of learning comes from: winner-takes-all.

The aim of these architectures is to cluster or categorize the input data. Similar

input patterns are classified into the same category. The network has to find the

classes based on the correlation of the input data. The possible uses include any

categorization in AI, data encoding and compression, function approximation, image

processing, statistical analysis and combinatorial optimization. There are several disadvantages of winner-takes-all architectures:

• The output code is not very effective, since one output cell represents one category, and N units can represent only N categories instead of the possible 2^N.

• These architectures are not robust to degradation or failure: if one output unit fails, the whole category is lost.


• Hierarchical knowledge cannot be represented, only one level of categorization is

possible with the winner-takes-all method.

A similar approach called feature mapping develops spatial organization in the

output layer. These fields are intertwined and should be examined together.

Probably the most important contribution in this area was made by Prof. Teuvo Kohonen, who designed several competitive learning architectures, the most important being the Self-Organizing (Feature) Map (SOM) algorithm. The input vector is sent in

parallel to all neurons distributed in a single layer. The activation of each node is the inner (dot) product of the input vector with the input weight vector, which is specific to each node. The node with the highest activation (or, in the Euclidean-distance version, the node with the weight vector closest to the input vector) is the winner, and its index is the output of

the network. The weight vectors of the winner and its topological neighbors are

updated: close neighboring nodes move their weight vectors towards the input vector. At

the same time, nodes that are distant from the winner are inhibited and move their

weight vector in the opposite direction. The result is the topological map of the input

space.
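A minimal sketch of one SOM training loop in the spirit of the description above, with a ring topology and a Gaussian neighborhood. All parameters are illustrative assumptions, and the explicit repulsion of distant nodes mentioned above is omitted:

```python
import numpy as np

rng = np.random.default_rng(4)

# 10 nodes on a ring, mapping 2-D inputs drawn from the unit circle.
n_nodes, eta = 10, 0.1
W = rng.normal(scale=0.1, size=(n_nodes, 2))

def neighborhood(winner, sigma=1.5):
    # Gaussian neighborhood over node indices, with ring wrap-around.
    d = np.abs(np.arange(n_nodes) - winner)
    d = np.minimum(d, n_nodes - d)
    return np.exp(-d**2 / (2 * sigma**2))

for _ in range(5000):
    angle = rng.uniform(0, 2 * np.pi)
    x = np.array([np.cos(angle), np.sin(angle)])
    winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # Euclidean winner
    W += eta * neighborhood(winner)[:, None] * (x - W)      # pull toward x

print(np.round(np.linalg.norm(W, axis=1), 2))
```

After training, the weight vectors spread out near the input circle, and neighboring nodes hold similar weight vectors, which is the topological map described in the text.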

We have already mentioned the difficulty with applying unsupervised learning to

classification problems. The solution is provided by Kohonen’s supervised LVQ

algorithms, which are responsible for tuning the weights of the SOM layer in order to

minimize the number of wrong classifications. Each output node belongs to one of the

possible categories.

Applications of the competitive learning include speech recognition, robotics and

various applications of topological mapping. See [Kohonen95] for a more detailed list of applications.

Kohonen's recent work also includes the physiological interpretation of the SOM algorithm [Kohonen93]. It is argued that more than pure transsynaptic communication between the neurons occurs; it is suggested that this process is mediated by a diffuse chemical effect.


3. BCM Learning Rule

3.1. Introduction

The area of artificial neural networks was originally motivated by biological neural networks. It later diverged from the original direction and converged into a theory that provides reasonable tools for machine learning, in the case of supervised architectures, and many computational procedures applicable in the field of statistics, in the case of unsupervised architectures. The BCM learning rule, named after Bienenstock, Cooper and Munro, has a similar and long history.

In this part of the thesis, we will give a detailed overview of the BCM

bibliography. In addition to a general view, we have divided the published material

into three categories. The first category is strictly biologically motivated and usually

based on real biological experiments. Here, the BCM serves as a simplified

mathematical model of animal neural cell in visual or barrel cortex. The second

category includes other applications of the BCM learning rule and its modifications,

particularly the projection pursuit. The last category contains papers describing and

analyzing the mathematical details of the BCM theory. This overview is followed by a part concerning the experiments which we performed with the BCM learning rule. Its introduction outlines the purpose of the experiments. It is followed by a section which summarizes the necessary theory, and by a section with implementation details. The last section discusses the results.

3.2. Basic concepts of the BCM Theory

The original motivation came from real experiments with the visual cortex of animals. In 1975, Nass and Cooper [Nass75] explored a theory in which the modification of visual cortical synapses was purely Hebbian.

Then a significant extension of the theory was presented by Cooper, Liberman and

Oja in 1979 [Cooper79]. It was presented as a theoretical solution to the problem of

visual cortex plasticity. The main idea was that the sign of weight modification should

be based on whether the postsynaptic response is above or below a threshold.

Responses above the threshold should lead to strengthening of the active synapses,

responses below the threshold lead to weakening of the active synapses.


Neurons in the primary visual cortex of a normal adult cat are sharply tuned to the orientation of an elongated slit of light, and most are activated by stimulation of either eye, as stated by Hubel and Wiesel in 1959 [Hubel59]. Both of these properties, orientation selectivity and binocularity, depend on the type of visual environment experienced during a critical period of early postnatal development [Intrator92]. Several striking effects appear after abnormal postnatal development. The theoretical solution for the plasticity of the visual cortex was presented by Cooper, Liberman and Oja in [Cooper79].

The original theory used a modification threshold θm that was static. Bienenstock, Cooper and Munro suggested [Bienenstock82] that this value varies as a nonlinear function of the average output of the postsynaptic neuron, which is the main concept of the present BCM model. This provides stability properties and explains several important effects. The form of synaptic modification is²:

dmj/dt = φ(c, θm(t)) dj    (3.1)

where mj is the efficacy of the jth Lateral Geniculate Nucleus (LGN) synapse onto a cortical neuron (i.e. the input weight), dj is the level of presynaptic activity of the jth LGN afferent (i.e. the input), c is the level of activation or the postsynaptic activity of the postsynaptic neuron (i.e. the output), which is given (in the linear region) by m·d, and θm is a nonlinear function of cell activity averaged over some time, which in the original BCM formulation was proposed as

θm(t) = (⟨c(t)⟩τ)² / c0.    (3.2)

The dynamic modification threshold θm(t) is a nonlinear function of the time-averaged postsynaptic activity c(t), so that

θm(t) = ⟨c²(t)⟩τ / c0    (3.3)

where c0 is a positive scaling constant (originally, (⟨c(t)⟩τ)² was used instead of ⟨c²(t)⟩τ). The averaged cell activity over some recent past, ⟨c²(t)⟩τ, is determined for example by:

⟨c²(t)⟩τ = (1/τ) ∫_{−∞}^{t} c²(t') e^{−(t−t')/τ} dt'    (3.4)

² All equations in this section are based on [Intrator92].



where τ is the averaging period. From these relations it follows that when the postsynaptic activity c(t) is greater than zero but less than the modification threshold θm, all active synapses (i.e. those with di(t) > 0) weaken. On the other hand, when the postsynaptic activity c(t) is greater than θm, all active synapses potentiate. Since c(t) = Σi mi(t) di(t), the correlation of excitatory inputs plays a crucial role in driving the postsynaptic cell activity above the modification threshold θm. The key property of θm is that it is not fixed; its current value is proportional to the postsynaptic response averaged over some recent past time.

The shape of the function φ for different values of θm is drawn in Figure 3.1.

Figure 3.1. The φ function for two different values of the threshold θm, plotted against the postsynaptic activity c; φ > 0 produces potentiation, φ < 0 depression. Usually, φ takes the form of a parabola, i.e. φ = c(c − θm).
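A discrete-time sketch of a single BCM neuron using φ = c(c − θm) from Figure 3.1 and a running average in place of the integral (3.4). The two-pattern environment and all parameters are illustrative assumptions, not the thesis's experimental setup; with two equiprobable orthogonal patterns, the neuron becomes selective for one of them:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two orthogonal input patterns presented with equal probability.
patterns = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
eta, tau, c0 = 0.005, 50.0, 1.0
w = rng.uniform(0.1, 0.5, size=2)
c2_avg = 0.0

for _ in range(50000):
    d = patterns[rng.integers(2)]
    c = w @ d                          # postsynaptic activity c = m . d
    c2_avg += (c ** 2 - c2_avg) / tau  # running estimate of <c^2>_tau, cf. (3.4)
    theta = c2_avg / c0                # dynamic threshold, eq. (3.3)
    w += eta * c * (c - theta) * d     # weight change with phi = c(c - theta)

# The neuron becomes selective: one weight grows, the other decays toward 0.
print(np.round(w, 2), round(theta, 2))
```

The sliding threshold is what makes this stable: as the winning response grows, θm rises with ⟨c²⟩, pushes the weaker pattern's response below threshold, and its synapse is depressed, yielding the selectivity discussed throughout this chapter.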

The BCM model has great biological relevance; details can be found for example in [Bear87]. The group of researchers working with the BCM theory is currently putting a lot of effort into this area. The model has also been applied to the rat barrel cortex by Beňušková et al. [Beňušková94] for modeling the neuron after whisker sparing. The proposed model has been further extended with inhibitory cells in [Beňušková97]. An important aspect of this work is that the theory is usually compared with experiment, as in [Clothiaux91], where such processes in the visual network as monocular deprivation (MD), normal rearing (NR), reverse suture (RS), strabismus (ST), binocular deprivation (BD) and recovery from deprivation (RE) are explained using the BCM model. A neural network of visual cortex has also been modeled, and its stability and the fixed points of its synapses were analyzed in [Cooper88]. The BCM Theory was mathematically analyzed from various points of view by several



researchers, primarily by Nathan Intrator. An objective function for the originally suggested learning rule was formulated in [Intrator92], and the learning rule was derived from the risk function for both linear and nonlinear neurons and for a network with feed-forward inhibition. An analysis of the fixed points, and of the stability of the solution with respect to noise of different kinds, was performed. Differential equations for a neural network were analyzed in [Intrator92]. The work of Intrator and Cooper revealed a possible field of applications of the BCM learning rule that is far from biology: it was found that BCM is a form of projection pursuit. BCM as a projection pursuit procedure was compared to backward propagation and PCA in [Bachman94]. They used data from radar presentations and found that BCM achieves the best performance. They used an architecture with lateral connections and suggested and derived the architecture and learning rule for recurrent BCM models. A comparison of the BCM model and the negative feedback network [Fyfe95] was performed in [Fyfe97]. BCM was also applied to the classification of underwater mammal sounds in [Cooper98]. The current work of this group concentrates on the move from artificial visual environments to natural scene environments, as in [Cooper97].

3.3. Experiments with the BCM Neuron and Time Sequences

3.3.1. Introduction

The BCM learning rule has been applied in several different situations. It was shown, for example, that a modified version can perform an efficient computation of projection pursuit, and determine bimodality, statistical skewness and kurtosis of the distribution of the input data.

The original motivation for our experiments was the internal state of the BCM

neuron - its threshold θ. We believed it should have a significant impact on the ability

of the BCM neuron (or some kind of network based on BCM neurons) to recognize

some properties of the symbolic time sequences.

The goal of all the experiments which we performed with the BCM learning rule was to study the behavior of a single neuron exposed to symbolic time sequences built from two symbols, namely 0 and 1. Symbols from the input sequence fed the neuron's weighted input. The weight w and the threshold θ were updated after each


input. Figure 3.2 shows a scheme of this circuit. The development of both weight and

threshold was observed.

Figure 3.2. General scheme of the circuit for the experiments: the symbolic sequence feeds the BCM neuron, which maintains the weight w(t) and the threshold θ(t) and produces the activation c(t).

We performed the experiments with the following variants of the BCM neuron:

• linear neuron

• linear neuron with additive noise added to the input

• neuron with sigmoid function

• neuron with sigmoid function with noise on input

• neuron with sigmoid function with one recurrent connection

In order to better compare the behavior of the neuron for sequences with different degrees of determinism, we used three different kinds of sequences:

• deterministic (produced by a finite-state automaton) (DET)

• sequence produced by Hidden Markov Model automaton (HMM) [Rabiner86]

• random sequence (Bernoulli source) (RND)
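Generators for the three kinds of sequences might be sketched as follows; the concrete periodic pattern and the HMM probabilities are invented for the illustration, and are not the thesis's sets A-D:

```python
import random

random.seed(6)
length = 10000

# DET: a periodic sequence, e.g. a finite-state automaton cycling "0010".
det = ("0010" * (length // 4))[:length]

# HMM: two hidden states with "stay" and '1'-emission probabilities
# (an illustrative automaton).
def hmm_sequence(n):
    stay = {0: 0.9, 1: 0.7}      # P(keep the current hidden state)
    emit = {0: 0.1, 1: 0.6}      # P(emit '1' | state)
    state, out = 0, []
    for _ in range(n):
        out.append('1' if random.random() < emit[state] else '0')
        if random.random() >= stay[state]:
            state = 1 - state
    return ''.join(out)

hmm = hmm_sequence(length)

# RND: a Bernoulli source emitting '1' with a fixed probability.
rnd = ''.join('1' if random.random() < 0.25 else '0' for _ in range(length))

print(det.count('1'), hmm.count('1'), rnd.count('1'))
```

The emission probabilities here were chosen so that all three sequences contain roughly the same fraction of '1' symbols, mirroring the matching constraint described below.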

We worked with 4 different sets of sequences named A, B, C, D. Each set

contained one (periodic) deterministic, one HMM and one random sequence. The

difference between the sets of sequences was primarily in the nature of HMM

automaton for HMM sequences and in one case also in the number of symbols ‘1’

present in the sequence. Sequences within each set contained the same number of

symbols ‘1’ in order to make the results more comparable.

The different degrees of determinism of the sequences in one set can be appropriately measured by the entropy of the sequence. Even more information can be extracted from the entropy spectrum. We describe these concepts in the next chapter.

Each experiment produced a sequence of cell activations (which is proportional

to the input weight in the case of one input) and a sequence of threshold values. We

have found out that for two sequences with different entropies and the same number


of symbols ‘1’ the resulting weight differs, i.e. it is dependent on the entropy of the

input sequence (or at least has a strong relation with it).

We were also interested in the nature of the dynamics of the threshold θ. The weight stabilizes after a certain number of iterations. The role of the threshold θ is to compensate for changes in the input. Thus it is interesting to investigate the sequence of differences of θ from its average (expected value), which can itself be transformed into a symbolic time sequence. We computed entropy spectra for these symbolic sequences of θ differences and compared them with the entropy spectrum of the original input sequences. Further details can be found in the section with results.

3.3.2. Theoretical background

In this section, we will give the theoretical details of the implementation of the

BCM neuron as well as the measures used for reasoning about the results.

Let d be the input to the neuron, w the weight of the input connection, θ the threshold for weight modification, and φ the polynomial function used in the BCM weight modification rule. The activity at time t is then given by:

c(t) = w(t).d(t) (3.5)

for linear neuron and

c(t) = σ(w(t).d(t)) (3.6)

for the nonlinear neuron, where σ is the usual sigmoid function

σ(x) = 2 / (1 + e^(−2x)) − 1 (3.7)

with the derivative:

σ′(x) = 4e^(−2x) / (1 + e^(−2x))^2 (3.8)

Weight is updated by the rule:

w(t+1)= w(t) + ∆w (3.9)

where the weight modification is

∆w = η . φ ( c(t), θ ) . d (3.10)

for linear neuron and

∆w = η . φ( c(t), θ ) . σ′(w(t).d(t)) . d (3.11)

for the nonlinear neuron, where

φ( c, θ ) = c.(c − θ) (3.12)

Footnote: The formulas regarding BCM are from [Intrator], and the information theory equations were provided by Peter Tiňo; see for example [Tiňo] or [Katok].

In cases with additive noise, we used d(t) + noise(t) instead of d(t), where noise(t) is uniformly distributed in the interval [−α, α].

The threshold θ is defined by (3.4), but the continuous integral was approximated by a discrete sum, namely by accumulating a weighted average of the activity over recent iterations (see the source code). In addition, a scaling parameter c0 allows scaling of this average:

θ(t) = c̄τ(t) / c0 (3.13)

where c̄τ(t) denotes the activity averaged over recent iterations with time constant τ.
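As an illustration of equations (3.5)-(3.13), a single-input nonlinear BCM neuron of the kind used here can be sketched in a few lines of Python. This is our own minimal re-implementation, not the BCMSNLIN code itself; the function names, the initial weight and the exact form of the running average are our assumptions:

```python
import math

def sigmoid(x):
    """Eq. (3.7): sigma(x) = 2/(1 + exp(-2x)) - 1 (equivalent to tanh(x))."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def sigmoid_deriv(x):
    """Eq. (3.8): derivative of the sigmoid above."""
    e = math.exp(-2.0 * x)
    return 4.0 * e / (1.0 + e) ** 2

def run_bcm(inputs, eta=0.001, tau=2.0, c0=0.85, w0=0.1):
    """Feed a numeric symbol sequence to one nonlinear BCM neuron.

    Returns the trajectories of the activity c(t), the weight w(t) and
    the threshold theta(t).  The threshold is a tau-weighted running
    average of the activity scaled by c0, our reading of eq. (3.13).
    """
    w, avg_c = w0, 0.0
    cs, ws, thetas = [], [], []
    for d in inputs:
        x = w * d
        c = sigmoid(x)                         # eq. (3.6)
        avg_c += (c - avg_c) / tau             # running average of activity
        theta = avg_c / c0                     # eq. (3.13)
        phi = c * (c - theta)                  # eq. (3.12)
        w += eta * phi * sigmoid_deriv(x) * d  # eq. (3.11)
        cs.append(c); ws.append(w); thetas.append(theta)
    return cs, ws, thetas
```

For the linear variant of (3.5) and (3.10), the sigmoid is replaced by the identity and the derivative factor is dropped.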

To determine the entropy of symbolic sequences, we calculated entropies for increasing window lengths.

If the probability of some event is Pi, its information content is given by the

formula:

Ii = −log2 Pi (3.14)

Based on this, Shannon introduced the entropy of the data as the sum of products

of probabilities and information contents of all possible events:

H = Σi Pi . Ii (3.15)

For a window of length n there exist 2^n different words (over the two symbols '0' and '1'). Let the probability of a word wi of length n be Pn(wi). The entropy of the sequence for a window of length n is thus:

Hn = − Σi Pn(wi) . log2 Pn(wi) (3.16)

Entropy can be understood as a measure of uncertainty. The uncertainty is high

for large Hn, i.e. all possible words are almost equally likely to occur in the sequence.

For small values of Hn, the uncertainty is smaller, i.e. the sequence is more

deterministic.

However, we are interested in a measure normalized per symbol. For this purpose we use the entropy per symbol:


hn = Hn / n (3.17)

The desired entropy of the symbolic sequence is then:

h = lim(n→∞) hn (3.18)
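Equations (3.14)-(3.17) translate directly into word counting. The following sketch (our naming, not the EN utility itself) computes Hn and hn for a binary string:

```python
from collections import Counter
from math import log2

def block_entropy(seq, n):
    """H_n of eq. (3.16): Shannon entropy of the length-n word distribution."""
    words = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def entropy_per_symbol(seq, n):
    """h_n = H_n / n, eq. (3.17)."""
    return block_entropy(seq, n) / n
```

For a periodic sequence such as '100100100…', hn falls towards 0 as n grows, while for a fair Bernoulli sequence it stays near 1 bit per symbol.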

To gain a better picture of the internal structure of the sequence, we employed

entropy spectra. They consist of entropies of the sequence computed for different temperatures T = 1/β.

Instead of Pn(w), we will use

Pn,β(w) = Pn(w)^β / Σw′ Pn(w′)^β (3.19)

where the sum runs over all words w′ of length n.

This transformed probability reveals the structure of the word histogram: for β > 1 (i.e. positive temperatures T < 1), the more frequent words take over the less frequent ones; the opposite happens for negative temperatures (β < 0), when the least frequent words dominate. Entropies computed for different temperatures therefore uncover more information about the histogram of the sequence. (The special case T → ∞, i.e. β → 0, is called the topological case: every word present in the input sequence is represented in the histogram by 1, and every absent word by 0.)
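The spectrum computation under the transformation (3.19) can be sketched as follows; function and variable names are ours, not those of the ENTR_SPECTRUM utility:

```python
from collections import Counter
from math import log2

def entropy_spectrum(seq, n, betas):
    """Entropy of the beta-transformed length-n word distribution (eq. 3.19)
    for each inverse temperature beta = 1/T."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    spectrum = {}
    for beta in betas:
        weights = [p ** beta for p in probs]   # P_n(w)^beta
        z = sum(weights)                       # normalization of eq. (3.19)
        transformed = [w_ / z for w_ in weights]
        spectrum[beta] = -sum(q * log2(q) for q in transformed if q > 0)
    return spectrum
```

At β = 0 every observed word gets the same transformed probability, so the entropy equals log2 of the number of distinct observed words, i.e. the topological case described above.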

Sequences generated by the deterministic state automaton:

The deterministic state automaton which we used in the experiments is defined as follows:

A = (Symbols, States, Init, Generate, Next),

where Symbols is the output alphabet, |Symbols|=N,

States is the set of possible states of automaton,

Init∈ States is the initial state, and

Generate: States → Symbols is a function generating a symbol in each state, and

Next: States → States is a function which determines the successor of each state.
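A direct transcription of this definition (with our own function names) is enough to generate the deterministic sequences; the three-state example below emits the pattern '100' of the automaton A1 defined in section 3.3.3:

```python
def generate_det(init, generate, nxt, length):
    """Run A = (Symbols, States, Init, Generate, Next) for `length` steps
    and return the emitted symbol string."""
    state, out = init, []
    for _ in range(length):
        out.append(generate[state])   # Generate: States -> Symbols
        state = nxt[state]            # Next: States -> States
    return ''.join(out)

# A three-state automaton emitting '1' in s1 and '0' in s2 and s3.
generate = {'s1': '1', 's2': '0', 's3': '0'}
nxt = {'s1': 's2', 's2': 's3', 's3': 's1'}
```

Here generate_det('s1', generate, nxt, 9) yields the periodic string '100100100'.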


Sequences generated by the Hidden Markov Model automaton:

The HMM automaton which we used in the experiments is defined as follows:

A=(Symbols, States, Init, Prob, Trans),

where Symbols is the output alphabet, |Symbols|=N,

States is the set of possible states of automaton,

Init∈ States is the initial state,

Prob: States x Symbols→ [0,1] is a function that assigns output probabilities to

each of possible states and symbols, where

Σ(b ∈ Symbols) Prob(a, b) = 1 (3.20)

for all a∈ States and finally

Trans: States x States→[0,1] is a transition function which assigns a

probability of transition from one state to another. Again,

Σ(b ∈ States) Trans(a, b) = 1 (3.21)

for all a∈ States.

Figure 3.3.a: An example of the HMM automaton (state diagram).

The algorithm of the automaton can be summarized in these steps:

1. start in the initial state S←Init

2. according to the probability distribution Prob(S,Sym), choose a symbol Sym

3. output symbol Sym


4. according to the probability distribution Trans(S,newS), choose a new state newS

5. change the state S←newS and continue to step 2.
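The five steps above can be sketched as follows (random.choices is our choice of sampler; Prob and Trans are given here as dictionaries):

```python
import random

def generate_hmm(init, prob, trans, length, seed=0):
    """Steps 1-5: in each state emit a symbol drawn from Prob(S, .) and
    then move to a new state drawn from Trans(S, .)."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    state, out = init, []
    for _ in range(length):
        symbols, p = zip(*prob[state].items())
        out.append(rng.choices(symbols, weights=p)[0])   # steps 2-3
        states, t = zip(*trans[state].items())
        state = rng.choices(states, weights=t)[0]        # steps 4-5
    return ''.join(out)
```

With degenerate probabilities (all 0 or 1) the generator reproduces a deterministic automaton, which is how GENS_HMM was also used to generate the deterministic sequences (see section 3.3.3).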

An example of the HMM automaton is shown in Figures 3.3.a and 3.3.b.

Figure 3.3.b: An example of the HMM automaton (automaton definition).

3.3.3. Implementation

This section first describes the four sets of symbolic sequences A, B, C, D. Then we describe the design of the software built for our simulations, and finally we explore the parameters that can be used for tuning the model.

The set A consisted of these sequences:

• A1 – deterministic, generated by the automaton

A1=({‘0’,’1’},{s1,s2,s3},s1,σA1,δA1)

σA1(s1)=1 δA1(s1)=s2

σA1(s2)=0 δA1(s2)=s3

σA1(s3)=0 δA1(s3)=s1

• A2 – HMM:

A2=({‘0’,’1’},{s1,s2},s1, σA2,δA2)

σA2(s1,’0’)= 0.433 δA2(s1,s1)=0.75

σA2(s1,’1’)= 0.567 δA2(s1,s2)=0.25

σA2(s2,’0’)= 0.9 δA2(s2,s1)=0.25

σA2(s2,’1’)= 0.1 δA2(s2,s2)=0.75

A=(Symbols,States,Init,Prob,Trans),

Symbols={‘0’,’1’}, States={state 1,state 2,state 3}, Init=state 1,

Prob(state 1,‘0’)=0.1 Trans(state 1,state 1)=0.1 Trans(state 3,state 1)=0.6

Prob(state 1,‘1’)=0.9 Trans(state 1,state 2)=0.8 Trans(state 3,state 2)=0.2

Prob(state 2,‘0’)=0.9 Trans(state 1,state 3)=0.1 Trans(state 3,state 3)=0.2

Prob(state 2,‘1’)=0.1 Trans(state 2,state 1)=0.6

Prob(state 3,‘0’)=0.3 Trans(state 2,state 2)=0.1

Prob(state 3,‘1’)=0.7 Trans(state 2,state 3)=0.3


• A3 – random sequence containing the same number of symbols ‘1’ as the

sequence A2.

The set B consisted of these sequences:

• B1 – deterministic, generated by the automaton

B1=({‘0’,’1’},{s1,s2,s3,s4,s5,s6,s7,s8,s9,s10},s1,σB1,δB1)

σB1(s1)=’1’ σB1(s7)=’0’

σB1(s2)=’1’ σB1(s8)=’0’

σB1(s3)=’1’ σB1(s9)=’0’

σB1(s4)=’1’ σB1(s10)=’0’

σB1(s5)=’1’

σB1(s6)=’0’ δB1(si)= s(i mod 10)+1

• B2 – HMM:

B2=({‘0’,’1’},{s1,s2,s3,s4,s5,s6,s7,s8,s9,s10},s1,σB2,δB2)

σB2(s1,’0’)= 0.05 σB2(s1,’1’)= 0.95

σB2(s2,’0’)= 0.15 σB2(s2,’1’)= 0.85

σB2(s3,’0’)= 0.25 σB2(s3,’1’)= 0.75

σB2(s4,’0’)= 0.35 σB2(s4,’1’)= 0.65

σB2(s5,’0’)= 0.45 σB2(s5,’1’)= 0.55

σB2(s6,’0’)= 0.55 σB2(s6,’1’)= 0.45

σB2(s7,’0’)= 0.65 σB2(s7,’1’)= 0.35

σB2(s8,’0’)= 0.75 σB2(s8,’1’)= 0.25

σB2(s9,’0’)= 0.85 σB2(s9,’1’)= 0.15

σB2(s10,’0’)= 0.95 σB2(s10,’1’)= 0.05

δB2(si,sj)=0.25, for j=i

δB2(si,sj)=0.75, for j=(i mod 10)+1

δB2(si,sj)=0.0, otherwise.

• B3 – random sequence containing the same number of symbols ‘1’ as the

sequence B2.

The set C consisted of these sequences:


• C1 – deterministic, generated by the automaton

C1=({‘0’,’1’},{s1,s2,s3,s4},s1,σC1,δC1)

σC1(s1)=’0’

σC1(s2)=’1’

σC1(s3)=’1’

σC1(s4)=’0’ δC1(si)= s(i mod 4)+1

• C2 – HMM:

C2=({‘0’,’1’},{s1,s2,s3,s4},s1, σC2,δC2)

σC2(s1,’0’)= 0.6502 σC2(s1,’1’)= 0.3498

σC2(s2,’0’)= 0.0 σC2(s2,’1’)= 1.0

σC2(s3,’0’)= 0.0 σC2(s3,’1’)= 1.0

σC2(s4,’0’)= 1.0 σC2(s4,’1’)= 0.0

δC2(s1,s1)=0.7 δC2(s1,s2)=0.3 δC2(s1,s3)=δC2(s1,s4)=0.0

δC2(si,sj)=1.0, for j=(i mod 4)+1, i>1

δC2(si,sj)=0.0, otherwise.

• C3 – random sequence containing the same number of symbols ‘1’ as the

sequence C2.

The set D consisted of these sequences:

• D1 – deterministic, generated by the automaton

D1=({‘0’,’1’},{s1,s2,s3,s4,s5,s6},s1,σD1,δD1)

σD1(s1)=’0’ σD1(s5)=’0’

σD1(s2)=’1’ σD1(s6)=’0’

σD1(s3)=’1’

σD1(s4)=’0’ δD1(si)= s(i mod 6)+1

• D2 – HMM:

D2=({‘0’,’1’},{s1,s2,s3,s4,s5,s6},s1,σD2,δD2)

σD2(s1,’0’)= 0.9 σD2(s1,’1’)= 0.1

σD2(s2,’0’)= 0.1 σD2(s2,’1’)= 0.9

σD2(s3,’0’)= 0.1 σD2(s3,’1’)= 0.9

σD2(s4,’0’)= 0.9 σD2(s4,’1’)= 0.1


σD2(s5,’0’)= 0.1 σD2(s5,’1’)= 0.9

σD2(s6,’0’)= 0.9 σD2(s6,’1’)= 0.1

δD2(si,sj)=0.05, for j=i

δD2(si,sj)=0.95, for j=(i mod 6)+1

δD2(si,sj)=0.0, otherwise.

• D3 – random sequence containing the same number of symbols ‘1’ as the

sequence D2.

The sequences were computer generated using the software tools we have built

for our experiments. The diagram in Figure 3.4 shows the flow of data between the individual tools.

Listings of the programs and sample data files can be found in Appendix A. The

detailed account of each utility follows.

GENS_HMM name_of_task

is a general HMM automaton. It first reads the automaton parameters from text file

“name_of_task.hmm”. It starts in the initial state and produces a HMM sequence of

expected length. The sequence contains one symbol per line and it is saved into file

“name_of_task.trs”. This automaton was used for generating deterministic sequences

(by setting the HMM transition function Trans(si,sj)=1.0 if Next(si)=sj, and Trans(si,sj)=0.0 otherwise).

ONES name_of_file

counts the number of symbols ‘1’ in the input sequence. After reaching the end of file,

the overall probability of ‘1’s in the sequence is printed.

GENS_RND output_file length ones_probability

produces a file with length symbols ‘0’ or ‘1’. The probability of symbols ‘1’ in the

output sequence is specified by the argument ones_probability.

EN project_name length maxN [detail]

computes entropies of the sequence stored in the file “project_name.trs”. The length

argument specifies the maximum length of the input sequence and maxN specifies the

length of the largest window for which the entropy is computed.


BCMSNLIN project_name

is the main BCM neuron simulation utility. It reads parameters of the model from

“project_name.bcm” file, initializes all neurons and feeds the neurons’ inputs with

symbols from “project_name.trs”. It periodically stores actual weights and thresholds

θ into the output log-file “project_name.out”. The neuron training is performed by a

function train_cell() and the actual cell activity is computed by the function

cycle_cell().

PLOT format_file data_file [data_file …]

is a tool for plotting the data produced by BCMSNLIN on the computer screen, providing a fast way to tune the parameters of the model. The program reads information about how to plot the data from “format_file[.plt]”. The plotted data may originate in several data files, which allows comparing different output files and seeing trends influenced by changing parameters.

AVERAGE project_name columns skip n

is a simple utility for computing an average of given number of columns in the output

file “project_name.out”. Since the neuron’s weights need some time to settle down,

the program provides two arguments – skip and n which specify how many initial

outputs should be ignored and how many should be taken into account, respectively.

The output file “project_name.av” contains the averages for columns 1..columns.

COMPARE project_name column skip n

uses the average from the file “project_name.av” to produce a new symbolic sequence with symbols ‘0’ and ‘1’, based on the differences of the data in the specified column of “project_name.out” from its average. Symbols ‘0’ and ‘1’ mean below and above the average, respectively.
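The core of this transformation, thresholding a numeric series around its average, can be sketched as follows. This is a simplified version with our own naming; the real utility reads the average from the .av file and honours the skip and n arguments:

```python
def binarize_around_mean(values):
    """Emit '1' for values above the mean and '0' otherwise (cf. COMPARE)."""
    mean = sum(values) / len(values)
    return ''.join('1' if v > mean else '0' for v in values)
```

For example, binarize_around_mean([1.0, 3.0, 1.0, 3.0]) gives '0101'.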

ENTR_SPECTRUM project_name.trs sym_file temp_file block_length

computes entropy spectrum for symbolic sequence “project_name.trs” with symbols

specified in “sym_file” and for temperatures named in “temp_file” for a single block

length defined by block_length argument. The resulting table is written into file

“project_name_[window_length].es”.

Parameters of the simulation:

Before running the simulation, several parameters of the BCM neuron must be specified. Our model (always with only one input) has the following parameters:


• shape of the sigmoid function (specified by its range [sig_min,sig_max])

• c0 − constant for scaling θ

• input noise amplitude [-α,α]

• speed of learning (η)

• number of iterations (N)

• time averaging coefficient τ (specified by K=1/τ)

• which numeric value does symbol ‘0’ represent

• which numeric value does symbol ‘1’ represent

• output logging: how often the actual values of the weight and θ are stored into the output file (output_step), and whether averaged or instantaneous values should be used


Figure 3.4. Chart Diagram of the Simulation software.

[The diagram shows the data flow among the utilities GENS_HMM, GENS_RND, ONES, EN, BCMSNLIN, PLOT, AVERAGE, COMPARE and ENTR_SPECTRUM and their data files: .hmm (HMM automaton description), .trs (symbolic sequence), .bcm (simulation parameters), .out (log of the development of the weight and θ), .av (column averages), .en (entropies for different window lengths), .es (entropies for different temperatures), .plt (plot settings) and .tem (list of temperatures).]


3.3.4. Results

First, we will discuss the results we acquired from the plot program. We ran

experiments for all input sequences A1,A2,A3…D1,D2,D3. The entropies of

sequences were different (see Figure 3.5.). We hoped that the weight of the neuron

would follow these differences.

[Table: entropies H(N) and normalized entropies h(N) of the sequences A1 (DET), A2 (HMM) and A3 (RND), computed by the EN utility for window lengths N = 1 to 12.]

Figure 3.5.a. Entropies for sequences in set A. H(N) is the entropy for window length N and h(N) is the same normalized per symbol. The deterministic sequence has low entropy (h(N) → 0), whereas the random sequence has high entropy (h(N) close to 1, the maximum for a binary alphabet). The HMM automaton A2 produces a nearly random sequence, and therefore the difference in entropy between A2 and A3 is small.

[Table: entropies H(N) and normalized entropies h(N) of the sequences B1 (DET), B2 (HMM) and B3 (RND), computed by the EN utility for window lengths N = 1 to 12.]

Figure 3.5.b. Entropies for sequences in set B. In this case, the difference between the entropies of B2 and B3 is small, although larger than for set A.


[Table: entropies H(N) and normalized entropies h(N) of the sequences C1 (DET), C2 (HMM) and C3 (RND), computed by the EN utility for window lengths N = 1 to 12.]

Figure 3.5.c. Entropies for sequences in set C. The difference between the entropies of C2 and C3 is larger than for set B.

[Table: entropies H(N) and normalized entropies h(N) of the sequences D1 (DET), D2 (HMM) and D3 (RND), computed by the EN utility for window lengths N = 1 to 12.]

Figure 3.5.d. Entropies for sequences in set D. The difference between the entropies of D2 and D3 is large, because the HMM automaton for D2 contains deterministic parts.

However, the experiments with the linear neuron (i.e. without the sigmoid function, (3.5)) showed that the neuron is not very sensitive to differences in the entropy of the input sequences (see Figures 3.6 and 3.7). Tuning the parameters did not help. Therefore, we switched to a nonlinear neuron with the sigmoid transition function (3.6), which improved the situation. In most cases the neuron did recognize the differences in the input sequences, and the resulting weight (after convergence) separated sequences with different entropies (see Figures 3.8 and 3.9).


Figure 3.6. Development of the weight (y-axis) with time constant τ=10 for the linear neuron, for a symbolic time sequence of 200000 symbols (x-axis). We can see that the weight does not clearly separate the sequences. We used the following parameters: learning speed η=0.001, c0=0.85, cell_sig_min=−2, cell_sig_max=2, length of sequence: 200000 symbols. The chart was obtained with a shareware screen-grabbing utility; a “dilate” filter was applied to the resulting image to improve the visibility of the curves.


Figure 3.7. Development of the weight with time constant τ=20 for the linear neuron. Here the weights for the different sequences are even less distinguishable.

Figure 3.8. Development of the weight with time constant τ=10 for the nonlinear neuron. All other parameters remain the same as for the linear case. The weight of the input separates the sequences.


Figure 3.9. Development of the weight with time constant τ=10 for the nonlinear neuron.

An interesting effect appears for τ=2, when the weight difference is the largest (see Figure 3.10). This should be compared with the observation below that, for τ=2, the entropy spectrum of the θ time sequence correlates best with the entropy spectrum of the original sequence.

To test the robustness of the model, we introduced a noise parameter. Randomly distributed additive noise (not Gaussian) on the input had only a quantitative, not a qualitative, impact on the model. The weight separation was not as clear as in the noiseless case, which can be explained by a change in the properties of the input sequence: an HMM-generated sequence is already quite random, and adding noise results in a sequence which cannot be represented by the original HMM automaton and therefore has different properties. Because the noise did not qualitatively influence the model, we removed it from the experiments.

Changing the meaning of symbols ‘0’ and ‘1’ had the following effect: if we replaced ‘0’ with ‘0.25’ and ‘1’ with ‘0.75’, the adaptation process slowed down, but the final weight separation did not change. This is an important property of


our model, because the weight change is exactly zero for a pure ‘0’ in the input (see (3.10) and (3.11)).

The shape of the sigmoid function drives the degree of nonlinearity. For sig_min → −∞ and sig_max → ∞ we get a linear transition function. In our experiments with the linear neuron we used the values −400000 and 400000; for the nonlinear neuron we used −2 and 2. The shape has a major influence on the final point of weight convergence.

The θ scaling constant c0 is also used to drive the process of weight development. It is usually smaller than 2 (we used 0.85). Increasing the constant makes the φ parabola steeper and the input weight less stable.

Stability of the input weight is ensured by using a very small learning speed η. Reasonable stability is achieved for the value 0.001, which we used in the experiments.

Figure 3.10. Development of the weight with time constant τ=2 for the nonlinear neuron (a similar effect appears also for the linear neuron). All other parameters remain the same as in the former case. The weight of the input separates the sequences.


Dynamic changes in the input are compensated by the dynamic threshold θ (see

Figures 3.11 and 3.12). Increased potentiation of the weight increases the average activity and hence θ. A low probability of symbols ‘1’ in the input results

in decreasing the average neuron activity and also in decreasing the threshold. In this

way, the dynamic threshold adapts to the input. We should also note that although the

probability of ‘1’ is the important factor for the resulting converged weight, it is not

the only one. All sequences in one set contained the same number of symbols ‘1’, but

the resulting weights differ for the nonlinear neuron. To find other potentially significant properties of the input time sequences, we employed entropy spectra.

Figure 3.11. The development of the dynamic threshold θ for the symbolic time sequences in set B, computed with the same parameters and the linear neuron. Each point of the curve corresponds to θ averaged over the last 500 iterations. It takes up to 70000 iterations for θ and the input weight to stabilize; therefore, for computing the entropy spectra (see below), we used only the part of the sequence starting at the 70000th symbol.


Figure 3.12. If the onset of θ is too rapid, it might begin to oscillate. This is usual for the linear neuron and a very large time-window constant τ (here τ=1000; the other parameters are the same as in the former case).

The entropies of the input sequences in Figure 3.5 demonstrate that sequences with more complicated internal structure have higher entropy. The transformed entropies for β > 1 show the structure with respect to the very likely words, while entropies for negative β show the structure with respect to the very unlikely words.

The special case T → ∞, called the topological case, determines how many different words occur in the sequence.

Deterministic sequences have constant and low entropies for all temperatures, because they consist of a periodic subsequence repeated many times. On the other hand, random sequences contain all possible words for a given window length, distributed randomly rather than periodically; therefore their entropy is high. We computed the entropy spectra only up to window length 12, due to the computational complexity and the huge amount of data. It can also be observed for completely random sequences (Bernoulli source) that the entropy decreases for high temperatures; this is caused only by the bounded length of the sequence (the number of distinct words for a large window is high), since we worked with sequences of 200000 symbols.


The second part of the results discusses the entropy spectra of sequences created by comparing the actual value of the threshold θ with its long-term average. They are included in Appendix B.

The first type of figure consists of entropy spectra computed for a fixed window length. Different figures show these spectra for different values of the τ parameter of the BCM model. Our goal was to find some correlation between the structure of the sequence, the window length used for computing the entropy, and the τ parameter.

The second type of figure combines all figures of the first type for a particular sequence into one chart: the summed differences between the entropies of the individual sequences of θ changes and the entropy of the original sequence (both summed over all measured temperatures, with the values around β = 0 sampled more densely) are plotted against the window length used for computing the entropy. Each chart contains data for several different τ parameters, which allows, for example, comparing which sequence is the best model of the original sequence.

All figures reveal that the sequence of θ changes is most similar to the original sequence when the τ parameter is equal to 2 (except for τ=1, in which case the sequence of θ changes is almost identical to the original sequence). This supports the previous finding that the weight of the input connection separates the sequences best in the case τ=2.

When exploring the figures, the original sequence should be observed first. If the curve of the entropy spectrum turns down quickly, the sequence contains a few words which are much more probable than the others. If it declines slowly (as in the case of C3), most words present in the sequence are roughly equally frequent. The negative temperatures exhibit the situation for the less frequent words: again, if the decline is rapid, the sequence contains a few words that occur only rarely, while sequences with a moderate slope in the negative part of the spectrum curve contain many words of low frequency.

We have included all the entropy spectra for the sequences C2 and C3, because of the special nature of the C2 HMM automaton. When state 2 is reached, the automaton always generates the subsequence ‘110’. Therefore the frequency of words containing ‘110’ is very high.

Finally, we have to mention the experiments with the model with one recurrent connection. BCM has been successfully applied in several complicated architectures, some containing recurrent connections. Another motivation lies in the fact that recurrent neural networks are one of the usual tools for modeling symbolic time sequences. However, the difficulty lies in building the learning rule for the recurrent connection, which should be derived from a cost function, and a cost function requires a concrete goal to be reached by the network. We applied the learning rule for the input weight also to the weight of the recurrent connection. The difference in results was not significant enough to mention, and the challenge of developing a recurrent architecture remains for future work.


Resume

The first part of the thesis contains an overview of unsupervised artificial neural network architectures. Their principles, advantages and disadvantages are described. The second part focuses on a specific example of Hebbian learning, the BCM learning rule. In addition to a summary of the BCM bibliography, we have used the present variant of this rule for experiments with time sequences, for both linear and nonlinear transition functions. We have shown that the cell activity after some initial period depends on the internal structure of the input sequence. All sequences in one set, i.e. deterministic, HMM and random, contained the same number of symbols ‘1’, but they differed in the complexity of their internal structure. Our measure of this complexity was the value of the entropy per symbol. We have found that for two sequences with different entropies and the same number of symbols ‘1’, the resulting weight of the BCM neuron differs, i.e. it depends on the internal structure of the

input sequence. We have supported this statement by exploring the properties of both the input symbolic sequences and the symbolic sequences produced by comparing the dynamic threshold θ to its long-term average. We computed the entropies of the input sequences and the entropy spectra of the θ sequences. Here, we have found that both the nonlinear and the linear BCM neuron with the shortest “internal memory”, i.e. τ=2, are able to model the input sequence most closely. We discussed the role of the parameters of the model and demonstrated how the nonlinearity can improve the characteristics of the development of the converged input weight. Thanks to the entropy spectra, a better view of the internal structure of sequences can be obtained. We have demonstrated an example of the application of this technique to the study of symbolic time sequences.


References

[Bachman94] Bachman, C.M., Musman, S.A., Luong, D., and Shultz, A., Unsupervised BCM projection pursuit algorithms for classification of simulated radar presentations. Neural Networks, Vol. 7, No. 4, p. 709-728, 1994.

[Bear87] Bear, M.F., Cooper, L.N., and Ebner, F.F., A physiological basis for a theory of synapse modification. Science Wash. DC 237: p. 42-48, 1987.

[Beňušková94] Beňušková, Ľ., Diamond, M.E., Ebner, F.F., Dynamic synaptic modification threshold: computational model of experience-dependent plasticity in adult rat barrel cortex, Proc. Natl. Acad. Sci. USA, Vol. 91, p. 4791-4795, 1994.

[Beňušková97] Beňušková, Ľ., Modelling plasticity in rat barrel cortex induced by one spared whisker, in: Artificial Neural Networks: 7th International Conference; Proceedings (Lecture Notes in Computer Science, Vol. 1327), 1997.

[Bienenstock82] Bienenstock, E.L., Cooper, L.N., and Munro, P.W., Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex, J. Neuroscience, Vol. 2, p. 32-48, 1982.

[Clothiaux91] Clothiaux, E.E., Bear, M.F., Cooper, L.N., Synaptic plasticity in visual cortex: comparison of theory with experiment, J. Neurophysiology, Vol. 66, No. 5, 1991.

[Cooper79] Cooper, L.N., Liberman, F., and Oja, E., A theory for the acquisition and loss of neuron specificity in visual cortex, Biol. Cybern., Vol. 33, p. 9-28, 1979.

[Cooper88] Cooper, L.N., and Scofield, C.L., Mean-field theory of a neural network. Proc. Natl. Acad. Sci. USA, Vol. 85, p. 1973-77, 1988.

[Cooper97] Shouval, H., Intrator, N., Cooper, L.N., BCM network develops orientation selectivity and ocular dominance in natural scene environment, to appear in Vision Research, 1997.

[Cooper98] Castellani, G.C., Intrator, N., Shouval, H., Cooper, L.N., Characterizing solutions of a BCM learning rule in a network of lateral interacting non-linear neurons, tech. report, 1998.

[Fyfe95] Fyfe, C., A general exploratory projection pursuit network, Neural Processing Letters, Vol. 2(3), p. 17-19, 1995.

[Fyfe97] Fyfe, C., A comparative study of two neural methods of exploratory projection pursuit, Neural Networks, Vol. 10, No. 2, p. 255-262, 1997.


[Hebb49] Hebb, D., The organization of behavior. J. Wiley and Sons, New York, 1949.

[Hertz91] Hertz, J., Krogh, A., and Palmer, R., Introduction to the theory of neural computation. Addison-Wesley, Redwood City, CA, 1991.

[Hubel59] Hubel, D.H., and Wiesel, T.N., Integrative action in the cat’s lateral geniculate body. Journal of Physiology, Vol. 148, p. 574-591, 1959.

[Intrator92] Intrator, N., Cooper, L.N., Objective function formulation of the BCM theory of visual cortical plasticity: statistical connections, stability conditions, Neural Networks, Vol. 5, p. 3-17, 1992.

[Katok95] Katok, A., Hasselblatt, B., Introduction to the Modern Theory of Dynamical Systems, Cambridge University Press, 1995.

[Kohonen93] Kohonen, T., Physiological interpretation of the self-organizing map algorithm, Neural Networks, Vol. 6, p. 895-905, 1993.

[Kohonen95] Kohonen, T., Self-organizing Maps, Springer, 1995.

[Linsker86] Linsker, R., From basic network principles to neural architecture, Proceedings of the National Academy of Sciences of the USA, Vol. 83, p. 7508-7512, 1986.

[Nass75] Nass, M.N., and Cooper, L.N., A theory for the development of feature detecting cells in visual cortex, Biol. Cyb., Vol. 19, p. 1-18, 1975.

[Oja89] Oja, E., Neural networks, principal components, and subspaces, International Journal of Neural Systems, Vol. 1, p. 61-68, 1989.

[Rabiner86] Rabiner, L.R., Juang, B.H., An introduction to hidden Markov models, IEEE ASSP Magazine, Vol. 3, p. 4-16, 1986.

[Sanger89] Sanger, T.D., Optimal unsupervised learning in a single-layer linear feedforward neural network, Neural Networks, Vol. 2, p. 459-473, 1989.

[Selman96] Selman, B., Brooks, R.A., Dean, T., Horvitz, E., Mitchell, T.M., Nilsson, N.J., Challenges for artificial intelligence, Proceedings of AAAI-96, 1996.

[Tiňo96] Tiňo, P., Köteles, M., Modeling complex symbolic sequences with neural and hybrid neural based systems, submitted to IEEE Transactions on Neural Networks, 1996.


Appendix A: source code and examples of data files

All programs were written in ANSI C. We used a DOS version of the free GNU C compiler (GCC) from DJ Delorie and a makefile to build all the programs. The first part of this appendix contains the source code of all utilities we built for the experiments; the second part contains examples of data files.

Program source code

Sources of all programs are included in the file src.zip.

Appendix B: entropic spectra