
    Unsupervised Recursive Sequence Processing

Marc Strickert, Barbara Hammer

Research group LNM, Department of Mathematics/Computer Science, University of Osnabrück, Germany

Sebastian Blohm

Institute for Cognitive Science, University of Osnabrück, Germany

    Abstract

The self organizing map (SOM) is a valuable tool for data visualization and data mining for potentially high dimensional data of an a priori fixed dimensionality. We investigate SOMs for sequences and propose the SOM-S architecture for sequential data. Sequences of potentially infinite length are recursively processed by integrating the currently presented item and the recent map activation, as proposed in [11]. We combine that approach with the hyperbolic neighborhood of Ritter [29], in order to account for the representation of possibly exponentially increasing sequence diversification over time. Discrete and real-valued sequences can be processed efficiently with this method, as we will show in experiments. Temporal dependencies can be reliably extracted from a trained SOM. U-Matrix methods, adapted to sequence processing SOMs, allow the detection of clusters also for real-valued sequence elements.

Key words: Self-organizing map, sequence processing, recurrent models, hyperbolic SOM, U-Matrix, Markov models

    1 Introduction

Unsupervised clustering by means of the self organizing map (SOM) was first proposed by Kohonen [21]. The SOM makes the exploration of high dimensional data possible and it allows the exploration of the topological data structure. By SOM training, the data space is mapped to a typically two dimensional Euclidean grid of neurons, preferably in a topology preserving manner.

Email address: {marc,hammer}@informatik.uni-osnabrueck.de (Marc Strickert, Barbara Hammer).

    Preprint submitted to Elsevier Science 23 January 2004


Prominent applications of the SOM are WEBSOM for the retrieval of text documents and PicSOM for the recovery and ordering of pictures [18,25]. Various alternatives and extensions to the standard SOM exist, such as statistical models, growing networks, alternative lattice structures, or adaptive metrics [3,4,19,27,28,30,33].

If temporal or spatial data are dealt with, like time series, language data, or DNA strings, sequences of potentially unrestricted length constitute a natural domain for data analysis and classification. Unfortunately, the temporal scope is unknown in most cases, and therefore fixed vector dimensions, as used for the standard SOM, cannot be applied. Several extensions of the SOM to sequences have been proposed; for instance, time-window techniques or the data representation by statistical features make a processing with standard methods possible [21,28]. Due to data selection or preprocessing, information might get lost; for this reason, a data-driven adaptation of the metric or the grid is strongly advisable [29,33,36]. The first widely used application of the SOM in sequence processing employed the temporal trajectory of the best matching units of a standard SOM in order to visualize speech signals and their variations [20]. This approach, however, does not operate on sequences as they are; rather, the SOM is used for reducing the dimensionality of single sequence entries and acts as a preprocessing mechanism in this way. Proposed alternatives substitute the standard Euclidean metric by similarity operators on sequences, incorporating autoregressive processes or time warping strategies [16,26,34]. These methods are very powerful, but a major problem is their computational cost.

A fundamental way for sequence processing is a recursive approach. Supervised recurrent networks constitute a well-established generalization of standard feedforward networks to time series; many successful applications for different sequence classification and regression tasks are known [12,24]. Recurrent unsupervised models have also been proposed: the temporal Kohonen map (TKM) and the recurrent SOM (RSOM) use the biologically plausible dynamics of leaky integrators [8,39], as they occur in organisms, and explain phenomena such as direction selectivity in the visual cortex [9]. Furthermore, the models have been applied with moderate success to learning tasks [22]. Better results have been achieved by integrating these models into more complex systems [7,17]. Recent, more powerful approaches are the recursive SOM (RecSOM) and the SOM for structured data (SOMSD) [10,41]. These are based on a richer and explicit representation of the temporal context: they use the activation profile of the entire map or the index of the most recent winner. As a result, their representation ability is superior to RSOM and TKM.

A proposal to put existing unsupervised recursive models into a taxonomy can be found in [1,2]. The latter article identifies the entity "time context" used by the models as one of the main branches of the given taxonomy [2]. Although more general, the models are still quite diverse, and the recent developments of [10,11,35] are not included in the taxonomy. An earlier, simple, and elegant general description of recurrent models with an explicit notion of context has been introduced in [13,14].


place by the update rule

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$$

whereby $\gamma \in (0, 1)$ is the learning rate. The function $h_\sigma$ describes the amount of neuron adaptation in the neighborhood of the winner: often the Gaussian bell function $h_\sigma(x) = \exp(-x^2/\sigma^2)$ is chosen, the shape of which is narrowed during training by decreasing $\sigma$ to ensure the neuron specialization. The function $\mathrm{nhd}(n_i, n_j)$, which measures the degree of neighborhood of the neurons $n_i$ and $n_j$ within the lattice, might be induced by the simple Euclidean distance between the neuron coordinates in a rectangular grid or by the shortest distance in a graph connecting the two neurons.
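As an illustration of this Hebbian update, the following is a minimal sketch (not the authors' code) of one adaptation step, assuming `weights` is an (N, n) array of neuron weights and `grid_dist` a precomputed (N, N) matrix of lattice distances; all names are hypothetical.

```python
import numpy as np

def som_update(weights, grid_dist, winner, s_i, gamma=0.1, sigma=2.0):
    """One Hebbian SOM step: pull every weight towards the input s_i,
    scaled by the Gaussian lattice neighborhood of the winner."""
    h = np.exp(-grid_dist[winner] ** 2 / sigma ** 2)        # h_sigma(nhd(n_j0, n_j))
    return weights + gamma * h[:, None] * (s_i - weights)   # Delta w_j = gamma * h * (s_i - w_j)
```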

Recursive models substitute the one-shot distance computation for a single entry $s_i$ by a recursive formula over all entries of a given sequence $s$. For all models, sequences are presented recursively, and the current sequence entry $s_i$ is processed in the context which is set by its predecessors $s_{i+1}, s_{i+2}, \ldots$² The models differ with respect to the representation of the context and in the way that the context influences further computation.

The Temporal Kohonen Map (TKM) computes the distance of $s = (s_1, \ldots, s_t)$ from neuron $n_j$ labeled with $w_j \in \mathbb{R}^n$ by the leaky integration

$$d_{\mathrm{TKM}}(s, n_j) = \sum_{i=1}^{t} (1-\alpha)^{i-1} \, \|s_i - w_j\|^2$$

where $\alpha \in (0, 1)$ is a memory parameter [8]. A neuron becomes winner if the current entry $s_1$ is close to its weight $w_j$ as in the standard SOM, and, in addition, the remaining sum $(1-\alpha)\|s_2 - w_j\|^2 + (1-\alpha)^2\|s_3 - w_j\|^2 + \ldots$ is also small. This additional term integrates the distances of the neuron's weight from previous sequence entries, weighted by an exponentially decreasing decay factor $(1-\alpha)^{i-1}$. The context resulting from previous sequence entries points towards neurons whose weights have been close to previous entries. Thus, the winner is a neuron whose weight is close to the average presented signal for the recent time steps.

The training for the TKM takes place by Hebbian learning in the same way as for the standard SOM, making well-matching neurons more similar to the current input than bad-matching neurons. At the beginning, weights $w_j$ are initialized randomly and then iteratively adapted when data is presented.

² We use reverse indexing of the sequence entries, $s_1$ denoting the most recent entry, $s_2, s_3, \ldots$ its predecessors.


For adaptation, assume that a sequence $s$ is given, with $s_i$ denoting the current entry and $n_{j_0}$ denoting the best matching neuron for this time step. Then the weight correction term is

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j).$$

As discussed in [23], the learning rule of TKM is unstable and leads to only suboptimal results. The more advanced leaky integration of the Recurrent SOM (RSOM) first sums up the weighted directions and afterwards computes the distance [39]:

$$d_{\mathrm{RSOM}}(s, n_j) = \Big\| \sum_{i=1}^{t} (1-\alpha)^{i-1} (s_i - w_j) \Big\|^2 .$$

It represents the context in a larger space than TKM, since the vectors of directions are stored instead of the scalar Euclidean distance. More importantly, the training rule is changed. RSOM derives its learning rule directly from the objective to minimize the distortion error on sequences and thus adapts the weights towards the vector of integrated directions:

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot y_j \qquad \text{whereby} \qquad y_j = \sum_{i=1}^{t} (1-\alpha)^{i-1} (s_i - w_j) .$$

Again, the already processed part of the sequence produces a notion of context, and the neuron whose weight is most similar to the average entry for the past time steps becomes the winner for the current entry. The training rule of RSOM takes this fact into account by adapting the weights towards this averaged activation. We will not refer to this learning rule in the following. Instead, the way in which sequences are represented within these two models, and the ways to improve the representational capabilities of such maps, will be of interest.
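For concreteness, the two leaky-integrated distances above can be written down in a few lines; this is a minimal sketch under the reverse-indexing convention (seq[0] is the most recent entry), with hypothetical names and the memory parameter α as reconstructed above.

```python
import numpy as np

def d_tkm(seq, w, alpha=0.5):
    """TKM: sum of exponentially decayed squared distances to the weight w."""
    decay = (1 - alpha) ** np.arange(len(seq))
    return np.sum(decay * np.sum((seq - w) ** 2, axis=1))

def d_rsom(seq, w, alpha=0.5):
    """RSOM: the decayed difference vectors are summed first,
    the squared norm is taken afterwards."""
    decay = (1 - alpha) ** np.arange(len(seq))
    return np.sum(np.sum(decay[:, None] * (seq - w), axis=0) ** 2)
```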

Assuming vanishing neighborhood influence, one can analytically compute the internal representation of sequences for these two models, TKM and RSOM, i.e. the weight with optimum response to a given sequence $s = (s_1, \ldots, s_t)$: the weight $w$ is optimum for which

$$w = \sum_{i=1}^{t} (1-\alpha)^{i-1} s_i \Big/ \sum_{i=1}^{t} (1-\alpha)^{i-1}$$

holds [40]. This explains the encoding scheme of the winner-takes-all dynamics of TKM and RSOM. Sequences are encoded in the weight space by providing a recursive partitioning very much like the one generating fractal Cantor sets.


As an example for explaining this encoding scheme, assume that binary sequences $\{0,1\}^l$ are dealt with. For $\alpha = 0.5$, the representation of sequences of fixed length $l$ corresponds to an encoding in a Cantor set: the interval $[0, 0.5)$ represents sequences with most recent entry $s_1 = 0$, while the interval $[0.5, 1)$ contains only codes of sequences with most recent entry 1. Recursive decomposition of the intervals allows to recover further entries of the sequence: $[0, 0.25)$ stands for the beginning 00... of a sequence, $[0.25, 0.5)$ stands for 01, $[0.5, 0.75)$ for 10, and $[0.75, 1)$ represents 11. By further subdivision, $[0, 0.125)$ stands for the beginning 000..., $[0.125, 0.25)$ for 001, and so on. Similar encodings can be found for alternative choices of $\alpha$. Sequences over discrete sets $\Sigma = \{0, \ldots, d\} \subset \mathbb{R}$ can be uniquely encoded using this fractal partitioning if $\alpha < 1/d$. For larger $\alpha$, the subsets start to overlap, i.e. codes are no longer sorted according to their last symbols, and a code might stand for two or more different sequences. A very small $\alpha \ll 1/d$, in turn, results in an only sparsely used space; for example the interval $(d\alpha, 1]$ does not contain a valid code. Note that the explicit computation of this encoding stresses the superiority of the RSOM learning rule compared to the TKM update, as pointed out in [40]: the fractal code is a fixed point of the dynamics of RSOM training, whereas TKM converges towards the borders of the intervals, preventing the optimum fractal encoding scheme from developing on its own.
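The fixed-point formula above directly yields this fractal code; the following minimal sketch (hypothetical helper, $\alpha = 0.5$, reverse indexing) illustrates how binary sequences fall into the Cantor-like intervals described in the text.

```python
def fractal_code(seq, alpha=0.5):
    """Normalized leaky average that TKM/RSOM weights converge to;
    seq[0] is the most recent entry."""
    num = sum((1 - alpha) ** i * s for i, s in enumerate(seq))
    den = sum((1 - alpha) ** i for i in range(len(seq)))
    return num / den

# 0.0 lies in [0, 0.25) ("00..."), 0.43 in [0.25, 0.5) ("01..."), 0.57 in [0.5, 0.75) ("10...")
print(fractal_code([0, 0, 0]), fractal_code([0, 1, 1]), fractal_code([1, 0, 0]))
```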

Fractal encoding is reasonable, but limited: it is obviously restricted to discrete sequence entries, and real values or noise might destroy the encoded information. Fractal codes do not differentiate between sequences of different length; e.g. the code 0 gives optimum response to 0, 00, 000, and so forth. Sequences with this kind of encoding cannot be distinguished. In addition, the number of neurons does not influence the expressiveness of the context space. The range in which sequences are encoded is the same as the weight space. Thus, both the size of the weight space and the computation accuracy limit the number of different contexts, independently of the number of neurons of the network.

Based on these considerations, richer and in particular explicit representations of context have been proposed. The models that we introduce in the following extend the parameter space of each neuron $j$ by an additional vector $c_j$, which is used to explicitly store the sequential context within which a sequence entry is expected. Depending on the model, the context $c_j$ is contained in a representation space with different dimensionality. However, in all cases this space is independent of the weight space and extends the expressiveness of the models in comparison to TKM and RSOM. For each model, we will define the basic ingredients: what is the space of context representations? How is the distance between a sequence entry and neuron $j$ computed, taking into account its temporal context $c_j$? How are the weights and contexts adapted?

The Recursive SOM (RecSOM) [41] equips each neuron $n_j$ with a weight $w_j \in \mathbb{R}^n$ that represents the given sequence entry, as usual. In addition, a vector $c_j \in \mathbb{R}^N$ is provided, $N$ denoting the number of neurons, which explicitly represents the contextual map activation of all neurons in the previous time step.


Thus, the temporal context is represented in this model in an $N$-dimensional vector space. One can think of the context as an explicit storage of the activity profile of the whole map in the previous time step. More precisely, the distance is recursively computed by

$$d_{\mathrm{RecSOM}}((s_1, \ldots, s_t), n_j) = \alpha_1 \, \|s_1 - w_j\|^2 + \alpha_2 \, \|C_{\mathrm{RecSOM}}(s_2, \ldots, s_t) - c_j\|^2$$

where $\alpha_1, \alpha_2 > 0$ and

$$C_{\mathrm{RecSOM}}(s) = \big(\exp(-d_{\mathrm{RecSOM}}(s, n_1)), \ldots, \exp(-d_{\mathrm{RecSOM}}(s, n_N))\big)$$

constitutes the context. Note that this vector is almost the vector of distances of all neurons computed in the previous time step. These are exponentially transformed to avoid an explosion of the values. As before, the above distance can be decomposed into two parts: the winner computation similar to the standard SOM, and, as in the case of RSOM and TKM, a term which assesses the context match. For RecSOM the context match is a comparison of the current context when processing the sequence, i.e. the vector of distances of the previous time step, and the expected context $c_j$ which is stored at neuron $j$. That is to say, RecSOM explicitly stores context vectors for each neuron and compares these context vectors to their expected contexts during the recursive computation. Since the entire map activation is taken into account, sequences of any given fixed length can be stored, if enough neurons are provided. Thus, the representation space for context is no longer restricted by the weight space, and its capacity now scales with the number of neurons.

For RecSOM, training is done in Hebbian style for both weights and contexts. Denote by $n_{j_0}$ the winner for sequence entry $s_i$; then the weight changes are

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$$

and the context adaptation is

$$\Delta c_j = \gamma' \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (C_{\mathrm{RecSOM}}(s_{i+1}, \ldots, s_t) - c_j).$$

The latter update rule makes sure that the context vectors of the winner neuron and its neighborhood become more similar to the current context vector $C_{\mathrm{RecSOM}}$, which is computed when the sequence is processed. The learning rates are $\gamma, \gamma' \in (0, 1)$. As demonstrated in [41], this richer representation of context allows a better quantization of time series data. In [41], various quantitative measures to evaluate trained recursive maps are proposed, such as the temporal quantization error and the specialization of neurons. RecSOM turns out to be clearly superior to TKM and RSOM with respect to these measures in the experiments provided in [41].
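To make the recursion concrete, here is a minimal sketch (hypothetical names, not the original implementation) of how $d_{\mathrm{RecSOM}}$ and the context vector $C_{\mathrm{RecSOM}}$ can be computed for every neuron by running through the sequence from the oldest to the most recent entry; `a1` and `a2` stand for the weights $\alpha_1, \alpha_2$ as reconstructed above.

```python
import numpy as np

def recsom_distances(seq, W, C, a1=1.0, a2=1.0):
    """Return d_RecSOM of every neuron for the sequence seq, where seq[0] is
    the most recent entry (reverse indexing). W: (N, n) weights, C: (N, N) contexts."""
    ctx = np.zeros(W.shape[0])                    # activation profile of the previous step
    for s in seq[::-1]:                           # oldest entry first
        d = a1 * np.sum((W - s) ** 2, axis=1) + a2 * np.sum((C - ctx) ** 2, axis=1)
        ctx = np.exp(-d)                          # exponentially transformed activations
    return d
```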


However, the dimensionality of the context for RecSOM equals the number of neurons $N$, making this approach computationally quite costly. The training of very large maps with several thousands of neurons is no longer feasible for RecSOM. Another drawback is given by the exponential activity transfer function in the term $C_{\mathrm{RecSOM}} \in \mathbb{R}^N$: specialized neurons are characterized by the fact that they have only one or a few well-matching predecessors contributing values of about 1 to $C_{\mathrm{RecSOM}}$; however, for a large number $N$ of neurons, the noise influence on $C_{\mathrm{RecSOM}}$ from other neurons destroys the valid context information, because even poorly matching neurons contributing values of slightly above 0 are summed up in the distance computation.

SOM for structured data (SOMSD), as proposed in [10,11], is an efficient and still powerful alternative. SOMSD represents the temporal context by the corresponding winner index in the previous time step. Assume that a regular $l$-dimensional lattice of neurons is given. Each neuron $n_j$ is equipped with a weight $w_j \in \mathbb{R}^n$ and a value $c_j \in \mathbb{R}^l$ which represents a compressed version of the context, the location of the previous winner within the map [10]. The space in which context vectors are represented is the vector space $\mathbb{R}^l$ for this model. The distance of a sequence $s = (s_1, \ldots, s_t)$ from neuron $n_j$ is recursively computed by

$$d_{\mathrm{SOMSD}}((s_1, \ldots, s_t), n_j) = \alpha_1 \, \|s_1 - w_j\|^2 + \alpha_2 \, \|C_{\mathrm{SOMSD}}(s_2, \ldots, s_t) - c_j\|^2$$

where $C_{\mathrm{SOMSD}}(s)$ equals the location in the grid topology of the neuron $n_j$ with smallest $d_{\mathrm{SOMSD}}(s, n_j)$. Note that the context $C_{\mathrm{SOMSD}}$ is an element of a low-dimensional vector space, usually only $\mathbb{R}^2$. The distance between contexts is given by the Euclidean metric within this vector space. The learning dynamic of SOMSD is very similar to the dynamic of RecSOM: the current distance is defined as a mixture of two terms, the match of the neuron's weight and the current sequence entry, and the match of the neuron's context weight and the context currently computed in the model. Thereby, the current context is represented by the location of the winning neuron of the map in the previous time step. This dynamic imposes a temporal bias towards those neurons whose context vector matches the winner location of the previous time step. It relies on the fact that a lattice structure of neurons is defined and that a distance measure for locations within the map is available.

Due to the compressed context information, this approach is very efficient in comparison to RecSOM, and also very large maps can be trained. In addition, noise is suppressed in this compact representation. However, more complex context information is used than for TKM and RSOM, namely the location of the previous winner in the map. As for RecSOM, Hebbian learning takes place for SOMSD, because weight vectors and contexts are adapted in the well-known correction manner, here by the formulas

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$$


and

$$\Delta c_j = \gamma' \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (C_{\mathrm{SOMSD}}(s_{i+1}, \ldots, s_t) - c_j)$$

with learning rates $\gamma, \gamma' \in (0, 1)$; $n_{j_0}$ denotes the winner for sequence entry $s_i$. As demonstrated in [11], a generalization of this approach to tree structures can reliably model structured objects and their respective topological ordering.
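Compared with the RecSOM sketch above, only the form of the context changes: instead of the full activation profile, the previous winner's grid location is carried along. A minimal sketch under the same assumptions (`coords` holding the (N, 2) grid coordinates of the neurons, `C` the stored (N, 2) context locations):

```python
import numpy as np

def somsd_distances(seq, W, C, coords, a1=1.0, a2=1.0):
    """d_SOMSD of every neuron for seq (seq[0] most recent)."""
    ctx = coords.mean(axis=0)                     # neutral starting context
    for s in seq[::-1]:                           # oldest entry first
        d = a1 * np.sum((W - s) ** 2, axis=1) + a2 * np.sum((C - ctx) ** 2, axis=1)
        ctx = coords[np.argmin(d)]                # location of the current winner
    return d
```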

We would like to point out that, although these approaches seem different, they constitute instances of the same recursive computation scheme. As proved in [14], the underlying recursive update dynamics comply with

$$d((s_1, \ldots, s_t), n_j) = \alpha_1 \, \|s_1 - w_j\|^2 + \alpha_2 \, \|C(s_2, \ldots, s_t) - c_j\|^2$$

in all the cases; the specific similarity measures for weights and contexts are denoted by the generic expression $\|\cdot\|$. The approaches differ with respect to the concrete choice of the context $C$: TKM and RSOM refer to only the neuron itself and are therefore restricted to local fractal codes within the weight space; RecSOM uses the whole map activation, which is powerful but also expensive and subject to random neuron activations; SOMSD relies on compressed information, the location of the winner. Note that also standard supervised recurrent networks can be put into this generic dynamic framework by choosing the context as the output of the sigmoidal transfer function [14]. In addition, alternative compression schemes, such as a representation of the context by the winner content, are possible [37].

To summarize this section, essentially four different models have been proposed for processing temporal information. The models are characterized by the way in which context is taken into account within the map. The models are:

Standard SOM: no context representation; standard distance computation; standard competitive learning.

TKM and RSOM: no explicit context representation; the distance computation recursively refers to the distance of the previous time step; competitive learning for the weight, whereby (for RSOM) the averaged signal is used.

RecSOM: explicit context representation as the N-dimensional activity profile of the previous time step; the distance computation is given as a mixture of the current match and the match between the context stored at the neuron and the (recursively computed) current context given by the processed time series; competitive learning adapts the weight and context vectors.

SOMSD: explicit context representation as a low-dimensional vector, the location of the previously winning neuron in the map; the distance is computed recursively in the same way as for RecSOM, whereby a distance measure for locations in the map has to be provided; so far, the model is only available for standard rectangular Euclidean lattices; competitive learning adapts the weight and context vectors, whereby the context vectors are embedded in the Euclidean space.


In the following, we focus on the context representation by the winner index, as proposed in SOMSD. This scheme offers a compact and efficient context representation. However, it relies heavily on the neighborhood structure of the neurons, and faithful topological ordering is essential for appropriate processing. Since for sequential data, like for words in $\Sigma^*$, the number of possible strings is an exponential function of their length, a Euclidean target grid with its inherent power-law neighborhood growth is not suited for a topology preserving representation. The reason for this is that the storage of temporal data is related to the representation of trajectories on the neural grid. String processing means beginning at a node that represents the start symbol; then, how many nodes $n_s$ can in the ideal case uniquely be reached in a fixed number $s$ of steps? In grids with 6 neighbors per neuron, the triangular tessellation of the Euclidean plane leads to a hexagonal superstructure, inducing the surprising answer of $n_s = 6$ for any choice of $s > 0$. Providing 7 neighbors per neuron yields the exponential branching $n_s = 7 \cdot 2^{s-1}$ of paths.

In this respect, it is interesting to note that RecSOM can also be combined with alternative lattice structures; in [41] a comparison is presented of RecSOM with a standard rectangular topology and a data-optimum topology provided by neural gas (NG) [27,28]. The latter clearly leads to superior results. Unfortunately, it is not possible to combine the optimum topology of NG with SOMSD: for NG, no grid with straightforward neuron indexing exists. Therefore, context cannot be defined easily by referring back to the previous winner, because no similarity measure is available for indices of neurons within a grid topology.

Here, we extend SOMSD to grid structures with triangular grid connectivity in order to obtain a larger flexibility for the lattice design. Apart from the standard Euclidean plane, the sphere and the hyperbolic plane are alternative popular two-dimensional manifolds. They differ from the Euclidean plane with respect to their curvature: the Euclidean plane is flat, whereas the hyperbolic space has negative curvature, and the sphere is curved positively. By computing the Euler characteristics of all compact connected surfaces, it can be shown that only seven have non-negative curvature, implying that all but seven are locally isometric to the hyperbolic plane, which makes the study of hyperbolic spaces particularly interesting.³

The curvature has consequences on regular tessellations of the referred manifolds, as pointed out in [30]: the number of neighbors of a grid point in a regular tessellation of the Euclidean plane follows a power law, whereas the hyperbolic plane allows an exponential increase of the number of neighbors. The sphere yields compact lattices with vanishing neighborhoods, whereby a regular tessellation for which all vertices have the same number of neighbors is impossible (with the uninteresting exception of an approximation by one of the 5 Platonic solids). Since all these surfaces constitute two-dimensional manifolds, they can be approximated locally within a cell of the tessellation by a subset of the standard Euclidean plane without too much contortion.

³ For an excellent tool box and introduction to hyperbolic geometry see e.g. http://www.geom.uiuc.edu/docs/forum/hype/hype.html


A global isometric embedding, however, is not possible in general. Interestingly, for all such tessellations a data similarity measure is defined, and a possibly non-isometric visualization in the 2D plane can be achieved. While 6 neighbors per neuron lead to standard Euclidean triangular meshes, for a grid with 7 neighbors or more, the graph becomes part of the 2-dimensional hyperbolic plane. As already mentioned, exponential neighborhood growth is possible, and hence an adequate data representation can be expected for the visualization of domains with a high connectivity of the involved objects. The SOM with hyperbolic neighborhood (HSOM) has already proved well-suited for text representation, as demonstrated for a non-recursive model in [29].

    3 SOM for sequences (SOM-S)

In the following, we introduce the adaptation of SOMSD for sequences and the general triangular grid structure, the SOM for sequences (SOM-S). Standard SOMs operate on a rectangular neuron grid embedded in a real-valued vector space. More flexibility for the topological setup can be obtained by describing the grid in terms of a graph: neural connections are realized by assigning each neuron a set of direct neighbors. The distance of two neurons is given by the length of a shortest path within the lattice of neurons; each edge is assigned the unit length 1. The number of neighbors might vary (also within a single map). Less than 6 neighbors per neuron lead to a subsiding neighborhood, resulting in graphs with small numbers of nodes. Choosing more than 6 neighbors per neuron yields, as argued above, an exponential increase of the neighborhood size, which is convenient for representing sequences with potentially exponential context diversification.
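Since the lattice is fixed, these shortest-path distances can be computed once in advance; a minimal sketch (hypothetical adjacency-list representation `neighbors`, one breadth-first search per neuron):

```python
from collections import deque

def grid_distances(neighbors):
    """All-pairs shortest-path lengths on the neuron graph; neighbors[i] lists
    the direct lattice neighbors of neuron i, every edge has unit length."""
    N = len(neighbors)
    dist = [[None] * N for _ in range(N)]
    for start in range(N):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist
```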

Unlike the standard SOM or HSOM, we do not assume that a distance preserving embedding of the lattice into the two-dimensional plane, or into another globally parameterized two-dimensional manifold with global metric structure such as the hyperbolic plane, exists. Rather, we assume that the distance of neurons within the grid is computed directly on the neighborhood graph, which might be obtained by any non-overlapping triangulation of the topological two-dimensional plane.⁴ For our experiments, we have implemented a grid generator for a circular triangle meshing around a center neuron, which requires the desired number of neurons and the neighborhood degree n as parameters. Neurons at the lattice edge possess less than n neighbors, and if the chosen total number of neurons does not lead to filling up the outer neuron circle, neurons there are connected to others in a maximally symmetric way. Figure 1 shows a small map with 7 neighbors for the inner neurons, and a total of 29 neurons perfectly filling up the outer edge. For 7 neighbors, the exponential neighborhood increase can be observed, for which an embedding into the Euclidean plane is not possible without contortions; however, local projections in terms of a fish-eye magnification focus can be obtained (cf. [29]).

⁴ Since the lattice is fixed during training, these values have to be computed only once.


Fig. 1. Hyperbolic self-organizing map with context. Neuron n refers to the context given by the winner location in the map, indicated by the triangle of neurons N1, N2, and N3, and the precise coordinates α12, α13. If the previous winner has been D2, adaptation of the context along the dotted line takes place.


SOMSD adapts the location of the expected previous winner during training. For this purpose, we have to embed the triangular mesh structure into a continuous space. We achieve this by computing lattice distances beforehand, and then approximating the distance of points within a triangle-shaped map patch by the standard Euclidean distance. Thus, positions in the lattice are represented by three neuron indices, which identify the selected triangle of adjacent neurons, and two real numbers, which represent the position within the triangle. The recursive nature of the map is illustrated exemplarily in figure 1 for neuron $n$. This neuron $n$ is equipped with a weight $w \in \mathbb{R}^n$ and a context $c$ that is given by a location within the triangle of neurons N1, N2, and N3, expressing corner affinities by means of the linear combination parameters $\alpha_{12}$ and $\alpha_{13}$. The distance of a sequence $s$ from neuron $n$ is recursively computed by

$$d_{\text{SOM-S}}((s_1, \ldots, s_t), n) = \alpha \, \|s_1 - w\|^2 + (1 - \alpha) \, g(C_{\text{SOM-S}}(s_2, \ldots, s_t), c).$$

$C_{\text{SOM-S}}(s)$ is the index of the neuron $n_j$ in the grid with smallest distance $d_{\text{SOM-S}}(s, n_j)$. The function $g$ measures the grid distance of the triangular position $c = (N1, N2, N3, \alpha_{12}, \alpha_{13})$ to the winner as the shortest possible path in the mesh structure. Grid distances between neighboring neurons possess unit length, and the metric structure within the triangle N1, N2, N3 is approximated by the Euclidean metric. The range of $g$ is normalized by scaling with the inverse maximum grid distance. This mixture of hyperbolic grid distance and Euclidean distance is valid, because the hyperbolic space can locally be approximated by Euclidean space, which is exploited for computational convenience in both the distance calculation and the update.


Training is carried out by presenting a pattern $s = (s_1, \ldots, s_t)$, determining the winner $n_{j_0}$, and updating the weight and the context. Adaptation affects all neurons on the breadth-first search graph around the winning neuron according to their grid distances, in a Hebbian style. Hence, for the sequence entry $s_i$, weight $w_j$ is updated by $\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$. The learning rate $\gamma$ is typically exponentially decreased during training; as above, $h_\sigma(\mathrm{nhd}(n_{j_0}, n_j))$ describes the influence of the winner $n_{j_0}$ on the current neuron $n_j$ as a decreasing function of the grid distance. The context update is analogous: the current context, expressed in terms of neuron triangle corners and coordinates, is moved towards the previous winner along a shortest path. This adaptation yields positions on the grid only. Intermediate positions can be achieved by interpolation: if two neurons $N_i$ and $N_j$ exist in the triangle with the same distance, the midway is taken for the flat grids obtained by our grid generator. This explains why the update path for the current context towards D2, depicted as the dotted line in figure 1, is via D1. Since the grid distances are stored in a static matrix, a fast calculation of shortest path lengths is possible. The parameter $\alpha$ in the recursive distance calculation controls the balance between pattern and context influence; since initially nothing is known about the temporal structure, this parameter starts at 1, thus indicating the absence of context and resulting in the standard SOM. During training it is decreased to an application dependent value that mediates the balance between the externally presented pattern and the internally gained model of historic contexts.
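A much simplified sketch of one such training step is given below; it is not the authors' implementation: the within-triangle interpolation is ignored and the stored context of each neuron is reduced to a single neuron index, so the grid distance matrix from the earlier sketch can be reused directly. All names and the crude context update are assumptions.

```python
import numpy as np

def som_s_step(s_i, prev_winner, W, C_idx, grid_dist, alpha=0.97, gamma=0.01, sigma=2.0):
    """One simplified SOM-S step. W: (N, n) weights, C_idx: (N,) context neuron
    indices, grid_dist: (N, N) precomputed lattice distances, prev_winner: index
    of the winner of the previous time step."""
    g = grid_dist[C_idx, prev_winner] / grid_dist.max()          # normalized context mismatch
    d = alpha * np.sum((W - s_i) ** 2, axis=1) + (1 - alpha) * g
    winner = int(np.argmin(d))
    h = np.exp(-grid_dist[winner] ** 2 / sigma ** 2)             # lattice neighborhood of the winner
    W += gamma * h[:, None] * (s_i - W)                          # Hebbian weight update
    C_idx[h > 0.5] = prev_winner                                 # crude context update: snap strongly
    return winner                                                #   adapted neurons to the previous winner
```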

    Thus, we can combine the flexibility of general triangular and possibly hyperbolic

    lattice structures with the efficient context representation as proposed in [11].

    4 Evaluation measures of SOM

Popular methods to evaluate the standard SOM are the visual inspection, the identification of meaningful clusters, the quantization error, and measures for the topological ordering of the map. For recursive self-organizing maps, an additional dimension arises: the temporal dynamic stored in the context representations of the map.

    4.1 Temporal quantization error

Using ideas of Voegtlin [41], we introduce a method to assess the implicit representation of temporal dependencies in the map, and to evaluate to which extent a faithful representation of the temporal data takes place. The general quantization error refers to the distortion of each map unit with respect to its receptive field, which measures the extent of data space coverage by the units. If temporal data are considered, the distortion needs to be assessed back in time. For a formal definition, assume that a time series $(s_1, s_2, \ldots, s_t, \ldots)$ is presented to the network, again


with reverse indexing notation, i.e. $s_1$ is the most recent entry of the time series. Let $\mathrm{win}_i$ denote all time steps for which neuron $i$ becomes the winner in the considered recursive map model. The mean activation of neuron $i$ for time step $t$ in the past is the value

$$A_i(t) = \sum_{j \in \mathrm{win}_i} s_{j+t} \Big/ |\mathrm{win}_i| .$$

Assume that neuron $i$ becomes winner for a sequence entry $s_j$. It can then be expected that, as for the standard SOM, $s_j$ is close to the average $A_i(0)$, because the map is trained with Hebbian learning. Temporal specification takes place if, in addition, $s_{j+t}$ is close to the average $A_i(t)$ for $t > 0$. The temporal quantization error of neuron $i$ at time step $t$ back in the past is defined by

$$E_i(t) = \bigg( \sum_{j \in \mathrm{win}_i} \| s_{j+t} - A_i(t) \|^2 \bigg)^{1/2} .$$

This measures the extent up to which the values observed $t$ time steps back in the past coincide for a given winning neuron. Temporal specialization of neuron $i$ takes place if $E_i(t)$ is small for $t > 0$. Since no temporal context is learned for the standard SOM, the temporal quantization error will be large for $t > 0$, just reflecting specifics of the underlying time series such as smoothness or periodicity. For recursive models, this quantity allows us to assess the amount of temporal specification. The temporal quantization error of the entire map for $t$ time steps back into the past is defined as the average

$$E(t) = \sum_{i=1}^{N} E_i(t) \Big/ N .$$

This method allows us to evaluate whether the temporal dynamic in the recent past is faithfully represented.
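A minimal sketch of this measure, following the formulas as reconstructed above, is given below; forward time indexing is used, so the entry "t steps in the past" relative to step j is series[j - t], and neurons that never win contribute zero (a choice made here, not specified in the text). All names are hypothetical.

```python
import numpy as np

def temporal_quantization_error(series, winners, N, max_lag=29):
    """E(t) for t = 0..max_lag. series: (T, n) array of sequence entries,
    winners: length-T list with winners[j] = winning neuron at step j."""
    T = len(series)
    E = np.zeros(max_lag + 1)
    for t in range(max_lag + 1):
        errs = []
        for i in range(N):
            wins = [j for j in range(t, T) if winners[j] == i]
            if not wins:
                errs.append(0.0)                       # neuron never wins: no contribution
                continue
            past = series[np.array(wins) - t]          # entries t steps before each win of neuron i
            A_it = past.mean(axis=0)                   # mean activation A_i(t)
            errs.append(np.sqrt(np.sum((past - A_it) ** 2)))
        E[t] = np.mean(errs)                           # average over all N neurons
    return E
```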

    4.2 Temporal models

After the training of a recursive map, it can be used to obtain an explicit, possibly approximate description of the underlying global temporal dynamics. This offers another possibility to evaluate the dynamics of the SOM, because we can compare the extracted temporal model to the original one, if available, or to a temporal model extracted directly from the data. In addition, a compressed description of the global dynamics extracted from a trained SOM is interesting for data mining tasks. In particular, it can be tested whether the clustering properties of the SOM, referred to by U-matrix methods, transfer to the temporal domain.


Markov models constitute simple, though powerful, techniques for sequence processing and analysis [6,32]. Assume that $\Sigma = \{a_1, \ldots, a_d\}$ is a finite alphabet. The prediction of the next symbol refers to the task of anticipating the probability of $a_i$, having observed a sequence $s = (s_1, \ldots, s_t) \in \Sigma^*$ before. This is just the conditional probability $P(a_i \mid s)$. For finite Markov models, a finite memory length $l$ is sufficient to determine this probability, i.e. the probability

$$P(a_i \mid (s_1, \ldots, s_l, \ldots, s_t)) = P(a_i \mid (s_1, \ldots, s_l)), \qquad (t \ge l)$$

depends only on the past $l$ symbols instead of the whole context $(s_1, \ldots, s_t)$. Markov models can be estimated from given data if the order $l$ is fixed. It holds that

$$P(a_i \mid (s_1, \ldots, s_l)) = \frac{P((a_i, s_1, \ldots, s_l))}{\sum_j P((a_j, s_1, \ldots, s_l))} \qquad (1)$$

which means that the next symbol probability can be estimated from the frequencies of $(l + 1)$-grams.
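Equation (1) amounts to counting $(l+1)$-grams; a minimal sketch (hypothetical helper, forward text order, so the context is the $l$ symbols immediately preceding the predicted symbol):

```python
from collections import Counter

def next_symbol_probs(sequence, l):
    """Estimate P(a | context of length l) from (l+1)-gram frequencies, Eq. (1)."""
    ngrams = Counter(tuple(sequence[k:k + l + 1]) for k in range(len(sequence) - l))
    probs = {}
    for gram, count in ngrams.items():
        context, symbol = gram[:l], gram[l]
        total = sum(c for g, c in ngrams.items() if g[:l] == context)
        probs[(context, symbol)] = count / total
    return probs

# toy example: first-order transition probabilities of a short 0/1 string
print(next_symbol_probs("0010110100", 1))
```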

We are interested in the question whether a trained SOM-S can capture the essential probabilities for predicting the next symbol, generated by simple Markov models. For this purpose, we train maps on Markov models and afterwards extract the transition probabilities entirely from the obtained maps. This extraction can be done because of the specific form of context for SOM-S. Given a finite alphabet $\Sigma = \{a_1, \ldots, a_d\}$ for training, most neurons specialize during training and become winner for at least one or some stimuli. Winner neurons represent the input sequence entries by their trained weight vectors. Usually, the weight $w_i$ of neuron $n_i$ is very close to a symbol $a_j$ of $\Sigma$ and can thus be identified with the symbol. In addition, the neurons represent their context by an explicit reference to the location of the winner in the previous time step. The context vectors stored in the neurons define an intermediate winning position in the map, encoded by the parameters $(N1, N2, N3, \alpha_{12}, \alpha_{13})$ for the closest three neurons and the exact position within the triangle. We take this into account by extracting sequences corresponding to the averaged weights of all three potential winners of the previous time step. For the averaging, the contribution of each neuron to the interpolated position is considered. Repeating this back-referencing procedure recursively for each winner, weighted by its influence, yields an exponentially spreading number of potentially infinite time series for each neuron. This way, we obtain a probability distribution over time series that is representative for the history of each map neuron.⁵

⁵ Interestingly, one can formally prove that every finite length Markov model can be approximated by some map in this way in principle, i.e. for every Markov model of length $l$ a map exists such that the above extraction procedure yields the original model up to small deviations. Assume a fixed length $l$ and rational $P(a_i \mid (s_1, \ldots, s_l))$, and denote by $q$ the smallest common denominator of the transition probabilities. Consider a map in which for each symbol $a_i$ a cluster of neurons with weights $w_j = a_i$ exists. These main clusters are divided into subclusters enumerated by $s = (s_1, \ldots, s_l) \in \Sigma^l$ with $q \cdot P(a_i \mid s)$ neurons for each possible $s$. The context of each such neuron refers to another neuron within a cluster belonging to $s_1$ and to a subcluster belonging to $(s_2, \ldots, s_l, s_{l+1})$ for some arbitrary $s_{l+1}$. Note that the clusters can thereby be chosen contiguous on the map, respecting the topological ordering of the neurons. The extraction mechanism leads to the original Markov model (with rational probabilities) based on this map.


The number of specialized neurons for each time series is correlated to the probability of these stimuli in the original data source. Therefore, we can simply take the mean of the probabilities for all neurons and obtain a global distribution over all histories which are represented in the map. Since the standard SOM has a magnification factor different from 1, the number of neurons which represent a symbol $a_i$ deviates from the probability of $a_i$ in the given data [31]. This leads to a slightly biased estimation of the sequence probabilities represented by the map. Nevertheless, we will use the above extraction procedure as a sufficiently close approximation to the true underlying distribution. This compromise is taken because the magnification factor for recurrent SOMs is not known, and the techniques from [31] for its computation cannot be transferred to recurrent models. Our experiments confirm that the global trend is still correct. We have extracted, for every finite memory length $l$, the probability distribution for words in $\Sigma^{l+1}$ as they are represented in the map and determined the transition probabilities of equation (1).

The method as described above is a valuable tool to evaluate the representation capacity of the SOM for temporal structures. Obviously, fixed order Markov models can be better extracted directly from the given data, avoiding problems such as the magnification factor of the SOM. Hence, this method just serves as an alternative for the evaluation of temporal self-organizing maps and their capability of representing temporal dynamics. The situation is different if real-valued elements are processed, like in the case of obtaining symbolic structure from noisy sequences. Then, a reasonable quantization of the sequence entries must be found before a Markov model can be extracted from the data. The standard SOM together with U-matrix methods provides a valuable tool to find meaningful clusters in a given set of continuous data. It is an interesting question whether this property transfers to the temporal domain, i.e. whether meaningful clusters of real-valued sequence entries can also be extracted from a trained recursive model. SOM-S allows to combine both a reliable quantization of the sequence entries and the extraction mechanism for Markov models to take into account the temporal structure of the data.

For the extraction we extend U-Matrix methods to recursive models as follows [38]: the standard U-Matrix assigns to each neuron the averaged distance of its weight vector compared to its direct lattice neighbors:

$$U(n_i) = \sum_{\mathrm{nhd}(n_i, n_j) = 1} \| w_i - w_j \|$$
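With the neighbor lists and weights from the earlier sketches, the U-values can be computed in a few lines (a sketch; here the plain sum over direct neighbors is used, as in the formula above):

```python
import numpy as np

def u_matrix(W, neighbors):
    """U-value per neuron: distance of its weight to the weights of its direct
    lattice neighbors; large values mark borders between clusters."""
    return np.array([sum(np.linalg.norm(W[i] - W[j]) for j in neighbors[i])
                     for i in range(len(W))])
```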


In a trained map, neurons spread in regions of the data space where a high sample density can be observed, resulting in large U-values at the borders between clusters. Consequently, the U-Matrix forms a 3D landscape on the lattice of neurons, with valleys corresponding to meaningful clusters and hills at the cluster borders. The U-Matrix of weight vectors can be constructed also for SOM-S. Based on this matrix, the sequence entries can be clustered into meaningful categories, based on which the extraction of Markov models as described above is possible. Note that the U-Matrix is built by using only the weights assigned to the neurons, while the context information of SOM-S is ignored for now.⁶ However, since context information is used for training, clusters emerge which are meaningful with respect to the temporal structure, and this way they contribute implicitly to the topological ordering of the map and to the U-Matrix. Partially overlapping, noisy, and ambiguous input elements are separated during the training, because the different temporal contexts contain enough information to activate and produce characteristic clusters on the map. Thus, the temporal structure captured by the training allows a reliable reconstruction of the input sequences, which could not have been achieved by the standard SOM architecture.

    5 Experiments

    5.1 Mackey-Glass time series

The first task is to learn the dynamic of the real-valued chaotic Mackey-Glass time series

$$\frac{dx}{d\tau} = -b \, x(\tau) + \frac{a \, x(\tau - d)}{1 + x(\tau - d)^{10}}$$

using $a = 0.2$, $b = 0.1$, $d = 17$. This is the same setup as given in [41], making a comparison of the results possible.⁷ Three types of maps with 100 neurons have been trained: a 6-neighbor map without context, giving the standard SOM, a map with 6 neighbors and with context (SOM-S), and a 7-neighbor map providing a hyperbolic grid with context utilization (H-SOM-S). Each run has been computed with $1.5 \cdot 10^5$ presentations starting at random positions within the Mackey-Glass series, using a sample period of $\Delta t = 3$; the neuron weights have been initialized white within $[0.6, 1.4]$. The context has been taken into account by decreasing the parameter $\alpha$ from $\alpha = 1$ to $\alpha = 0.97$. The learning rate is exponentially decreased from 0.1 to 0.005 for the weight and context updates. The initial neighborhood cooperativity is 10, which is annealed to 1 during training.
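For reference, such a series can be generated by simple Euler integration of the delay differential equation above; this is a rough sketch (hypothetical helper; a smaller step size or a better integrator would give a more accurate trajectory):

```python
import numpy as np

def mackey_glass(T, a=0.2, b=0.1, d=17, dt=1.0, x0=1.2):
    """Euler integration of dx/dt = -b*x(t) + a*x(t-d) / (1 + x(t-d)**10)."""
    steps, delay = int(T / dt), int(d / dt)
    x = np.full(steps + delay, x0)
    for k in range(delay, steps + delay - 1):
        x_d = x[k - delay]
        x[k + 1] = x[k] + dt * (-b * x[k] + a * x_d / (1 + x_d ** 10))
    return x[delay:]

series = mackey_glass(3000)[::3]   # subsample with period 3, as in the setup above
```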

Figure 2 shows the temporal quantization error for the above setups: the temporal quantization error is expressed by the average standard deviation of the given sequence and the mean unit receptive field for 29 time steps into the past.

⁶ Preliminary experiments indicate that the context also orders topologically and yields meaningful clusters. The number of neurons in context clusters is thereby small compared to the number of neurons, and statistically significant results could not be obtained.
⁷ We would like to thank T. Voegtlin for providing data for comparison.


Similar to Voegtlin's results, we observe large cyclic oscillations for the standard SOM, driven by the periodicity of the training series. Since the SOM does not take contextual information into account, this quantization result can be seen as an upper bound for temporal models, at least for the indices > 0 reaching into the past (trivially, the SOM is a very good quantizer of scalar elements without history); the oscillating shape of the curve is explained by the continuity of the series and its quasi-periodic dynamic, and extrema exist rather by the nature of the series than by special model properties. Obviously, the very restricted context of RSOM does not yield a long term improvement of the temporal quantization error. However, the displayed error periodicity is anti-cyclic compared to the original series. Interestingly, the data-optimum topology of neural gas (NG), which also does not take contextual information into account, allows a reduction of the overall quantization error; however, the main characteristics, such as the periodicity, remain the same as for the standard SOM. RecSOM leads to a much better quantization error than RSOM and also NG. Thereby, the error is minimal for the immediate past (left side of the diagram) and increases when going back in time, which is reasonable because of the weighting of the context influence by $(1 - \alpha)$. The increase of the quantization error is smooth, and the final value after 29 time steps is better than the default given by the standard SOM. In addition, almost no periodicity can be observed for RecSOM. SOM-S and H-SOM-S further improve the results: only some periodicity can be observed, and the overall quantization error increases smoothly for the past values. Note that these models are superior to RecSOM in this task while requiring less computational power. H-SOM-S allows a slightly better representation of the immediate past compared to SOM-S due to the hyperbolic topology of the lattice structure, which better matches the characteristics of the input data.

Fig. 2. Temporal quantization errors of different model setups for the Mackey-Glass series, plotted against the index of past inputs (index 0: present); curves for SOM, RSOM, NG, RecSOM, SOM-S, and H-SOM-S. Results indicated by * are taken from [41].


    5.2 Binary automata

The second experiment is also inspired by Voegtlin. A discrete 0/1-sequence generated by a binary automaton with P(0|1) = 0.4 and P(1|0) = 0.3 shall be learned. For discrete data, the specialization of a neuron can be defined as the longest sequence that still leads to unambiguous winner selection. A high percentage of specialized neurons indicates that temporal context has been learned by the map. In addition, one can compare the distribution of specializations with the original distribution of strings as generated by the underlying probability. Figure 3 shows the specialization of a trained H-SOM-S. Training has been carried out with $3 \cdot 10^6$ presentations, increasing the context influence $(1 - \alpha)$ exponentially from 0 to 0.06. The remaining parameters have been chosen as in the first experiment. Finally, the receptive field has been computed by providing an additional number of $10^6$ test iterations. Putting more emphasis on the context results in a smaller number of active neurons representing rather long strings that cover only a small part of the total input space. If a Euclidean lattice is used instead of a hyperbolic neighborhood, the resulting quantizers differ only slightly, which indicates that the representation of binary symbols and their contexts in the 2-dimensional output space barely benefits from exponential branching. In the depicted run, 64 of the neurons express a clear profile, whereas the other neurons are located at sparse locations of the input data topology, between cluster boundaries, and thus do not win for the presented stimuli. The distribution corresponds nicely to the 100 most characteristic sequences of the probabilistic automaton, as indicated by the graph. Unlike for RecSOM (presented in [41]), neurons at interior nodes of the tree are also expressed for H-SOM-S. These nodes refer to transient states, which are represented by corresponding winners in the network. RecSOM, in contrast to SOM-S, does not rely on the winner index only, but uses a more complex representation: since the transient states are spared, longer sequences can be expressed by RecSOM.

(Figure 3 legend: 100 most likely sequences; H-SOM-S, 100 neurons; 64 specialized neurons; depth scale 0-11.)

Fig. 3. Receptive fields of a H-SOM-S compared to the most probable sub-sequences of the binary automaton. Left hand branches denote 0, right is 1.


Type           P(0)          P(1)          P(0|0)   P(1|0)   P(0|1)   P(1|1)
Automaton 1    4/7 ≈ 0.571   3/7 ≈ 0.429   0.7      0.3      0.4      0.6
Map (98/100)   0.571         0.429         0.732    0.268    0.366    0.634
Automaton 2    2/7 ≈ 0.286   5/7 ≈ 0.714   0.8      0.2      0.08     0.92
Map (138/141)  0.297         0.703         0.75     0.25     0.12     0.88
Automaton 3    0.5           0.5           0.5      0.5      0.5      0.5
Map (138/141)  0.507         0.493         0.508    0.492    0.529    0.471

Table 1. Results for binary automata extraction with different transition probabilities. The extracted probabilities clearly follow the original ones.

In addition to the examination of neuron specialization, the whole map representation can be characterized by comparing the input symbol transition statistics with the learned context-neuron relations. While the current symbol is coded by the winning neuron's weight, the previous symbol is represented by the average of the weights of the winner's context triangle neurons. The two obtained values, the neuron's state and the average state of the neuron's context, are clearly expressed in the trained map: only few neurons contain values in the indeterminate interval [1/3, 2/3], but most neurons specialize on values very close to 0 or 1. Results for the reconstruction of three automata can be found in table 1. For the reconstruction we have used the algorithm described in section 4.2 with memory length 1. The left column indicates the number of expressed neurons and the total number of neurons in the map. Note that the automata can be well reobtained from the trained maps. Again, the temporal dependencies are clearly captured by the maps.

    5.3 Reber grammar

In a third experiment we have used more structured symbolic sequences as generated by the Reber grammar illustrated in figure 4. The 7 symbols have been coded in a 6-dimensional Euclidean space by points that denote the same as a tetrahedron does with its four corners in three dimensions: all points have the same distance from each other.


    Fig. 4. State graph of the Reber grammar.


For training and testing we have taken the concatenation of randomly generated words, thus preparing sequences of $3 \cdot 10^6$ and $10^6$ input vectors, respectively. The map has a radius of 5 and contains 617 neurons on a hyperbolic grid. For the initialization and the training, the same parameters as in the previous experiment were used, except for an initially larger neighborhood range of 14, corresponding to the larger map. Context influence was taken into account by decreasing $\alpha$ from 1 to 0.8 during training. A number of 338 neurons developed a specialization for Reber strings with an average length of 7.23 characters. Figure 5 shows that the neuron specializations produce strict clusters on the circular grid, ordered in a topological way by the last character. In agreement with the grammar, the letter T takes the largest sector on the map. The underlying hyperbolic lattice gives rise to sectors, because they clearly minimize the boundary between the 7 classes. The symbol separation is further emphasized by the existence of idle neurons between the boundaries, which can be seen analogously to large values in a U-Matrix. Since neuron specialization proceeds from the most common states, which are the 7 root symbols, to the increasingly special cases, the central nodes have fallen idle after having served as signposts during training; finally, the most specialized nodes with their associated strings are found at the lattice edge on the outer ring. Much in contrast to this ordered hyperbolic target lattice, the result for the Euclidean grid in figure 7 shows a neuron arrangement in the form of polymorphic coherent patches.

Similar to the binary automata learning tasks, we have analyzed the map representation by reconstructing the trained data, backtracking all possible context sequences of each neuron up to length 3. Only 118 of all 343 combinatorially possible trigrams are realized. In a ranked table, the most likely 33 strings cover all attainable Reber trigrams. In the log-probability plot of figure 6 there is a leap between entry number 33 (TSS, valid) and 34 (XSX, invalid), emphasizing that the Reber characteristic has been captured. The correlation between the probabilities of Reber trigrams and their relative frequencies found in the map is 0.75. An explicit comparison of the probabilities of valid Reber strings can be found in figure 8. The values deviate from the true probabilities, in particular for cycles of the Reber graph, such as consecutive letters T and S, or the VPX-circle. This effect is due to the magnification factor of the SOM, which differs from 1 and is further amplified when sequences are processed in the proposed recursive manner.
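The reported ranking and the correlation of 0.75 refer to a comparison of the following kind; the dictionaries holding extracted frequencies and grammar probabilities are placeholders, not data from the paper.

    import numpy as np

    def rank_trigrams(extracted_freq):
        """Sort extracted trigram frequencies in descending order; the position
        where the first invalid trigram appears marks the 'leap' discussed above."""
        return sorted(extracted_freq.items(), key=lambda kv: -kv[1])

    def trigram_correlation(extracted_freq, grammar_prob):
        """Pearson correlation between trigram frequencies found in the map and
        the probabilities of the valid Reber trigrams (0.75 in the experiment)."""
        trigrams = sorted(grammar_prob)
        freq = np.array([extracted_freq.get(t, 0.0) for t in trigrams])
        prob = np.array([grammar_prob[t] for t in trigrams])
        return np.corrcoef(freq, prob)[0, 1]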

    5.4 Finite memory models

In a final series of experiments, we examine a SOM-S trained on Markov models with noisy input sequence entries. We investigate the possibility of extracting temporal dependencies on real-valued sequences from a trained map. The Markov model possesses a memory length of 2, as depicted in figure 9. The basic symbols are denoted by a, b, and c. These are embedded in two dimensions, disrupted by noise, as


Fig. 7. Arrangement of Reber words on a Euclidean lattice structure. The words are arranged according to their most recent symbols (shown on the right of the sequences). Patches emerge according to the most recent symbol. Within the patches, an ordering according to the preceding symbols can be observed.


    Fig. 8. Frequency reconstruction of trigrams from the Reber grammar.


follows: a stands for (0, 0) plus noise, b for (1, 0) plus noise, and c for (0, 1) plus noise, where the noise is independent and Gaussian with standard deviation g, a variable to be tested in the experiments. The symbols are denoted right to left, i.e. ab indicates that the currently emitted symbol is a, after having observed symbol b in the previous step. Thus, b and c are always succeeded by a, whereas a is succeeded with probability x by b and with probability (1 - x) by c if the past symbol was b, and vice versa if the last symbol was c. The transition probability x is varied between the experiments. We train a SOM-S with a regular rectangular two-dimensional lattice structure and 100 neurons on a generated Markov series. The context parameter was decreased from 0.97 to 0.93, the neighborhood radius was decreased from 5 to 0.5, and the learning rate was annealed from 0.02 to 0.005. A number of 1000 patterns are presented in 15000 cycles. U-Matrix clustering has been calculated with a level of the landscape chosen such that half the neurons are contained in valleys. The neurons in the same valley are assigned to the same cluster, and the number of different clusters is determined. Afterwards, all remaining neurons are assigned to their closest cluster.
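The valley-based clustering can be sketched as follows for a map on a regular rectangular grid; the U-Matrix array u and the use of connected-component labelling from scipy are our assumptions, while the criterion that half of the neurons lie below the chosen level follows the description above.

    import numpy as np
    from scipy.ndimage import label, distance_transform_edt

    def umatrix_valley_clustering(u):
        """Cluster neurons of a rectangular map from their U-Matrix values u (2D array)."""
        level = np.median(u)                    # half of the neurons lie in valleys
        valleys = u <= level
        clusters, n_clusters = label(valleys)   # connected valley regions = clusters
        # assign every remaining neuron to the cluster of its nearest valley neuron
        nearest = distance_transform_edt(~valleys, return_distances=False,
                                         return_indices=True)
        return clusters[tuple(nearest)], n_clusters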

First, we choose a noise level of g = 0.1, such that almost no overlap can be observed, and we investigate this setup for different x between 0 and 0.8. In all the results, three distinct clusters, corresponding to the three symbols, are found with the U-Matrix method. The extraction of the order 2 Markov models indicates that the global transition probabilities are correctly represented in the maps. Table 2 shows the corresponding extracted probabilities. The exact probabilities cannot be recovered because the magnification factor of the SOM differs from 1; however, the global trend is clearly found, and the extracted probabilities are in good agreement with the chosen values.

In a second experiment, the transition probability is fixed to x = 0.4, but the noise level is modified, choosing g between 0.1 and 0.5. All training parameters are chosen as in the previous experiment. Note that a noise level of g = 0.3 already yields much overlap of the classes, as depicted in figure 10. Nevertheless, three clusters can be detected in all cases and the transition probabilities can be recovered, except for a noise level of 0.5, for which the training scenario degenerates to an almost deterministic case, making a the most dominant state. Table 3 summarizes the extracted probabilities.

Fig. 9. Markov automaton with 3 basic states and a finite order of 2 used to train the map.
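For reference, the automaton of figure 9 can be turned into a data generator along the following lines (a sketch with our own naming; the start state is chosen arbitrarily).

    import numpy as np

    def generate_sequence(n, x, g, seed=0):
        """Generate n noisy 2D vectors from the order-2 Markov model of figure 9.

        a = (0, 0), b = (1, 0), c = (0, 1), each disturbed by Gaussian noise with
        standard deviation g; b and c are always followed by a, and a is followed
        by b with probability x if the symbol before it was b (by c otherwise),
        and vice versa if the symbol before it was c.
        """
        rng = np.random.default_rng(seed)
        proto = {'a': (0.0, 0.0), 'b': (1.0, 0.0), 'c': (0.0, 1.0)}
        prev, cur = 'b', 'a'                 # arbitrary valid start: ... b a
        symbols = []
        for _ in range(n):
            symbols.append(cur)
            if cur in ('b', 'c'):
                prev, cur = cur, 'a'         # b and c deterministically lead to a
            else:                            # cur == 'a': look at the symbol before it
                repeat = rng.random() < x
                nxt = ('b' if repeat else 'c') if prev == 'b' else ('c' if repeat else 'b')
                prev, cur = cur, nxt
        noise = rng.normal(0.0, g, size=(n, 2))
        return symbols, np.array([proto[s] for s in symbols]) + noise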


Fig. 10. Symbols a, b, c embedded in R^2 as a = (0, 0), b = (1, 0), and c = (0, 1), each subject to additive Gaussian noise of different levels: the noise levels shown are 0.1, 0.3, and 0.4. The latter two noise levels show considerable overlap of the classes which represent the symbols.

x        0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8
P(a|ab)  0     0.01  0     0.01  0     0.04  0     0.04  0.01
P(b|ab)  0     0.08  0.3   0.31  0.38  0.55  0.68  0.66  0.78
P(c|ab)  1     0.91  0.7   0.68  0.62  0.41  0.32  0.3   0.21
P(a|ac)  0     0     0     0     0     0.01  0.01  0     0.01
P(b|ac)  1     0.81  0.8   0.66  0.52  0.55  0.32  0.31  0.24
P(c|ac)  0     0.19  0.2   0.34  0.48  0.44  0.67  0.69  0.75

Table 2
Transition probabilities extracted from the trained map. The noise level was fixed to 0.1 and different generating transition probabilities x were used.

noise    0.1   0.2   0.3   0.4   0.5   true
P(a|ab)  0.01  0     0     0.1   0.98  0
P(b|ab)  0.42  0.49  0.4   0.24  0.02  0.4
P(c|ab)  0.57  0.51  0.6   0.66  0.02  0.6
P(a|ac)  0.01  0     0     0.09  0     0
P(b|ac)  0.59  0.6   0.44  0.39  0     0.6
P(c|ac)  0.4   0.4   0.56  0.52  0     0.4

Table 3
Probabilities extracted from the trained map with fixed input transition probabilities and different noise levels. For a noise level of 0.5, the extraction mechanism breaks down and the symbol a becomes most dominant. For smaller noise levels, extraction of the symbols can still be done even for overlapping clusters, because of the temporal differentiation of the clusters in recursive models.


    6 Conclusions

We have presented a self organizing map with a neural back-reference to the previously active sites and with a flexible topological structure of the neuron grid. For context representation, the compact and powerful SOMSD model as proposed in [11] has been used. Compared to TKM and RSOM, much more flexibility and expressiveness is obtained, because the context is represented in the space spanned by the neurons, and not only in the domain of the weight space. Compared to RecSOM, which is based on very extensive contexts, the SOMSD model is much more efficient. However, SOMSD requires an appropriate topological representation of the symbols, since distances of contexts are measured in the grid space. We have therefore extended the map configuration to more general triangular lattices, thus also making hyperbolic models possible as introduced in [30]. Our SOM-S approach has been evaluated on several data series including discrete and real-valued entries. Two experimental setups have been taken from [41] to allow a direct comparison with different models. As pointed out, the compact model introduced here improves on the capacity of simple leaky integrator networks like TKM and RSOM and shows results competitive with the more complex RecSOM.

Since the context of SOM-S directly refers to the previous winner, temporal contexts can be extracted from a trained map. An extraction scheme to obtain Markov models of fixed order has been presented, and its reliability has been confirmed in three experiments. As demonstrated, this mechanism can be applied to real-valued sequences, extending U-Matrix methods to the recursive case.

So far, the topological structure of context formation has not been taken into account during the extraction. Context clusters, in addition to weight clusters, provide more information, which might be used for the determination of appropriate orders of the models, or for the extraction of more complex settings like hidden Markov models. We are currently investigating experiments aiming at these issues. However, preliminary results indicate that Hebbian training, as introduced in this article, allows the reliable extraction of finite memory models only. More sophisticated training algorithms should be developed for more complex temporal dependencies.

Interestingly, the proposed context model can be interpreted as the development of long range synaptic connections, leading to more specialized map regions. Statistical counterparts to unsupervised sequence processing, like the Generative Topographic Mapping Through Time (GTMTT) [5], incorporate similar ideas by describing temporal data dependencies with hidden Markov latent space models. Such a context affects the prior distribution on the space of neurons. Due to computational restrictions, the transition probabilities of GTMTT are usually limited to local connections only. Thus, long range connections like those in the presented context model do not emerge; rather, visualizations similar to (though more powerful than) those of TKM and RSOM arise. It would be interesting to develop more efficient statistical counterparts which also allow the emergence of interpretable long range connections such as those of the deterministic SOM-S.


    References

[1] G. Barreto and A. Araujo. Time in self-organizing maps: An overview of models. Int. Journ. of Computer Research, 10(2):139-179, 2001.

[2] G. de A. Barreto, A. F. R. Araujo, and S. C. Kremer. A taxonomy for spatiotemporal connectionist networks revisited: the unsupervised case. Neural Computation, 15(6):1255-1320, 2003.

[3] H.-U. Bauer and T. Villmann. Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks, 8(2):218-226, 1997.

[4] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10(1):215-235, 1998.

[5] C. M. Bishop, G. E. Hinton, and C. K. I. Williams. GTM through time. Proceedings IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K., pages 111-116, 1997.

[6] P. Bühlmann and A. J. Wyner. Variable length Markov chains. Annals of Statistics, 27:480-513, 1999.

[7] O. A. Carpinteiro. A hierarchical self-organizing map for sequence recognition. Neural Processing Letters, 9(3):209-220, 1999.

[8] G. Chappell and J. Taylor. The temporal Kohonen map. Neural Networks, 6:441-445, 1993.

[9] I. Farkas and R. Miikkulainen. Modeling the self-organization of directional selectivity in the primary visual cortex. Proceedings of ICANN'99, Edinburgh, Scotland, pages 251-256, 1999.

[10] M. Hagenbuchner, A. C. Tsoi, and A. Sperduti. A supervised self-organising map for structured data. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organising Maps, pages 21-28. Springer, 2001.

[11] M. Hagenbuchner, A. Sperduti, and A. C. Tsoi. A Self-Organizing Map for Adaptive Processing of Structured Data. IEEE Transactions on Neural Networks, 14(3):491-505, 2003.

[12] B. Hammer. On the learnability of recursive data. Mathematics of Control, Signals, and Systems, 12:62-79, 1999.

[13] B. Hammer, A. Micheli, and A. Sperduti. A general framework for unsupervised processing of structured data. In M. Verleysen, editor, European Symposium on Artificial Neural Networks 2002, pages 389-394. D Facto, 2002.

[14] B. Hammer, A. Micheli, M. Strickert, and A. Sperduti. A general framework for unsupervised processing of structured data. To appear in: Neurocomputing.

[15] B. Hammer, A. Micheli, and A. Sperduti. A general framework for self-organizing structure processing neural networks. Technical report TR-03-04, Università di Pisa, 2003.

[16] J. Joutsensalo and A. Miettinen. Self-organizing operator map for nonlinear dimension reduction. Proceedings ICNN'95, 1:111-114, IEEE, 1995.

[17] J. Kangas. On the analysis of pattern sequences by self-organizing maps. PhD thesis, Helsinki University of Technology, Espoo, Finland, 1994.


[18] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM - self-organizing maps of document collections. Neurocomputing, 21(1):101-117, 1998.

[19] S. Kaski and J. Sinkkonen. A topography-preserving latent variable model with learning metrics. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organizing Maps, pages 224-229, Springer, 2001.

[20] T. Kohonen. The neural phonetic typewriter. Computer, 21(3):11-22, 1988.

[21] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 2001.

[22] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Recurrent SOM with local linear models in time series prediction. In M. Verleysen, editor, 6th European Symposium on Artificial Neural Networks, pages 167-172, De Facto, 1998.

[23] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Time series prediction using recurrent SOM with local linear models. International Journal of Knowledge-based Intelligent Engineering Systems, 2(1):60-68, 1998.

[24] S. C. Kremer. Spatio-temporal connectionist networks: A taxonomy and review. Neural Computation, 13(2):249-306, 2001.

[25] J. Laaksonen, M. Koskela, S. Laakso, and E. Oja. PicSOM - content-based image retrieval with self-organizing maps. Pattern Recognition Letters, 21(13-14):1199-1207, 2000.

[26] J. Lampinen and E. Oja. Self-organizing maps for spatial and temporal AR models. In M. Pietikäinen and J. Röning, editors, Proceedings of the 6th SCIA, pages 120-127, Helsinki, Finland, 1989.

[27] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1994.

[28] T. Martinetz, S. G. Berkovich, and K. J. Schulten. Neural-gas networks for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558-569, 1993.

[29] J. Ontrup and H. Ritter. Text categorization and semantic browsing with self-organizing maps on non-euclidean spaces. In L. D. Raedt and A. Siebes, editors, Proceedings of PKDD-01, pages 338-349. Springer, 2001.

[30] H. Ritter. Self-organizing maps on non-Euclidian spaces. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 97-110. Elsevier, 1999.

[31] H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, 1992.

[32] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. Machine Learning, 25:117-150, 1996.

[33] J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217-239, 2002.

[34] P. Sommervuo. Self-organizing maps for signal and symbol sequences. PhD thesis, Helsinki University of Technology, 2000.

[35] A. Sperduti. Neural networks for adaptive processing of structured data. In Proc. ICANN 2001, pages 5-12. Springer, 2001.

[36] M. Strickert, T. Bojer, and B. Hammer. Generalized relevance LVQ for time series. In Proc. ICANN 2001, pages 677-683. Springer, 2001.


[37] M. Strickert and B. Hammer. Neural Gas for Sequences. In Proc. WSOM'03, pages 53-57, 2003.

[38] A. Ultsch and C. Vetter. Selforganizing Feature Maps versus Statistical Clustering: A Benchmark. Research Report No. 9, Dep. of Mathematics, University of Marburg, 1994.

[39] M. Varsta, J. del R. Milan, and J. Heikkonen. A recurrent self-organizing map for temporal sequence processing. In Proc. ICANN'97, pages 421-426. Springer, 1997.

[40] M. Varsta, J. Heikkonen, and J. Lampinen. Analytical comparison of the temporal Kohonen map and the recurrent self organizing map. In M. Verleysen, editor, ESANN 2000, pages 273-280, De Facto, 2000.

[41] T. Voegtlin. Recursive self-organizing maps. Neural Networks, 15(8-9):979-991, 2002.
