8/3/2019 Marc Strickert, Barbara Hammer and Sebastian Blohm- Unsupervised Recursive Sequence Processing
Unsupervised Recursive Sequence Processing
Marc Strickert, Barbara Hammer
Research group LNM, Department of Mathematics/Computer Science,
University of Osnabrück, Germany
Sebastian Blohm
Institute for Cognitive Science,
University of Osnabrück, Germany
Abstract
The self organizing map (SOM) is a valuable tool for data visualization and data mining for potentially high dimensional data of an a priori fixed dimensionality. We investigate SOMs for sequences and propose the SOM-S architecture for sequential data. Sequences of potentially infinite length are recursively processed by integrating the currently presented item and the recent map activation, as proposed in [11]. We combine that approach with the hyperbolic neighborhood of Ritter [29], in order to account for the representation of possibly exponentially increasing sequence diversification over time. Discrete and real-valued sequences can be processed efficiently with this method, as we will show in experiments. Temporal dependencies can be reliably extracted from a trained SOM. U-Matrix methods, adapted to sequence processing SOMs, allow the detection of clusters also for real-valued sequence elements.
Key words: Self-organizing map, sequence processing, recurrent models, hyperbolic SOM, U-Matrix, Markov models
1 Introduction
Unsupervised clustering by means of the self organizing map (SOM) was first proposed by Kohonen [21]. The SOM makes the exploration of high dimensional data possible, and it allows the exploration of the topological data structure. By SOM training, the data space is mapped to a typically two-dimensional Euclidean grid
Email address: {marc,hammer}@informatik.uni-osnabrueck.de (Marc Strickert, Barbara Hammer).
Preprint submitted to Elsevier Science 23 January 2004
of neurons, preferably in a topology preserving manner. Prominent applications of
the SOM are WEBSOM for the retrieval of text documents and PicSOM for the
recovery and ordering of pictures [18,25]. Various alternatives and extensions to
the standard SOM exist, such as statistical models, growing networks, alternative
lattice structures, or adaptive metrics [3,4,19,27,28,30,33].
If temporal or spatial data are dealt with, such as time series, language data, or DNA strings, sequences of potentially unrestricted length constitute a natural domain for
data analysis and classification. Unfortunately, the temporal scope is unknown in
most cases, and therefore fixed vector dimensions, as used for standard SOM, can-
not be applied. Several extensions of SOM to sequences have been proposed; for
instance, time-window techniques or the data representation by statistical features
make a processing with standard methods possible [21,28]. Due to data selection or
preprocessing, information might get lost; for this reason, a data-driven adaptation
of the metric or the grid is strongly advisable [29,33,36]. The first widely used application of SOM in sequence processing employed the temporal trajectory of the best matching units of a standard SOM in order to visualize speech signals and the
variations thereof [20]. This approach, however, does not operate on sequences as they are; rather, SOM is used for reducing the dimensionality of single sequence entries and thus acts as a preprocessing mechanism. Proposed alternatives substitute the standard Euclidean metric by similarity operators on sequences, incorporating autoregressive processes or time warping strategies [16,26,34]. These methods are very powerful, but a major problem is their computational cost.
A fundamental way for sequence processing is a recursive approach. Supervised recurrent networks constitute a well-established generalization of standard feedforward networks to time series; many successful applications for different sequence classification and regression tasks are known [12,24]. Recurrent unsupervised models have also been proposed: the temporal Kohonen map (TKM) and the recurrent SOM (RSOM) use the biologically plausible dynamics of leaky integrators [8,39], as they occur in organisms, and explain phenomena such as direction selectivity in the visual cortex [9]. Furthermore, the models have been applied with moderate
success to learning tasks [22]. Better results have been achieved by integrating these
models into more complex systems [7,17]. More recent, powerful approaches are
the recursive SOM (RecSOM) and the SOM for structured data (SOMSD) [10,41].
These are based on a richer and explicit representation of the temporal context: they
use the activation profile of the entire map or the index of the most recent winner.
As a result, their representation ability is superior to RSOM and TKM.
A proposal to put existing unsupervised recursive models into a taxonomy can be
found in [1,2]. The latter article identifies the entity "time context" used by the models as one of the main branches of the given taxonomy [2]. Although more general, the models are still quite diverse, and the recent developments of [10,11,35] are
not included in the taxonomy. An earlier, simple, and elegant general description of
recurrent models with an explicit notion of context has been introduced in [13,14].
place by the update rule
Δwj = η hσ(nhd(nj0, nj)) (si − wj)

whereby η ∈ (0, 1) is the learning rate. The function hσ describes the amount of neuron adaptation in the neighborhood of the winner: often the Gaussian bell function hσ(x) = exp(−x²/σ²) is chosen, whose shape is narrowed during training by decreasing σ to ensure the neuron specialization. The function nhd(nj, nk), which measures the degree of neighborhood of the neurons nj and nk within the lattice, might be induced by the simple Euclidean distance between the neuron coordinates in a rectangular grid or by the shortest distance in a graph connecting the two neurons.
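For concreteness, one step of this Hebbian update can be sketched as follows (an illustrative sketch, not code from the paper; the toy one-dimensional chain, the learning rate η = 0.1, and the width σ = 1 are arbitrary choices):

```python
import math

def som_update(weights, grid_dist, winner, x, eta=0.1, sigma=1.0):
    """One Hebbian update step: every neuron j moves towards the input x,
    weighted by the Gaussian neighborhood h(d) = exp(-d^2 / sigma^2)
    of its grid distance d = nhd(winner, j)."""
    for j, w in enumerate(weights):
        h = math.exp(-grid_dist(winner, j) ** 2 / sigma ** 2)
        weights[j] = [wk + eta * h * (xk - wk) for wk, xk in zip(w, x)]
    return weights

# Toy chain of three neurons; grid distance = difference of indices.
weights = [[0.0], [0.5], [1.0]]
som_update(weights, lambda a, b: abs(a - b), winner=0, x=[0.2])
# The winner moves the most towards x; its neighbors move
# proportionally less, according to their grid distance.
```

The winner (distance 0, h = 1) receives the full step η(x − w), while each further neuron is damped by the Gaussian factor.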
Recursive models substitute the one-shot distance computation for a single entry
si by a recursive formula over all entries of a given sequence s. For all models, sequences are presented recursively, and the current sequence entry si is processed in the context set by its predecessors si+1, si+2, . . . ² The models differ with respect to the representation of the context and in the way that the context influences further computation.
The Temporal Kohonen Map (TKM) computes the distance of s = (s1, . . . , st) from neuron nj labeled with wj ∈ ℝ^n by the leaky integration

dTKM(s, nj) = Σ_{i=1..t} α (1 − α)^{i−1} ‖si − wj‖²

where α ∈ (0, 1) is a memory parameter [8]. A neuron becomes winner if the current entry s1 is close to its weight wj as in standard SOM, and if, in addition, the remaining sum α(1 − α)‖s2 − wj‖² + α(1 − α)²‖s3 − wj‖² + . . . is also small. This additional term integrates the distances of the neuron's weight from previous sequence entries, weighted by the exponentially decreasing decay factor α(1 − α)^{i−1}. The context resulting from previous sequence entries points towards neurons whose weights have been close to previous entries. Thus, the winner is a neuron whose weight is close to the average presented signal for the recent time steps.
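In code, the leaky integration reads as follows (a minimal sketch for scalar sequence entries, with seq[0] = s1 the most recent entry, following the reverse indexing of the text; the parameter values are illustrative):

```python
def d_tkm(seq, w, alpha=0.5):
    """d_TKM(s, n_j) = sum_i alpha * (1 - alpha)^(i-1) * ||s_i - w||^2,
    with seq[0] = s_1 (most recent entry)."""
    return sum(alpha * (1 - alpha) ** i * (s - w) ** 2
               for i, s in enumerate(seq))

# A neuron with weight 1.0 responds perfectly to a constant sequence
# of ones, and increasingly worse as past entries deviate from 1:
print(d_tkm([1.0, 1.0, 1.0], w=1.0))   # -> 0.0
print(d_tkm([1.0, 0.0, 0.0], w=1.0))   # -> 0.25 + 0.125 = 0.375
```

The discount factor makes deviations further in the past contribute less to the distance, which is exactly the leaky-integrator behavior described above.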
The training for the TKM takes place by Hebbian learning in the same way as for
the standard SOM, making well-matching neurons more similar to the current input
than bad-matching neurons. At the beginning, weights wj are initialized randomly and then iteratively adapted when data is presented. For adaptation, assume that a

² We use reverse indexing of the sequence entries, s1 denoting the most recent entry, s2, s3, . . . its predecessors.
sequence s is given, with si denoting the current entry and nj0 denoting the best matching neuron for this time step. Then the weight correction term is

Δwj = η hσ(nhd(nj0, nj)) (si − wj) .
As discussed in [23], the learning rule of TKM is unstable and leads to only suboptimal results. As a more advanced model, the Recurrent SOM (RSOM) performs the leaky integration differently: it first sums up the weighted directions and afterwards computes the distance [39]:

dRSOM(s, nj) = ‖ Σ_{i=1..t} α (1 − α)^{i−1} (si − wj) ‖² .
It represents the context in a larger space than TKM since the vectors of directions
are stored instead of the scalar Euclidean distance. More importantly, the training
rule is changed. RSOM derives its learning rule directly from the objective to minimize the distortion error on sequences and thus adapts the weights towards the vector of integrated directions:

Δwj = η hσ(nhd(nj0, nj)) yj(t)

whereby

yj(t) = Σ_{i=1..t} α (1 − α)^{i−1} (si − wj) .
Again, the already processed part of the sequence produces a context notion, and
the neuron becomes the winner for the current entry whose weight is most similar to the average entry for the past time steps. The training rule of RSOM takes this fact into account by adapting the weights towards this averaged activation. We will not refer to this learning rule in the following. Instead, the way in which sequences are represented within these two models, and the ways to improve the representational capabilities of such maps, will be of interest.
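The difference between the two distances, summing squared deviations (TKM) versus squaring the summed deviations (RSOM), can be made concrete in a few lines (a sketch with scalar entries and illustrative values; seq[0] is the most recent entry):

```python
def d_tkm(seq, w, alpha=0.5):
    # Sum of squared deviations, discounted into the past.
    return sum(alpha * (1 - alpha) ** i * (s - w) ** 2
               for i, s in enumerate(seq))

def d_rsom(seq, w, alpha=0.5):
    # Deviations are integrated first, then squared: opposite
    # directions can cancel each other.
    y = sum(alpha * (1 - alpha) ** i * (s - w)
            for i, s in enumerate(seq))
    return y ** 2

# For w = 0 the deviations +1 and -2 cancel under RSOM integration
# (0.5 * 1 + 0.25 * (-2) = 0), while TKM accumulates both:
print(d_rsom([1.0, -2.0], w=0.0))  # -> 0.0
print(d_tkm([1.0, -2.0], w=0.0))   # -> 0.5 + 1.0 = 1.5
```

The example shows that RSOM stores directional context: a neuron can respond optimally to a sequence whose entries deviate from its weight in mutually canceling directions, which TKM penalizes.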
Assuming vanishing neighborhood influences for both TKM and RSOM, one can analytically compute the internal representation of sequences for these two models, i.e. weights with response optimum to a given sequence s = (s1, . . . , st): the weight w is optimum for which

w = Σ_{i=1..t} (1 − α)^{i−1} si / Σ_{i=1..t} (1 − α)^{i−1}

holds [40]. This explains the encoding scheme of the winner-takes-all dynamics of TKM and RSOM. Sequences are encoded in the weight space by providing a
recursive partitioning very much like the one generating fractal Cantor sets. As an example for explaining this encoding scheme, assume that binary sequences {0,1}^l are dealt with. For α = 0.5, the representation of sequences of fixed length l corresponds to an encoding in a Cantor set: the interval [0, 0.5) represents sequences with most recent entry s1 = 0; the interval [0.5, 1) contains only codes of sequences with most recent entry 1. Recursive decomposition of the intervals allows one to recover further entries of the sequence: [0, 0.25) stands for the beginning 00. . . of a sequence, [0.25, 0.5) stands for 01, [0.5, 0.75) for 10, and [0.75, 1) represents 11. By further subdivision, [0, 0.125) stands for the beginning 000. . . , [0.125, 0.25) for 001, and so on. Similar encodings can be found for alternative choices of α. Sequences over discrete sets Σ = {0, . . . , d} ⊂ ℝ can be uniquely encoded using this fractal partitioning if α < 1/d. For larger α, the subsets start to overlap, i.e. codes are no longer sorted according to their last symbols, and a code might stand for two or more different sequences. A very small α ≪ 1/d, in turn, results in an only sparsely used space; for example, the interval (d·α, 1] does not contain a valid code. Note that the explicit computation of this encoding stresses the superiority of the RSOM learning rule compared to the TKM update, as pointed out in [40]: the fractal code is a fixed point for the dynamics of RSOM training, whereas TKM converges towards the borders of the intervals, preventing the optimum fractal encoding scheme from developing on its own.
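The interval encoding for α = 0.5 can be reproduced directly from the response-optimum formula stated above (a worked sketch; seq[0] is the most recent entry s1):

```python
def fractal_code(seq, alpha=0.5):
    """Response-optimum weight for a sequence:
    w = sum_i (1 - alpha)^(i-1) * s_i / sum_i (1 - alpha)^(i-1)."""
    num = sum((1 - alpha) ** i * s for i, s in enumerate(seq))
    den = sum((1 - alpha) ** i for i in range(len(seq)))
    return num / den

# The code lands in the interval that encodes the most recent entries:
print(fractal_code([0, 1, 0, 1]))  # in [0.25, 0.5): sequence starts with 01
print(fractal_code([1, 0, 1, 0]))  # in [0.5, 0.75): sequence starts with 10
```

Both codes evaluate to points inside the predicted Cantor-set intervals, illustrating how the recursive interval subdivision emerges from the leaky averaging alone.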
Fractal encoding is reasonable, but limited: it is obviously restricted to discrete
sequence entries, and real values or noise might destroy the encoded information.
Fractal codes do not differentiate between sequences of different length; e.g. the code 0 gives the optimum response to 0, 00, 000, and so forth. Sequences with this kind of encoding cannot be distinguished. In addition, the number of neurons does not influence the expressiveness of the context space. The range in which sequences are encoded is the same as the weight space. Thus, both the size of the weight space and the computation accuracy limit the number of different contexts, independently of the number of neurons of the network.
Based on these considerations, richer and in particular explicit representations of context have been proposed. The models that we introduce in the following extend the parameter space of each neuron j by an additional vector cj, which is used to explicitly store the sequential context within which a sequence entry is expected. Depending on the model, the context cj is contained in a representation space with different dimensionality. However, in all cases this space is independent of the weight space and extends the expressiveness of the models in comparison to TKM and RSOM. For each model, we will define the basic ingredients: what is the space of context representations? How is the distance between a sequence entry and neuron j computed, taking into account its temporal context cj? How are the weights and contexts adapted?
The Recursive SOM (RecSOM) [41] equips each neuron nj with a weight wj ∈ ℝ^n that represents the given sequence entry, as usual. In addition, a vector cj ∈
ℝ^N is provided, N denoting the number of neurons, which explicitly represents the contextual map activation of all neurons in the previous time step. Thus, the temporal context is represented in this model in an N-dimensional vector space. One can think of the context as an explicit storage of the activity profile of the whole map in the previous time step. More precisely, the distance is recursively computed by

dRecSOM((s1, . . . , st), nj) = α1 ‖s1 − wj‖² + α2 ‖CRecSOM(s2, . . . , st) − cj‖²

where α1, α2 > 0 and

CRecSOM(s) = (exp(−dRecSOM(s, n1)), . . . , exp(−dRecSOM(s, nN)))
constitutes the context. Note that this vector is almost the vector of distances of all
neurons computed in the previous time step. These are exponentially transformed
to avoid an explosion of the values. As before, the above distance can be decomposed into two parts: the winner computation similar to standard SOM, and, as in the case of RSOM and TKM, a term which assesses the context match. For RecSOM the context match is a comparison of the current context when processing the sequence, i.e. the vector of distances of the previous time step, and the expected context cj which is stored at neuron j. That is to say, RecSOM explicitly stores an expected context vector at each neuron and compares it to the current context during the recursive computation. Since the entire map activation is taken
into account, sequences of any given fixed length can be stored, if enough neurons
are provided. Thus, the representation space for context is no longer restricted by
the weight space and its capacity now scales with the number of neurons.
For RecSOM, training is done in Hebbian style for both weights and contexts. Denote by nj0 the winner for sequence entry si; then the weight changes are

Δwj = η hσ(nhd(nj0, nj)) (si − wj)

and the context adaptation is

Δcj = ηc hσ(nhd(nj0, nj)) (CRecSOM(si+1, . . . , st) − cj) .

The latter update rule makes sure that the context vectors of the winner neuron and its neighborhood become more similar to the current context vector CRecSOM, which is computed when the sequence is processed. The learning rates are η, ηc ∈ (0, 1). As demonstrated in [41], this richer representation of context allows a better quantization of time series data. In [41], various quantitative measures to evaluate trained recursive maps are proposed, such as the temporal quantization error and
the specialization of neurons. RecSOM turns out to be clearly superior to TKM and
RSOM with respect to these measures in the experiments provided in [41].
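The recursive RecSOM dynamics can be sketched as follows (scalar entries, processed oldest entry first; the initialization of the distance vector for the empty sequence is our assumption, and the two-neuron map with its weights and contexts is purely illustrative):

```python
import math

def recsom_distances(seq_oldest_first, W, C, a1=1.0, a2=0.5):
    """Distance d_RecSOM of every neuron after processing a scalar
    sequence.  W[j]: weight of neuron j; C[j]: its N-dimensional
    context vector.  The context handed to the next step is the
    exponentially transformed distance vector exp(-d)."""
    n = len(W)
    d = [0.0] * n  # assumed distances for the empty sequence
    for s in seq_oldest_first:
        ctx = [math.exp(-dj) for dj in d]
        d = [a1 * (s - W[j]) ** 2
             + a2 * sum((ctx[k] - C[j][k]) ** 2 for k in range(n))
             for j in range(n)]
    return d

W = [0.0, 1.0]                  # two neurons with scalar weights
C = [[1.0, 1.0], [1.0, 1.0]]    # illustrative context vectors
d = recsom_distances([0.0, 1.0], W, C)
winner = d.index(min(d))        # neuron 1 matches the final entry 1.0
```

Note that the context dimension grows with the number of neurons, which already hints at the computational cost discussed next.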
However, the dimensionality of the context for RecSOM equals the number of neurons N, making this approach computationally quite costly. The training of very large maps with several thousands of neurons is no longer feasible for RecSOM. Another drawback is given by the exponential activity transfer function in the term CRecSOM ∈ ℝ^N: specialized neurons are characterized by the fact that they have only one or a few well-matching predecessors contributing values of about 1 to CRecSOM; however, for a large number N of neurons, the noise influence on CRecSOM from other neurons destroys the valid context information, because even poorly matching neurons contributing values slightly above 0 are summed up in the distance computation.
SOM for structured data (SOMSD) as proposed in [10,11] is an efficient and still
powerful alternative. SOMSD represents temporal context by the corresponding winner index in the previous time step. Assume that a regular l-dimensional lattice of neurons is given. Each neuron nj is equipped with a weight wj ∈ ℝ^n and a value cj ∈ ℝ^l which represents a compressed version of the context, the location of the previous winner within the map [10]. The space in which context vectors are represented is the vector space ℝ^l for this model. The distance of the sequence s = (s1, . . . , st) from neuron nj is recursively computed by

dSOMSD((s1, . . . , st), nj) = α1 ‖s1 − wj‖² + α2 ‖CSOMSD(s2, . . . , st) − cj‖²
where CSOMSD(s) equals the location of the neuron nj with smallest dSOMSD(s, nj) in the grid topology. Note that the context CSOMSD is an element of a low-dimensional vector space, usually only ℝ². The distance between contexts is given by the Euclidean metric within this vector space. The learning dynamic of SOMSD is very similar to the dynamic of RecSOM: the current distance is defined as a mixture of two terms, the match of the neuron's weight and the current sequence entry, and the match of the neuron's context weight and the context currently computed in the model. Thereby, the current context is represented by the location of the winning neuron of the map in the previous time step. This dynamic imposes a temporal bias towards those neurons whose context vector matches the winner location of the previous time step. It relies on the fact that a lattice structure of neurons is defined and a distance measure of locations within the map is available.
Due to the compressed context information, this approach is very efficient in comparison to RecSOM, and also very large maps can be trained. In addition, noise is suppressed in this compact representation. However, more complex context information is used than for TKM and RSOM, namely the location of the previous winner in the map. As for RecSOM, Hebbian learning takes place for SOMSD: weight vectors and contexts are adapted in the usual correction manner, here by the formulas

Δwj = η hσ(nhd(nj0, nj)) (si − wj)
and
Δcj = ηc hσ(nhd(nj0, nj)) (CSOMSD(si+1, . . . , st) − cj)

with learning rates η, ηc ∈ (0, 1); nj0 denotes the winner for sequence entry si. As demonstrated in [11], a generalization of this approach to tree structures can
reliably model structured objects and their respective topological ordering.
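A sketch of the recursive winner computation with the compressed context (scalar entries, processed oldest first; the rest context c0 for the first entry is an assumption of ours, as is the two-neuron toy map):

```python
def somsd_winner(seq_oldest_first, W, C, coords, a1=1.0, a2=0.5,
                 c0=(0.0, 0.0)):
    """W[j]: weight of neuron j; C[j]: its expected previous-winner
    location; coords[j]: the location of neuron j in the lattice.
    The context handed to the next step is only the winner's location."""
    prev = c0
    for s in seq_oldest_first:
        d = [a1 * (s - W[j]) ** 2
             + a2 * sum((p - c) ** 2 for p, c in zip(prev, C[j]))
             for j in range(len(W))]
        winner = min(range(len(W)), key=d.__getitem__)
        prev = coords[winner]  # compressed context: location only
    return winner

W = [0.0, 1.0]
C = [(0.0, 0.0), (0.0, 0.0)]
coords = [(0.0, 0.0), (1.0, 0.0)]
print(somsd_winner([0.0, 1.0], W, C, coords))  # -> 1
```

Regardless of the map size, the context passed between time steps is a single low-dimensional location, which is the source of SOMSD's efficiency compared to RecSOM.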
We would like to point out that, although these approaches seem different, they constitute instances of the same recursive computation scheme. As proved in [14], the underlying recursive update dynamics comply with

d((s1, . . . , st), nj) = α1 ‖s1 − wj‖² + α2 ‖C(s2, . . . , st) − cj‖²

in all the cases; their specific similarity measures for weights and contexts are denoted by the generic expression ‖·‖. The approaches differ with respect to the concrete choice of the context C: TKM and RSOM refer to only the neuron itself and are therefore restricted to local fractal codes within the weight space; RecSOM
uses the whole map activation, which is powerful but also expensive and subject to random neuron activations; SOMSD relies on compressed information, the location of the winner. Note that also standard supervised recurrent networks can be
put into the generic dynamic framework by choosing the context as the output of
the sigmoidal transfer function [14]. In addition, alternative compression schemes,
such as a representation of the context by the winner content, are possible [37].
To summarize this section, essentially four different models have been proposed
for processing temporal information. The models are characterized by the way in
which context is taken into account within the map. The models are:
Standard SOM: no context representation; standard distance computation; standard competitive learning.
TKM and RSOM: no explicit context representation; the distance computation recursively refers to the distance of the previous time step; competitive learning for the weight, whereby (for RSOM) the averaged signal is used.
RecSOM: explicit context representation as N-dimensional activity profile of the previous time step; the distance computation is given as a mixture of the current match and the match between the context stored at the neuron and the (recursively computed) current context given by the processed time series; competitive learning adapts the weight and context vectors.
SOMSD: explicit context representation as low-dimensional vector, the location of the previously winning neuron in the map; the distance is computed recursively in the same way as for RecSOM, whereby a distance measure for locations in the map has to be provided; so far, the model is only available for standard rectangular Euclidean lattices; competitive learning adapts the weight and context vectors, whereby the context vectors are embedded in the Euclidean space.
In the following, we focus on the context representation by the winner index, as
proposed in SOMSD. This scheme offers a compact and efficient context repre-
sentation. However, it relies heavily on the neighborhood structure of the neurons,
and faithful topological ordering is essential for appropriate processing. Since for sequential data, such as words in Σ*, the number of possible strings is an exponential function of their length, a Euclidean target grid with inherent power-law neighborhood growth is not suited for a topology preserving representation. The reason for this is that the storage of temporal data is related to the representation of trajectories on the neural grid. String processing means beginning at a node that represents the start symbol; then, how many nodes ns can in the ideal case uniquely be reached in a fixed number s of steps? In grids with 6 neighbors per neuron, the triangular tessellation of the Euclidean plane leads to a hexagonal superstructure, inducing the surprising answer of ns = 6 for any choice of s > 0. Providing 7 neighbors per neuron yields the exponential branching ns = 7 · 2^{s−1} of paths.
In this respect, it is interesting to note that RecSOM can also be combined with
alternative lattice structures; in [41] a comparison is presented of RecSOM with a
standard rectangular topology and a data optimum topology provided by neural gas
(NG) [27,28]. The latter clearly leads to superior results. Unfortunately, it is not possible to combine the optimum topology of NG with SOMSD: for NG, no grid with straightforward neuron indexing exists. Therefore, context cannot be defined easily by referring back to the previous winner, because no similarity measure is available for indices of neurons within a grid topology.
Here, we extend SOMSD to grid structures with triangular grid connectivity in
order to obtain a larger flexibility for the lattice design. Apart from the standard
Euclidean plane, the sphere and the hyperbolic plane are alternative popular two-
dimensional manifolds. They differ from the Euclidean plane with respect to their
curvature: the Euclidean plane is flat, whereas the hyperbolic space has negative
curvature, and the sphere is curved positively. By computing the Euler characteristics of all compact connected surfaces, it can be shown that only seven have non-negative curvature, implying that all but seven are locally isometric to the hyperbolic plane, which makes the study of hyperbolic spaces particularly interesting.³
The curvature has consequences on regular tessellations of the referred manifolds as
pointed out in [30]: the number of neighbors of a grid point in a regular tessellation
of the Euclidean plane follows a power law, whereas the hyperbolic plane allows
an exponential increase of the number of neighbors. The sphere yields compact
lattices with vanishing neighborhoods, whereby a regular tessellation for which all
vertices have the same number of neighbors is impossible (with the uninteresting
exception of an approximation by one of the 5 Platonic solids). Since all these
surfaces constitute two-dimensional manifolds, they can be approximated locally
within a cell of the tessellation by a subset of the standard Euclidean plane without
³ For an excellent tool box and introduction to hyperbolic geometry see e.g. http://www.geom.uiuc.edu/docs/forum/hype/hype.html
too much contortion. A global isometric embedding, however, is not possible in
general. Interestingly, for all such tessellations a data similarity measure is defined
and possibly non-isometric visualization in the 2D plane can be achieved. While 6
neighbors per neuron lead to standard Euclidean triangular meshes, for a grid with
7 neighbors or more, the graph becomes part of the 2-dimensional hyperbolic plane.
As already mentioned, exponential neighborhood growth is possible and hence an
adequate data representation can be expected for the visualization of domains with
a high connectivity of the involved objects. SOM with hyperbolic neighborhood
(HSOM) has already proved well-suited for text representation as demonstrated for
a non-recursive model in [29].
3 SOM for sequences (SOM-S)
In the following, we introduce the adaptation of SOMSD for sequences and the
general triangular grid structure, SOM for sequences (SOM-S). Standard SOMs
operate on a rectangular neuron grid embedded in a real-valued vector space. More
flexibility for the topological setup can be obtained by describing the grid in terms of a graph: neural connections are realized by assigning each neuron a set of direct
neighbors. The distance of two neurons is given by the length of a shortest path
within the lattice of neurons. Each edge is assigned the unit length 1. The number of neighbors might vary (also within a single map). Less than 6 neighbors per neuron
lead to a subsiding neighborhood, resulting in graphs with small numbers of nodes.
Choosing more than 6 neighbors per neuron yields, as argued above, an exponential increase of the neighborhood size, which is convenient for representing sequences
with potentially exponential context diversification.
Unlike standard SOM or HSOM, we do not assume that a distance preserving embedding of the lattice into the two-dimensional plane or another globally parameterized two-dimensional manifold with global metric structure, such as the hyperbolic plane, exists. Rather, we assume that the distance of neurons within the grid is computed directly on the neighborhood graph, which might be obtained by any non-overlapping triangulation of the topological two-dimensional plane.⁴ For our experiments, we have implemented a grid generator for a circular triangle meshing around a center neuron, which requires the desired number of neurons and the neighborhood degree n as parameters. Neurons at the lattice edge possess less than n neighbors, and if the chosen total number of neurons does not lead to filling up the outer neuron circle, neurons there are connected to others in a maximally symmetric way. Figure 1 shows a small map with 7 neighbors for the inner neurons, and a total of 29 neurons perfectly filling up the outer edge. For 7 neighbors, the exponential neighborhood increase can be observed, for which an embedding into
⁴ Since the lattice is fixed during training, these values have to be computed only once.
Fig. 1. Hyperbolic self organizing map with context. Neuron n refers to the context given by the winner location in the map, indicated by the triangle of neurons N1, N2, and N3, and the precise coordinates β12, β13. If the previous winner has been D2, adaptation of the context along the dotted line takes place.
the Euclidean plane is not possible without contortions; however, local projections
in terms of a fish eye magnification focus can be obtained (cf. [29]).
SOMSD adapts the location of the expected previous winner during training. For this purpose, we have to embed the triangular mesh structure into a continuous space. We achieve this by computing lattice distances beforehand, and then we approximate the distance of points within a triangle-shaped map patch by the standard Euclidean distance. Thus, positions in the lattice are represented by three neuron indices, which represent the selected triangle of adjacent neurons, and two real numbers, which represent the position within the triangle. The recursive nature of the map is illustrated exemplarily in figure 1 for neuron n. This neuron n is equipped with a weight w ∈ ℝ^n and a context c that is given by a location within the triangle of neurons N1, N2, and N3, expressing corner affinities by means of the linear combination parameters β12 and β13. The distance of a sequence s from neuron n is recursively computed by

dSOM-S((s1, . . . , st), n) = α ‖s1 − w‖² + (1 − α) g(CSOM-S(s2, . . . , st), c) .

CSOM-S(s) is the index of the neuron nj in the grid with smallest distance dSOM-S(s, nj); g measures the grid distance of the triangular position c = (N1, N2, N3, β12, β13) to this winner as the shortest possible path in the mesh structure. Grid distances between neighboring neurons possess unit length, and the metric structure within the triangle N1, N2, N3 is approximated by the Euclidean metric. The range of g is normalized by scaling with the inverse maximum grid distance. This mixture of hyperbolic grid distance and Euclidean distance is valid, because the hyperbolic space can locally be approximated by Euclidean space, which is applied for computational convenience to both distance calculation and update.
Training is carried out by presenting a pattern s = (s1, . . . , st), determining the winner nj0, and updating the weight and the context. Adaptation affects all neurons on the breadth-first search graph around the winning neuron according to their grid distances, in a Hebbian style. Hence, for the sequence entry si, weight wj is updated by Δwj = η hσ(nhd(nj0, nj)) (si − wj). The learning rate η is typically exponentially decreased during training; as above, hσ(nhd(nj0, nj)) describes the influence of the winner nj0 on the current neuron nj as a decreasing function of grid distance. The context update is analogous: the current context, expressed in terms of neuron triangle corners and coordinates, is moved towards the previous winner along a shortest path. This adaptation yields positions on the grid only. Intermediate positions can be achieved by interpolation: if two neurons Ni and Nj exist in the triangle with the same distance, the midway is taken for the flat grids obtained by our grid generator. This explains why the update path, depicted as the dotted line in figure 1, for the current context towards D2 is via D1. Since the grid distances are stored in a static matrix, a fast calculation of shortest path lengths is possible. The parameter α in the recursive distance calculation controls the balance between pattern and context influence; since initially nothing is known about the temporal structure, this parameter starts at 1, indicating the absence of context and resulting in standard SOM. During training it is decreased to an application-dependent value that mediates the balance between the externally presented pattern and the internally gained model of historic contexts.
Thus, we can combine the flexibility of general triangular and possibly hyperbolic
lattice structures with the efficient context representation as proposed in [11].
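A sketch of one such training step, with the context simplified to continuous 2D grid coordinates (the full model moves triangle coordinates along shortest grid paths; all names here are illustrative assumptions, not the paper's code):

```python
# One Hebbian SOM-S training step (illustrative sketch).
# grid_pos[i] is the lattice position of neuron i, grid_dist the static
# matrix of shortest-path lengths, h a decreasing neighborhood function.

def train_step(s_i, winner, prev_winner, weights, contexts, grid_pos,
               grid_dist, eta, h):
    for j in range(len(weights)):
        influence = eta * h(grid_dist[winner][j])
        # weight update: move towards the current sequence entry
        weights[j] += influence * (s_i - weights[j])
        # context update: move towards the previous winner's grid position
        if prev_winner is not None:
            target = grid_pos[prev_winner]
            contexts[j][0] += influence * (target[0] - contexts[j][0])
            contexts[j][1] += influence * (target[1] - contexts[j][1])
```

Both updates share the same neighborhood scaling, so weight and context profiles order simultaneously on the lattice.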
4 Evaluation measures of SOM
Popular methods to evaluate the standard SOM are visual inspection, the identification of meaningful clusters, the quantization error, and measures for the topological ordering of the map. For recursive self organizing maps, an additional dimension arises: the temporal dynamics stored in the context representations of the map.
4.1 Temporal quantization error
Using ideas of Voegtlin [41], we introduce a method to assess the implicit representation of temporal dependencies in the map, and to evaluate to which extent a faithful representation of the temporal data takes place. The general quantization error refers to the distortion of each map unit with respect to its receptive field, which measures the extent of data space coverage by the units. If temporal data are considered, the distortion needs to be assessed back in time. For a formal definition, assume that a time series (s1, s2, . . . , st, . . .) is presented to the network, again
14
8/3/2019 Marc Strickert, Barbara Hammer and Sebastian Blohm- Unsupervised Recursive Sequence Processing
15/30
with reverse indexing notation, i.e. s1 is the most recent entry of the time series. Let win_i denote all time steps for which neuron i becomes the winner in the considered recursive map model. The mean activation of neuron i for time step t in the past is the value

  A_i(t) = Σ_{j ∈ win_i} s_{j+t} / |win_i| .
Assume that neuron i becomes winner for a sequence entry sj. It can then be expected that sj is, as in the standard SOM, close to the average A_i(0), because the map is trained with Hebbian learning. Temporal specialization takes place if, in addition, s_{j+t} is close to the average A_i(t) for t > 0. The temporal quantization error of neuron i at time step t back in the past is defined by

  E_i(t) = ( Σ_{j ∈ win_i} ‖ s_{j+t} − A_i(t) ‖² )^{1/2} .
This measures the extent to which the values observed t time steps back in the past coincide for a winning neuron. Temporal specialization of neuron i takes place if E_i(t) is small for t > 0. Since no temporal context is learned by the standard SOM, its temporal quantization error will be large for t > 0, just reflecting specifics of the underlying time series such as smoothness or periodicity. For recursive models, this quantity allows us to assess the amount of temporal specialization. The temporal quantization error of the entire map for t time steps back into the past is defined as the average

  E(t) = Σ_{i=1}^{N} E_i(t) / N .

This method allows us to evaluate whether the temporal dynamics of the recent past are faithfully represented.
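The quantities A_i(t), E_i(t), and E(t) can be computed directly from a winner log of a trained map. A Python sketch with forward time indexing, so that the entry t steps in the past of position j is series[j − t] (names are illustrative):

```python
import numpy as np

# Temporal quantization error (sketch). `series` is the time series and
# `winners[j]` the winner index for the window ending at position j;
# both would come from a trained recursive map.

def temporal_quantization_error(series, winners, n_neurons, max_lag):
    series = np.asarray(series, dtype=float)
    E = np.zeros((n_neurons, max_lag + 1))
    for i in range(n_neurons):
        steps = [j for j, w in enumerate(winners) if w == i and j >= max_lag]
        if not steps:
            continue
        for t in range(max_lag + 1):
            past = series[[j - t for j in steps]]  # entries t steps back
            A_it = past.mean()                     # mean activation A_i(t)
            E[i, t] = np.sqrt(((past - A_it) ** 2).sum())
    return E.mean(axis=0)   # map-level error E(t), averaged over neurons
```

A map with temporal specialization keeps the returned curve small for t > 0; a context-free SOM does not.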
4.2 Temporal models
After the training of a recursive map, it can be used to obtain an explicit, possibly
approximative description of the underlying global temporal dynamics. This offers
another possibility to evaluate the dynamics of SOM because we can compare the
extracted temporal model to the original one, if available, or a temporal modelextracted directly from the data. In addition, a compressed description of the global
dynamics extracted from a trained SOM is interesting for data mining tasks. In
particular, it can be tested whether clustering properties of SOM, referred to by
U-matrix methods, transfer to the temporal domain.
Markov models constitute simple, though powerful, techniques for sequence processing and analysis [6,32]. Assume that Σ = {a1, . . . , ad} is a finite alphabet. The prediction of the next symbol refers to the task of anticipating the probability of ai after having observed a sequence s = (s1, . . . , st) ∈ Σ* before. This is just the conditional probability P(ai | s). For finite Markov models, a finite memory length l is sufficient to determine this probability, i.e. the probability

  P(ai | (s1, . . . , sl, . . . , st)) = P(ai | (s1, . . . , sl)) ,   (t ≥ l)

depends only on the past l symbols instead of the whole context (s1, . . . , st). Markov models can be estimated from given data if the order l is fixed. It holds that

  P(ai | (s1, . . . , sl)) = P((ai, s1, . . . , sl)) / Σ_j P((aj, s1, . . . , sl))     (1)

which means that the next symbol probability can be estimated from the frequencies of (l + 1)-grams.
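Equation (1) amounts to counting (l + 1)-gram frequencies; a minimal sketch (written here with forward indexing, the context being the l symbols preceding the predicted one):

```python
from collections import Counter

def markov_transitions(seq, l):
    """Estimate P(symbol | preceding l symbols) from (l+1)-gram counts."""
    counts = Counter(tuple(seq[i:i + l + 1]) for i in range(len(seq) - l))
    probs = {}
    for gram, c in counts.items():
        context, sym = gram[:l], gram[l]
        # normalize by all (l+1)-grams sharing the same context
        total = sum(n for g, n in counts.items() if g[:l] == context)
        probs[(context, sym)] = c / total
    return probs
```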
We are interested in the question whether a trained SOM-S can capture the essential probabilities for predicting the next symbol, generated by simple Markov models. For this purpose, we train maps on Markov models and afterwards extract the transition probabilities entirely from the obtained maps. This extraction can be done because of the specific form of context in SOM-S. Given a finite alphabet Σ = {a1, . . . , ad} for training, most neurons specialize during training and become winner for at least one or some stimuli. Winner neurons represent the input sequence entries by their trained weight vectors. Usually, the weight wi of neuron ni is very close to a symbol aj of Σ and can thus be identified with that symbol. In addition, the neurons represent their context by an explicit reference to the location of the winner in the previous time step. The context vectors stored in the neurons define an intermediate winning position in the map, encoded by the parameters (N1, N2, N3, λ12, λ13) for the closest three neurons and the exact position within the triangle. We take this into account by extracting sequences corresponding to the averaged weights of all three potential winners of the previous time step. For the averaging, the contribution of each neuron to the interpolated position is considered. Repeating this back-referencing procedure recursively for each winner, weighted by its influence, yields an exponentially spreading number of potentially infinite time series for each neuron. This way, we obtain a probability distribution over time series that is representative for the history of each map neuron. 5
5 Interestingly, one can formally prove that every finite length Markov model can in principle be approximated by some map in this way, i.e. for every Markov model of length l a map exists such that the above extraction procedure yields the original model up to small deviations. Assume a fixed length l and rational probabilities P(ai | (s1, . . . , sl)), and denote by q the smallest common denominator of the transition probabilities. Consider a map in which for
The number of specialized neurons for each time series is correlated to the probability of these stimuli in the original data source. Therefore, we can simply take the mean of the probabilities for all neurons and obtain a global distribution over all histories which are represented in the map. Since the standard SOM has a magnification factor different from 1, the number of neurons which represent a symbol ai deviates from the probability of ai in the given data [31]. This leads to a slightly biased estimation of the sequence probabilities represented by the map. Nevertheless, we will use the above extraction procedure as a sufficiently close approximation to the true underlying distribution. This compromise is taken because the magnification factor for recurrent SOMs is not known and the techniques from [31] for its computation cannot be transferred to recurrent models. Our experiments confirm that the global trend is still correct. We have extracted, for every finite memory length l, the probability distribution for words in Σ^{l+1} as they are represented in the map and determined the transition probabilities of equation 1.
The method as described above is a valuable tool to evaluate the representation
capacity of SOM for temporal structures. Obviously, fixed order Markov models
can be better extracted directly from the given data, avoiding problems such as the
magnification factor of SOM. Hence, this method just serves as an alternative for
the evaluation of temporal self-organizing maps and their capability of representingtemporal dynamics. The situation is different if real-valued elements are processed,
like in the case of obtaining symbolic structure from noisy sequences. Then, a rea-
sonable quantization of the sequence entries must be found before a Markov model
can be extracted from the data. The standard SOM together with U-matrix methods
provides a valuable tool to find meaningful clusters in a given set of continuous
data. It is an interesting question whether this property transfers to the temporal
domain, i.e. whether meaningful clusters of real-valued sequence entries can also
be extracted from a trained recursive model. SOM-S allows us to combine both reliable quantization of the sequence entries and the extraction mechanism for Markov models in order to take into account the temporal structure of the data.
For the extraction we extend U-Matrix methods [38] to recursive models as follows: the standard U-Matrix assigns to each neuron the accumulated distance of its weight vector to those of its direct lattice neighbors:

  U(ni) = Σ_{nhd(ni,nj)=1} ‖ wi − wj ‖ .
each symbol ai a cluster of neurons with weights wj = ai exists. These main clusters are divided into subclusters enumerated by s = (s1, . . . , sl) ∈ Σ^l, with q · P(ai|s) neurons for each possible s. The context of each such neuron refers to another neuron within a cluster belonging to s1 and to a subcluster belonging to (s2, . . . , sl, sl+1) for some arbitrary sl+1. Note that the clusters can thereby be chosen contiguously on the map, respecting the topological ordering of the neurons. The extraction mechanism leads to the original Markov model (with rational probabilities) based on this map.
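A direct implementation of this U-value, using the summed neighbor distances of the formula above (`adjacency` is an assumed neighbor-list representation of the lattice, not a structure from the paper):

```python
import numpy as np

def u_matrix(weights, adjacency):
    """U-value per neuron: summed weight distance to its direct lattice
    neighbors; adjacency[i] lists the neighbor indices of neuron i."""
    return np.array([
        sum(np.linalg.norm(weights[i] - weights[j]) for j in adjacency[i])
        for i in range(len(weights))
    ])
```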
In a trained map, neurons spread in regions of the data space where a high sample
density can be observed, resulting in large U-values at borders between clusters.
Consequently, the U-Matrix forms a 3D landscape on the lattice of neurons with
valleys corresponding to meaningful clusters and hills at the cluster borders. The
U-Matrix of weight vectors can be constructed also for SOM-S. Based on this ma-
trix, the sequence entries can be clustered into meaningful categories, based on
which the extraction of Markov models as described above is possible. Note that
the U-Matrix is built by using the weights assigned to the neurons only, while the
context information of SOM-S is yet ignored. 6 However, since context informa-
tion is used for training, clusters emerge which are meaningful with respect to the
temporal structure, and this way they contribute implicitly to the topological order-
ing of the map and to the U-Matrix. Partially overlapping, noisy, and ambiguous
input elements are separated during the training, because the different temporal
contexts contain enough information to activate and produce characteristic clusters
on the map. Thus, the temporal structure captured by the training allows a reliable
reconstruction of the input sequences, which could not have been achieved by the
standard SOM architecture.
5 Experiments
5.1 Mackey-Glass time series
The first task is to learn the dynamics of the real-valued chaotic Mackey-Glass time series

  dx/dτ = b · x(τ) + a · x(τ − d) / (1 + x(τ − d)^10)

using a = 0.2, b = −0.1, d = 17. This is the same setup as given in [41], making a comparison of the results possible. 7 Three types of maps with 100 neurons have been trained: a 6-neighbor map without context, giving the standard SOM, a map with 6 neighbors and with context (SOM-S), and a 7-neighbor map providing a hyperbolic grid with context utilization (H-SOM-S). Each run has been computed with 1.5 · 10^5 presentations starting at random positions within the Mackey-Glass series using a sample period of Δt = 3; the neuron weights have been initialized with white noise within [0.6, 1.4]. The context has been taken into account by decreasing the parameter from α = 1 to α = 0.97. The learning rate is exponentially decreased from 0.1 to 0.005 for weight and context update. The initial neighborhood cooperativity is 10, which is annealed to 1 during training.
Figure 2 shows the temporal quantization error for the above setups: the temporal
quantization error is expressed by the average standard deviation of the given se-
quence and the mean unit receptive field for 29 time steps into the past. Similar
6 Preliminary experiments indicate that the context also orders topologically and yields meaningful clusters. The number of neurons in context clusters is thereby small compared to the total number of neurons, and statistically significant results could not be obtained.
7 We would like to thank T. Voegtlin for providing data for comparison.
to Voegtlin's results, we observe large cyclic oscillations for the standard SOM, driven by the periodicity of the training series. Since SOM does not take contextual information into account, this quantization result can be seen as an upper bound for temporal models, at least for the indices t > 0 reaching into the past (trivially, SOM is a very good quantizer of scalar elements without history); the oscillating shape of the curve is explained by the continuity of the series and its quasi-periodic dynamics, and the extrema are due to the nature of the series rather than to special model properties. Obviously, the very restricted context of RSOM does not yield a long term improvement of the temporal quantization error. However, the displayed error periodicity is anti-cyclic compared to the original series. Interestingly, the data-optimum topology of neural gas (NG), which also does not take contextual information into account, allows a reduction of the overall quantization error; however, the main characteristics, such as the periodicity, remain the same as for the standard SOM. RecSOM leads to a much better quantization error than RSOM and also NG. Thereby, the error is minimal for the immediate past (left side of the diagram) and increases going further back in time, which is reasonable because of the weighting of the context influence by (1 − α). The increase of the quantization error is smooth, and the final value after 29 time steps is better than the default given by the standard SOM. In addition, almost no periodicity can be observed for RecSOM. SOM-S and H-SOM-S further improve the results: only some periodicity can be observed, and the overall quantization error increases smoothly for the past values. Note that these models are superior to RecSOM in this task while requiring less computational power. H-SOM-S allows a slightly better representation of the immediate past compared to SOM-S due to the hyperbolic topology of the lattice structure, which better matches the characteristics of the input data.
[Figure 2 plots the quantization error (0 to 0.2) against the index of past inputs (0 to 30, index 0 being the present) for SOM, RSOM, NG, RecSOM, H-SOM-S, and SOM-S.]
Fig. 2. Temporal quantization errors of different model setups for the Mackey-Glass series.
Results indicated by * are taken from [41].
5.2 Binary automata
The second experiment is also inspired by Voegtlin. A discrete 0/1-sequence generated by a binary automaton with P(0|1) = 0.4 and P(1|0) = 0.3 shall be learned. For discrete data, the specialization of a neuron can be defined as the longest sequence that still leads to unambiguous winner selection. A high percentage of specialized neurons indicates that temporal context has been learned by the map. In addition, one can compare the distribution of specializations with the original distribution of strings as generated by the underlying probability. Figure 3 shows the specialization of a trained H-SOM-S. Training has been carried out with 3 · 10^6 presentations, increasing the context influence (1 − α) exponentially from 0 to 0.06. The remaining parameters have been chosen as in the first experiment. Finally, the receptive fields have been computed by providing an additional number of 10^6 test iterations. Putting more emphasis on the context results in a smaller number of active neurons representing rather long strings that cover only a small part of the total input space. If a Euclidean lattice is used instead of a hyperbolic neighborhood, the resulting quantizers differ only slightly, which indicates that the representation of binary symbols and their contexts in the 2-dimensional output space barely benefits from exponential branching. In the depicted run, 64 of the neurons express a clear profile, whereas the other neurons are located at sparse locations of the input data topology, between cluster boundaries, and thus do not win for the presented stimuli. The distribution corresponds nicely to the 100 most characteristic sequences of the probabilistic automaton as indicated by the graph.
Unlike for RecSOM (presented in [41]), neurons at interior nodes of the tree are also expressed for H-SOM-S. These nodes refer to transient states, which are represented by corresponding winners in the network. RecSOM, in contrast to SOM-S, does not rely on the winner index only, but uses a more complex representation: since the transient states are spared, longer sequences can be expressed by RecSOM. In addition to the examination of neuron specialization, the whole map
Fig. 3. Receptive fields of a H-SOM-S compared to the most probable sub-sequences of the
binary automaton. Left hand branches denote 0, right is 1.
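The binary automaton above is fully specified by its two transition probabilities, so a training sequence can be sampled directly (a minimal sketch; names are illustrative):

```python
import random

def sample_binary_automaton(n, p01=0.4, p10=0.3, seed=0):
    """Sample a 0/1 sequence with P(0|1) = p01 and P(1|0) = p10."""
    rng = random.Random(seed)
    state, out = 0, []
    for _ in range(n):
        out.append(state)
        if state == 0:
            state = 1 if rng.random() < p10 else 0
        else:
            state = 0 if rng.random() < p01 else 1
    return out
```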
Type P(0) P(1) P(0|0) P(1|0) P(0|1) P(1|1)
Automaton 1 4/7 ≈ 0.571 3/7 ≈ 0.429 0.7 0.3 0.4 0.6

Map (98/100) 0.571 0.429 0.732 0.268 0.366 0.634

Automaton 2 2/7 ≈ 0.286 5/7 ≈ 0.714 0.8 0.2 0.08 0.92
Map (138/141) 0.297 0.703 0.75 0.25 0.12 0.88
Automaton 3 0.5 0.5 0.5 0.5 0.5 0.5
Map (138/141) 0.507 0.493 0.508 0.492 0.529 0.471
Table 1
Results for binary automata extraction with different transition probabilities. The extracted
probabilities clearly follow the original ones.
representation can be characterized by comparing the input symbol transition statistics with the learned context-neuron relations. While the current symbol is coded by the winning neuron's weight, the previous symbol is represented by the average of the weights of the winner's context triangle neurons. The two obtained values, the neuron's state and the average state of the neuron's context, are clearly expressed in the trained map: only few neurons contain values in an indeterminate interval [1/3, 2/3]; most neurons specialize on values very close to 0 or 1. Results for the reconstruction of three automata can be found in table 1. For the reconstruction we have used the algorithm described in section 4.2 with memory length 1. The left column indicates the number of expressed neurons and the total number of neurons in the map. Note that the automata can be well reobtained from the trained maps. Again, the temporal dependencies are clearly captured by the maps.
5.3 Reber grammar
In a third experiment we have used more structured symbolic sequences as generated by the Reber grammar illustrated in figure 4. The 7 symbols have been coded in a 6-dimensional Euclidean space by points that form a regular simplex, analogous to a tetrahedron with its four corners in three dimensions: all points have the same distance
Fig. 4. State graph of the Reber grammar.
from each other. For training and testing we have taken the concatenation of randomly generated words, thus preparing sequences of 3 · 10^6 and 10^6 input vectors, respectively. The map has a radius of 5 and contains 617 neurons on a hyperbolic grid. For the initialization and the training, the same parameters as in the previous experiment were used, except for an initially larger neighborhood range of 14, corresponding to the larger map. Context influence was taken into account by decreasing α from 1 to 0.8 during training. A number of 338 neurons developed a specialization for Reber strings with an average length of 7.23 characters. Figure 5 shows that the neuron specializations produce strict clusters on the circular grid,
ordered in a topological way by the last character. In agreement with the grammar,
the letter T takes the largest sector on the map. The underlying hyperbolic lattice
gives rise to sectors, because they clearly minimize the boundary between the 7
classes. The symbol separation is further emphasized by the existence of idle neurons between the boundaries, which can be seen analogously to large values in a U-Matrix. Since neuron specialization proceeds from the most common states, which are the 7 root symbols, to the increasingly special cases, the central nodes have fallen idle after having served as signposts during training; finally, the most specialized nodes with their associated strings are found at the lattice edge on the outer ring. Much in contrast to the hyperbolic target lattice ordered in this way, the result for the Euclidean grid in figure 7 shows a neuron arrangement in the form of polymorphic coherent patches.
Similar to the binary automata learning tasks, we have analyzed the map representation by reconstructing the trained data, backtracking all possible context sequences of each neuron up to length 3. Only 118 of all 343 combinatorially possible trigrams are realized. In a ranked table, the most likely 33 strings cover all attainable Reber trigrams. In the log-probability plot of figure 6 there is a leap between entry number 33 (TSS, valid) and 34 (XSX, invalid), emphasizing the presence of the Reber characteristic. The correlation of the probabilities of Reber trigrams and their relative frequencies found in the map is 0.75. An explicit comparison of the probabilities of valid Reber strings can be found in figure 8. The values deviate from the true probabilities, in particular for cycles of the Reber graph, such as consecutive letters T and S, or the VPX-cycle. This effect is due to the magnification factor of the SOM being different from 1, which is further amplified when sequences are processed in the proposed recursive manner.
5.4 Finite memory models
In a final series of experiments, we examine a SOM-S trained on Markov models with noisy input sequence entries. We investigate the possibility of extracting temporal dependencies on real-valued sequences from a trained map. The Markov model possesses a memory length of 2 as depicted in figure 9. The basic symbols are denoted by a, b, and c. These are embedded in two dimensions, disrupted by noise, as
[Figure 5: specialized neurons with their associated Reber strings on the hyperbolic lattice. Figure 6: ranked log-probabilities of the extracted trigrams. The graphics and word labels are not reproduced here.]
Fig. 7. Arrangement of Reber words on a Euclidean lattice structure. The words are ar-
ranged according to their most recent symbols (shown on the right of the sequences).
Patches emerge according to the most recent symbol. Within the patches, an ordering ac-
cording to the preceding symbols can be observed.
Fig. 8. Frequency reconstruction of trigrams from the Reber grammar.
follows: a stands for (0, 0) + η, b for (1, 0) + η, and c for (0, 1) + η, η being independent Gaussian noise with standard deviation g, which is a variable to be tested in the experiments. The symbols are denoted right to left, i.e. ab indicates that the currently emitted symbol is a, after having observed symbol b in the previous step. Thus, b and c are always succeeded by a, whereas a is succeeded with probability x by b and (1 − x) by c, provided the past symbol was b, and vice versa if the last symbol was c. The transition probability x is varied between the experiments. We train a SOM-S with a regular rectangular two-dimensional lattice structure and 100 neurons on a generated Markov series. The context parameter was decreased from α = 0.97 to α = 0.93, the neighborhood radius was decreased from σ = 5 to σ = 0.5, and the learning rate was annealed from 0.02 to 0.005. A number of 1000 patterns are presented in 15000 cycles. The U-Matrix clustering has been calculated with a level of the landscape chosen such that half the neurons are contained in valleys. The neurons in the same valley are assigned to the same cluster, and the number of different clusters is determined. Afterwards, all remaining neurons are assigned to their closest cluster.
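The order-2 automaton and the noisy two-dimensional embedding described above can be sampled as follows (a sketch; names are illustrative):

```python
import numpy as np

def sample_noisy_markov(n, x=0.4, g=0.1, seed=0):
    """Order-2 automaton over {a, b, c}: b and c are always followed by
    a; after a, the next symbol is b with probability x if the symbol
    before a was b (with probability 1-x if it was c). Symbols are
    embedded in R^2 with Gaussian noise of standard deviation g."""
    rng = np.random.default_rng(seed)
    emb = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0)}
    prev2, prev = "b", "a"        # start in state ab
    syms = []
    for _ in range(n):
        if prev in ("b", "c"):
            nxt = "a"
        else:                      # prev == "a": depends on prev2
            p_b = x if prev2 == "b" else 1 - x
            nxt = "b" if rng.random() < p_b else "c"
        syms.append(nxt)
        prev2, prev = prev, nxt
    pts = np.array([emb[s] for s in syms]) + rng.normal(0.0, g, (n, 2))
    return syms, pts
```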
First, we choose a noise level of g = 0.1, such that almost no overlap can be observed, and we investigate this setup with different x between 0 and 0.8. In all the results, three distinct clusters, corresponding to the three symbols, are found with the U-Matrix method. The extraction of the order 2 Markov models indicates that the global transition probabilities are correctly represented in the maps. Table 2 shows the corresponding extracted probabilities. Thereby, the exact probabilities cannot be recovered because of the magnification factor of the SOM being different from 1. However, the global trend is clearly found and the extracted probabilities are in good agreement with the previously chosen values.

In a second experiment, the transition probability is fixed to x = 0.4, but the noise level is modified, choosing g between 0.1 and 0.5. All the training parameters are chosen as in the previous experiment. Note that a noise level of g = 0.3 already yields much overlap of the classes, as depicted in figure 10. Nevertheless, three clusters can be detected in all of the cases and the transition probabilities can be recovered, except for a noise level of 0.5, for which the training scenario degenerates to an almost deterministic case, making a the most dominant state. Table 3 summarizes the extracted probabilities.
Fig. 9. Markov automaton with 3 basic states and a finite order of 2 used to train the map.
Fig. 10. Symbols a, b, c embedded in R² as a = (0, 0) + η, b = (1, 0) + η, and c = (0, 1) + η, subject to noise with different variances: the noise levels are 0.1, 0.3, and 0.4. The latter two noise levels show considerable overlap of the classes which represent the symbols.
x 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
P(a|ab) 0 0.01 0 0.01 0 0.04 0 0.04 0.01
P(b|ab) 0 0.08 0.3 0.31 0.38 0.55 0.68 0.66 0.78
P(c|ab) 1 0.91 0.7 0.68 0.62 0.41 0.32 0.3 0.21
P(a|ac) 0 0 0 0 0 0.01 0.01 0 0.01
P(b|ac) 1 0.81 0.8 0.66 0.52 0.55 0.32 0.31 0.24
P(c|ac) 0 0.19 0.2 0.34 0.48 0.44 0.67 0.69 0.75
Table 2
Transition probabilities extracted from the trained map. The noise level was fixed to 0.1 and different generating transition probabilities x were used.
noise 0.1 0.2 0.3 0.4 0.5 true
P(a|ab) 0.01 0 0 0.1 0.98 0
P(b|ab) 0.42 0.49 0.4 0.24 0.02 0.4
P(c|ab) 0.57 0.51 0.6 0.66 0.02 0.6
P(a|ac) 0.01 0 0 0.09 0 0
P(b|ac) 0.59 0.6 0.44 0.39 0 0.6
P(c|ac) 0.4 0.4 0.56 0.52 0 0.4
Table 3
Probabilities extracted from the trained map with fixed input transition probabilities and different noise levels. For a noise level of 0.5, the extraction mechanism breaks down and the symbol a becomes dominant. For smaller noise levels, extraction of the symbols can still be performed even for overlapping clusters, because of the temporal differentiation of the clusters in recursive models.
6 Conclusions
We have presented a self organizing map with a neural back-reference to the pre-
viously active sites and with a flexible topological structure of the neuron grid. For
context representation, the compact and powerful SOMSD model as proposed in
[11] has been used. Compared to TKM and RSOM, much more flexibility and expressiveness is obtained, because the context is represented in the space spanned by the neurons and not only in the domain of the weight space. Compared to RecSOM, which is based on very extensive contexts, the SOMSD model is much more efficient. However, SOMSD requires an appropriate topological representation of the symbols, measuring distances of contexts in the grid space. We have therefore extended the map configuration to more general triangular lattices, thus also making hyperbolic models possible, as introduced in [30]. Our SOM-S approach has been evaluated on several data series including discrete and real-valued entries.
been evaluated on several data series including discrete and real-valued entries.
Two experimental setups have been taken from [41] to allow a direct comparison
with different models. As pointed out, the compact model introduced here improves
the capacity of simple leaky integrator networks like TKM and RSOM and shows
results competitive to the more complex RecSOM.
Since the context of SOM-S directly refers to the previous winner, temporal con-texts can be extracted from a trained map. An extraction scheme to obtain Markov
models of fixed order has been presented and its reliability has been confirmed in
three experiments. As demonstrated, this mechanism can be applied to real-valued
sequences, expanding U-Matrix methods to the recursive case.
So far, the topological structure of context formation has not been taken into account during the extraction. Context clusters, in addition to weight clusters, provide more information, which might be used for the determination of appropriate orders of the models, or for the extraction of more complex settings like hidden Markov models. We are currently investigating experiments aiming at these issues. However, preliminary results indicate that Hebbian training, as introduced in this article, allows the reliable extraction of finite-memory models only. More sophisticated training algorithms should be developed for more complex temporal dependencies.
Interestingly, the proposed context model can be interpreted as the development of long-range synaptic connections, leading to more specialized map regions. Statistical counterparts to unsupervised sequence processing, like the Generative Topographic Mapping Through Time (GTMTT) [5], incorporate similar ideas by describing temporal data dependencies through hidden Markov latent space models. Such a context affects the prior distribution on the space of neurons. Due to computational restrictions, the transition probabilities of GTMTT are usually limited to only local connections. Thus, long-range connections like those in the presented context model do not emerge; rather, visualizations similar (though more powerful) to TKM and RSOM arise. It would be interesting to develop more efficient statistical counterparts, which also allow the emergence of interpretable long-range connections such as those of the deterministic SOM-S.
References
[1] G. Barreto and A. Araujo. Time in self-organizing maps: An overview of models. International Journal of Computer Research, 10(2):139-179, 2001.

[2] G. de A. Barreto, A. F. R. Araujo, and S. C. Kremer. A taxonomy for spatiotemporal connectionist networks revisited: the unsupervised case. Neural Computation, 15(6):1255-1320, 2003.

[3] H.-U. Bauer and T. Villmann. Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks, 8(2):218-226, 1997.

[4] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10(1):215-235, 1998.

[5] C. M. Bishop, G. E. Hinton, and C. K. I. Williams. GTM through time. In Proceedings of the IEE Fifth International Conference on Artificial Neural Networks, pages 111-116, Cambridge, U.K., 1997.

[6] P. Bühlmann and A. J. Wyner. Variable length Markov chains. Annals of Statistics, 27:480-513, 1999.

[7] O. A. Carpinteiro. A hierarchical self-organizing map for sequence recognition. Neural Processing Letters, 9(3):209-220, 1999.

[8] G. Chappell and J. Taylor. The temporal Kohonen map. Neural Networks, 6:441-445, 1993.

[9] I. Farkas and R. Miikkulainen. Modeling the self-organization of directional selectivity in the primary visual cortex. In Proceedings of ICANN'99, pages 251-256, Edinburgh, Scotland, 1999.

[10] M. Hagenbuchner, A. C. Tsoi, and A. Sperduti. A supervised self-organising map for structured data. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organising Maps, pages 21-28. Springer, 2001.

[11] M. Hagenbuchner, A. Sperduti, and A. C. Tsoi. A self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks, 14(3):491-505, 2003.

[12] B. Hammer. On the learnability of recursive data. Mathematics of Control, Signals, and Systems, 12:62-79, 1999.

[13] B. Hammer, A. Micheli, and A. Sperduti. A general framework for unsupervised processing of structured data. In M. Verleysen, editor, European Symposium on Artificial Neural Networks 2002, pages 389-394. D Facto, 2002.

[14] B. Hammer, A. Micheli, M. Strickert, and A. Sperduti. A general framework for unsupervised processing of structured data. To appear in Neurocomputing.

[15] B. Hammer, A. Micheli, and A. Sperduti. A general framework for self-organizing structure processing neural networks. Technical Report TR-03-04, Università di Pisa, 2003.

[16] J. Joutsensalo and A. Miettinen. Self-organizing operator map for nonlinear dimension reduction. In Proceedings of ICNN'95, volume 1, pages 111-114. IEEE, 1995.

[17] J. Kangas. On the Analysis of Pattern Sequences by Self-Organizing Maps. PhD thesis, Helsinki University of Technology, Espoo, Finland, 1994.
[18] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM: self-organizing maps of document collections. Neurocomputing, 21(1):101-117, 1998.

[19] S. Kaski and J. Sinkkonen. A topography-preserving latent variable model with learning metrics. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organizing Maps, pages 224-229. Springer, 2001.
[20] T. Kohonen. The neural phonetic typewriter. Computer, 21(3):11-22, 1988.
[21] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 2001.
[22] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Recurrent SOM with local linear models in time series prediction. In M. Verleysen, editor, 6th European Symposium on Artificial Neural Networks, pages 167-172. De Facto, 1998.

[23] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Time series prediction using recurrent SOM with local linear models. International Journal of Knowledge-based Intelligent Engineering Systems, 2(1):60-68, 1998.

[24] S. C. Kremer. Spatio-temporal connectionist networks: A taxonomy and review. Neural Computation, 13(2):249-306, 2001.

[25] J. Laaksonen, M. Koskela, S. Laakso, and E. Oja. PicSOM: content-based image retrieval with self-organizing maps. Pattern Recognition Letters, 21(13-14):1199-1207, 2000.

[26] J. Lampinen and E. Oja. Self-organizing maps for spatial and temporal AR models. In M. Pietikäinen and J. Röning, editors, Proceedings of the 6th SCIA, pages 120-127, Helsinki, Finland, 1989.

[27] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1994.

[28] T. Martinetz, S. G. Berkovich, and K. J. Schulten. Neural-gas networks for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558-569, 1993.

[29] J. Ontrup and H. Ritter. Text categorization and semantic browsing with self-organizing maps on non-Euclidean spaces. In L. D. Raedt and A. Siebes, editors, Proceedings of PKDD-01, pages 338-349. Springer, 2001.

[30] H. Ritter. Self-organizing maps on non-Euclidean spaces. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 97-110. Elsevier, 1999.

[31] H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, 1992.

[32] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. Machine Learning, 25:117-150, 1996.

[33] J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217-239, 2002.

[34] P. Somervuo. Self-organizing maps for signal and symbol sequences. PhD thesis, Helsinki University of Technology, 2000.
[35] A. Sperduti. Neural networks for adaptive processing of structured data. In Proceedings of ICANN 2001, pages 5-12. Springer, 2001.

[36] M. Strickert, T. Bojer, and B. Hammer. Generalized relevance LVQ for time series. In Proceedings of ICANN 2001, pages 677-683. Springer, 2001.
[37] M. Strickert and B. Hammer. Neural gas for sequences. In Proceedings of WSOM'03, pages 53-57, 2003.

[38] A. Ultsch and C. Vetter. Self-organizing feature maps versus statistical clustering: A benchmark. Research Report No. 9, Department of Mathematics, University of Marburg, 1994.

[39] M. Varsta, J. del R. Millán, and J. Heikkonen. A recurrent self-organizing map for temporal sequence processing. In Proceedings of ICANN'97, pages 421-426. Springer, 1997.

[40] M. Varsta, J. Heikkonen, and J. Lampinen. Analytical comparison of the temporal Kohonen map and the recurrent self-organizing map. In M. Verleysen, editor, ESANN 2000, pages 273-280. De Facto, 2000.

[41] T. Voegtlin. Recursive self-organizing maps. Neural Networks, 15(8-9):979-991, 2002.