
    Unsupervised Recursive Sequence Processing

Marc Strickert, Barbara Hammer

Research group LNM, Department of Mathematics/Computer Science, University of Osnabrück, Germany

Sebastian Blohm

Institute for Cognitive Science, University of Osnabrück, Germany

    Abstract

The self organizing map (SOM) is a valuable tool for data visualization and data mining for potentially high dimensional data of an a priori fixed dimensionality. We investigate SOMs for sequences and propose the SOM-S architecture for sequential data. Sequences of potentially infinite length are recursively processed by integrating the currently presented item and the recent map activation, as proposed in [11]. We combine that approach with the hyperbolic neighborhood of Ritter [29], in order to account for the representation of possibly exponentially increasing sequence diversification over time. Discrete and real-valued sequences can be processed efficiently with this method, as we will show in experiments. Temporal dependencies can be reliably extracted from a trained SOM. U-Matrix methods, adapted to sequence processing SOMs, allow the detection of clusters also for real-valued sequence elements.

Key words: Self-organizing map, sequence processing, recurrent models, hyperbolic SOM, U-Matrix, Markov models

    1 Introduction

Unsupervised clustering by means of the self organizing map (SOM) was first proposed by Kohonen [21]. The SOM makes the exploration of high dimensional data possible and it allows the exploration of the topological data structure. By SOM training, the data space is mapped to a typically two dimensional Euclidean grid of neurons, preferably in a topology preserving manner.

Email address: {marc,hammer}@informatik.uni-osnabrueck.de (Marc Strickert, Barbara Hammer).

    Preprint submitted to Elsevier Science 23 January 2004


Prominent applications of the SOM are WEBSOM for the retrieval of text documents and PicSOM for the recovery and ordering of pictures [18,25]. Various alternatives and extensions to the standard SOM exist, such as statistical models, growing networks, alternative lattice structures, or adaptive metrics [3,4,19,27,28,30,33].

If temporal or spatial data are dealt with, like time series, language data, or DNA strings, sequences of potentially unrestricted length constitute a natural domain for data analysis and classification. Unfortunately, the temporal scope is unknown in most cases, and therefore fixed vector dimensions, as used for the standard SOM, cannot be applied. Several extensions of the SOM to sequences have been proposed; for instance, time-window techniques or the data representation by statistical features make a processing with standard methods possible [21,28]. Due to data selection or preprocessing, information might get lost; for this reason, a data-driven adaptation of the metric or the grid is strongly advisable [29,33,36]. The first widely used application of the SOM in sequence processing employed the temporal trajectory of the best matching units of a standard SOM in order to visualize speech signals and their variations [20]. This approach, however, does not operate on sequences as they are; rather, the SOM is used for reducing the dimensionality of single sequence entries and acts as a preprocessing mechanism in this way. Proposed alternatives substitute the standard Euclidean metric by similarity operators on sequences, incorporating autoregressive processes or time warping strategies [16,26,34]. These methods are very powerful, but a major problem is their computational cost.

A fundamental way for sequence processing is a recursive approach. Supervised recurrent networks constitute a well-established generalization of standard feedforward networks to time series; many successful applications for different sequence classification and regression tasks are known [12,24]. Recurrent unsupervised models have also been proposed: the temporal Kohonen map (TKM) and the recurrent SOM (RSOM) use the biologically plausible dynamics of leaky integrators [8,39], as they occur in organisms, and explain phenomena such as direction selectivity in the visual cortex [9]. Furthermore, the models have been applied with moderate success to learning tasks [22]. Better results have been achieved by integrating these models into more complex systems [7,17]. Recent, more powerful approaches are the recursive SOM (RecSOM) and the SOM for structured data (SOMSD) [10,41]. These are based on a richer and explicit representation of the temporal context: they use the activation profile of the entire map or the index of the most recent winner. As a result, their representation ability is superior to RSOM and TKM.

A proposal to put existing unsupervised recursive models into a taxonomy can be found in [1,2]. The latter article identifies the entity "time context" used by the models as one of the main branches of the given taxonomy [2]. Although more general, the models are still quite diverse, and the recent developments of [10,11,35] are not included in the taxonomy. An earlier, simple, and elegant general description of recurrent models with an explicit notion of context has been introduced in [13,14].


place by the update rule

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$$

whereby $\gamma \in (0, 1)$ is the learning rate. The function $h_\sigma$ describes the amount of neuron adaptation in the neighborhood of the winner: often the Gaussian bell function $h_\sigma(x) = \exp(-x^2/\sigma^2)$ is chosen, the shape of which is narrowed during training by decreasing $\sigma$ to ensure the neuron specialization. The function $\mathrm{nhd}(n_i, n_j)$, which measures the degree of neighborhood of the neurons $n_i$ and $n_j$ within the lattice, might be induced by the simple Euclidean distance between the neuron coordinates in a rectangular grid or by the shortest distance in a graph connecting the two neurons.
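As an illustration of this Hebbian update, the following is a minimal sketch (not the authors' code) of one adaptation step, assuming `weights` is an (N, n) array of neuron weights and `grid_dist` a precomputed (N, N) matrix of lattice distances; all names are hypothetical.

```python
import numpy as np

def som_update(weights, grid_dist, winner, s_i, gamma=0.1, sigma=2.0):
    """One Hebbian SOM step: pull every weight towards the input s_i,
    scaled by the Gaussian lattice neighborhood of the winner."""
    h = np.exp(-grid_dist[winner] ** 2 / sigma ** 2)        # h_sigma(nhd(n_j0, n_j))
    return weights + gamma * h[:, None] * (s_i - weights)   # Delta w_j = gamma * h * (s_i - w_j)
```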

Recursive models substitute the one-shot distance computation for a single entry $s_i$ by a recursive formula over all entries of a given sequence $s$. For all models, sequences are presented recursively, and the current sequence entry $s_i$ is processed in the context which is set by its predecessors $s_{i+1}, s_{i+2}, \ldots$² The models differ with respect to the representation of the context and in the way that the context influences further computation.

The Temporal Kohonen Map (TKM) computes the distance of $s = (s_1, \ldots, s_t)$ from neuron $n_j$ labeled with $w_j \in \mathbb{R}^n$ by the leaky integration

$$d_{\mathrm{TKM}}(s, n_j) = \sum_{i=1}^{t} (1-\alpha)^{i-1} \, \|s_i - w_j\|^2$$

where $\alpha \in (0, 1)$ is a memory parameter [8]. A neuron becomes winner if the current entry $s_1$ is close to its weight $w_j$ as in the standard SOM, and, in addition, the remaining sum $(1-\alpha)\|s_2 - w_j\|^2 + (1-\alpha)^2\|s_3 - w_j\|^2 + \ldots$ is also small. This additional term integrates the distances of the neuron's weight from previous sequence entries, weighted by an exponentially decreasing decay factor $(1-\alpha)^{i-1}$. The context resulting from previous sequence entries points towards neurons whose weights have been close to previous entries. Thus, the winner is a neuron whose weight is close to the average presented signal for the recent time steps.

The training for the TKM takes place by Hebbian learning in the same way as for the standard SOM, making well-matching neurons more similar to the current input than bad-matching neurons. At the beginning, weights $w_j$ are initialized randomly and then iteratively adapted when data is presented.

² We use reverse indexing of the sequence entries, $s_1$ denoting the most recent entry, $s_2, s_3, \ldots$ its predecessors.


For adaptation, assume that a sequence $s$ is given, with $s_i$ denoting the current entry and $n_{j_0}$ denoting the best matching neuron for this time step. Then the weight correction term is

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j).$$

As discussed in [23], the learning rule of TKM is unstable and leads to only suboptimal results. The more advanced leaky integration of the Recurrent SOM (RSOM) first sums up the weighted directions and afterwards computes the distance [39]:

$$d_{\mathrm{RSOM}}(s, n_j) = \Big\| \sum_{i=1}^{t} (1-\alpha)^{i-1} (s_i - w_j) \Big\|^2 .$$

It represents the context in a larger space than TKM, since the vectors of directions are stored instead of the scalar Euclidean distance. More importantly, the training rule is changed. RSOM derives its learning rule directly from the objective to minimize the distortion error on sequences and thus adapts the weights towards the vector of integrated directions:

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot y_j \qquad \text{whereby} \qquad y_j = \sum_{i=1}^{t} (1-\alpha)^{i-1} (s_i - w_j) .$$

Again, the already processed part of the sequence produces a notion of context, and the neuron whose weight is most similar to the average entry for the past time steps becomes the winner for the current entry. The training rule of RSOM takes this fact into account by adapting the weights towards this averaged activation. We will not refer to this learning rule in the following. Instead, the way in which sequences are represented within these two models, and the ways to improve the representational capabilities of such maps, will be of interest.
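For concreteness, the two leaky-integrated distances above can be written down in a few lines; this is a minimal sketch under the reverse-indexing convention (seq[0] is the most recent entry), with hypothetical names and the memory parameter α as reconstructed above.

```python
import numpy as np

def d_tkm(seq, w, alpha=0.5):
    """TKM: sum of exponentially decayed squared distances to the weight w."""
    decay = (1 - alpha) ** np.arange(len(seq))
    return np.sum(decay * np.sum((seq - w) ** 2, axis=1))

def d_rsom(seq, w, alpha=0.5):
    """RSOM: the decayed difference vectors are summed first,
    the squared norm is taken afterwards."""
    decay = (1 - alpha) ** np.arange(len(seq))
    return np.sum(np.sum(decay[:, None] * (seq - w), axis=0) ** 2)
```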

Assuming vanishing neighborhood influence, one can analytically compute the internal representation of sequences for these two models, TKM and RSOM, i.e. the weight with optimum response to a given sequence $s = (s_1, \ldots, s_t)$: the weight $w$ is optimum for which

$$w = \sum_{i=1}^{t} (1-\alpha)^{i-1} s_i \Big/ \sum_{i=1}^{t} (1-\alpha)^{i-1}$$

holds [40]. This explains the encoding scheme of the winner-takes-all dynamics of TKM and RSOM. Sequences are encoded in the weight space by providing a recursive partitioning very much like the one generating fractal Cantor sets.


As an example for explaining this encoding scheme, assume that binary sequences $\{0,1\}^l$ are dealt with. For $\alpha = 0.5$, the representation of sequences of fixed length $l$ corresponds to an encoding in a Cantor set: the interval $[0, 0.5)$ represents sequences with most recent entry $s_1 = 0$, while the interval $[0.5, 1)$ contains only codes of sequences with most recent entry 1. Recursive decomposition of the intervals allows to recover further entries of the sequence: $[0, 0.25)$ stands for the beginning 00... of a sequence, $[0.25, 0.5)$ stands for 01, $[0.5, 0.75)$ for 10, and $[0.75, 1)$ represents 11. By further subdivision, $[0, 0.125)$ stands for the beginning 000..., $[0.125, 0.25)$ for 001, and so on. Similar encodings can be found for alternative choices of $\alpha$. Sequences over discrete sets $\Sigma = \{0, \ldots, d\} \subset \mathbb{R}$ can be uniquely encoded using this fractal partitioning if $\alpha < 1/d$. For larger $\alpha$, the subsets start to overlap, i.e. codes are no longer sorted according to their last symbols, and a code might stand for two or more different sequences. A very small $\alpha \ll 1/d$, in turn, results in an only sparsely used space; for example the interval $(d\alpha, 1]$ does not contain a valid code. Note that the explicit computation of this encoding stresses the superiority of the RSOM learning rule compared to the TKM update, as pointed out in [40]: the fractal code is a fixed point of the dynamics of RSOM training, whereas TKM converges towards the borders of the intervals, preventing the optimum fractal encoding scheme from developing on its own.
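The fixed-point formula above directly yields this fractal code; the following minimal sketch (hypothetical helper, $\alpha = 0.5$, reverse indexing) illustrates how binary sequences fall into the Cantor-like intervals described in the text.

```python
def fractal_code(seq, alpha=0.5):
    """Normalized leaky average that TKM/RSOM weights converge to;
    seq[0] is the most recent entry."""
    num = sum((1 - alpha) ** i * s for i, s in enumerate(seq))
    den = sum((1 - alpha) ** i for i in range(len(seq)))
    return num / den

# 0.0 lies in [0, 0.25) ("00..."), 0.43 in [0.25, 0.5) ("01..."), 0.57 in [0.5, 0.75) ("10...")
print(fractal_code([0, 0, 0]), fractal_code([0, 1, 1]), fractal_code([1, 0, 0]))
```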

Fractal encoding is reasonable, but limited: it is obviously restricted to discrete sequence entries, and real values or noise might destroy the encoded information. Fractal codes do not differentiate between sequences of different length; e.g. the code 0 gives optimum response to 0, 00, 000, and so forth. Sequences with this kind of encoding cannot be distinguished. In addition, the number of neurons does not influence the expressiveness of the context space. The range in which sequences are encoded is the same as the weight space. Thus, both the size of the weight space and the computation accuracy limit the number of different contexts, independently of the number of neurons of the network.

Based on these considerations, richer and in particular explicit representations of context have been proposed. The models that we introduce in the following extend the parameter space of each neuron $j$ by an additional vector $c_j$, which is used to explicitly store the sequential context within which a sequence entry is expected. Depending on the model, the context $c_j$ is contained in a representation space with different dimensionality. However, in all cases this space is independent of the weight space and extends the expressiveness of the models in comparison to TKM and RSOM. For each model, we will define the basic ingredients: what is the space of context representations? How is the distance between a sequence entry and neuron $j$ computed, taking into account its temporal context $c_j$? How are the weights and contexts adapted?

The Recursive SOM (RecSOM) [41] equips each neuron $n_j$ with a weight $w_j \in \mathbb{R}^n$ that represents the given sequence entry, as usual. In addition, a vector $c_j \in \mathbb{R}^N$ is provided, $N$ denoting the number of neurons, which explicitly represents the contextual map activation of all neurons in the previous time step.


Thus, the temporal context is represented in this model in an $N$-dimensional vector space. One can think of the context as an explicit storage of the activity profile of the whole map in the previous time step. More precisely, the distance is recursively computed by

$$d_{\mathrm{RecSOM}}((s_1, \ldots, s_t), n_j) = \alpha_1 \, \|s_1 - w_j\|^2 + \alpha_2 \, \|C_{\mathrm{RecSOM}}(s_2, \ldots, s_t) - c_j\|^2$$

where $\alpha_1, \alpha_2 > 0$ and

$$C_{\mathrm{RecSOM}}(s) = \big(\exp(-d_{\mathrm{RecSOM}}(s, n_1)), \ldots, \exp(-d_{\mathrm{RecSOM}}(s, n_N))\big)$$

constitutes the context. Note that this vector is almost the vector of distances of all neurons computed in the previous time step. These are exponentially transformed to avoid an explosion of the values. As before, the above distance can be decomposed into two parts: the winner computation similar to the standard SOM, and, as in the case of RSOM and TKM, a term which assesses the context match. For RecSOM the context match is a comparison of the current context when processing the sequence, i.e. the vector of distances of the previous time step, and the expected context $c_j$ which is stored at neuron $j$. That is to say, RecSOM explicitly stores context vectors for each neuron and compares these context vectors to their expected contexts during the recursive computation. Since the entire map activation is taken into account, sequences of any given fixed length can be stored, if enough neurons are provided. Thus, the representation space for context is no longer restricted by the weight space, and its capacity now scales with the number of neurons.

For RecSOM, training is done in Hebbian style for both weights and contexts. Denote by $n_{j_0}$ the winner for sequence entry $s_i$; then the weight changes are

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$$

and the context adaptation is

$$\Delta c_j = \gamma' \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (C_{\mathrm{RecSOM}}(s_{i+1}, \ldots, s_t) - c_j).$$

The latter update rule makes sure that the context vectors of the winner neuron and its neighborhood become more similar to the current context vector $C_{\mathrm{RecSOM}}$, which is computed when the sequence is processed. The learning rates are $\gamma, \gamma' \in (0, 1)$. As demonstrated in [41], this richer representation of context allows a better quantization of time series data. In [41], various quantitative measures to evaluate trained recursive maps are proposed, such as the temporal quantization error and the specialization of neurons. RecSOM turns out to be clearly superior to TKM and RSOM with respect to these measures in the experiments provided in [41].
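To make the recursion concrete, here is a minimal sketch (hypothetical names, not the original implementation) of how $d_{\mathrm{RecSOM}}$ and the context vector $C_{\mathrm{RecSOM}}$ can be computed for every neuron by running through the sequence from the oldest to the most recent entry; `a1` and `a2` stand for the weights $\alpha_1, \alpha_2$ as reconstructed above.

```python
import numpy as np

def recsom_distances(seq, W, C, a1=1.0, a2=1.0):
    """Return d_RecSOM of every neuron for the sequence seq, where seq[0] is
    the most recent entry (reverse indexing). W: (N, n) weights, C: (N, N) contexts."""
    ctx = np.zeros(W.shape[0])                    # activation profile of the previous step
    for s in seq[::-1]:                           # oldest entry first
        d = a1 * np.sum((W - s) ** 2, axis=1) + a2 * np.sum((C - ctx) ** 2, axis=1)
        ctx = np.exp(-d)                          # exponentially transformed activations
    return d
```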


However, the dimensionality of the context for RecSOM equals the number of neurons $N$, making this approach computationally quite costly. The training of very large maps with several thousands of neurons is no longer feasible for RecSOM. Another drawback is given by the exponential activity transfer function in the term $C_{\mathrm{RecSOM}} \in \mathbb{R}^N$: specialized neurons are characterized by the fact that they have only one or a few well-matching predecessors contributing values of about 1 to $C_{\mathrm{RecSOM}}$; however, for a large number $N$ of neurons, the noise influence on $C_{\mathrm{RecSOM}}$ from other neurons destroys the valid context information, because even poorly matching neurons contributing values of slightly above 0 are summed up in the distance computation.

SOM for structured data (SOMSD), as proposed in [10,11], is an efficient and still powerful alternative. SOMSD represents the temporal context by the corresponding winner index in the previous time step. Assume that a regular $l$-dimensional lattice of neurons is given. Each neuron $n_j$ is equipped with a weight $w_j \in \mathbb{R}^n$ and a value $c_j \in \mathbb{R}^l$ which represents a compressed version of the context, the location of the previous winner within the map [10]. The space in which context vectors are represented is the vector space $\mathbb{R}^l$ for this model. The distance of a sequence $s = (s_1, \ldots, s_t)$ from neuron $n_j$ is recursively computed by

$$d_{\mathrm{SOMSD}}((s_1, \ldots, s_t), n_j) = \alpha_1 \, \|s_1 - w_j\|^2 + \alpha_2 \, \|C_{\mathrm{SOMSD}}(s_2, \ldots, s_t) - c_j\|^2$$

where $C_{\mathrm{SOMSD}}(s)$ equals the location in the grid topology of the neuron $n_j$ with smallest $d_{\mathrm{SOMSD}}(s, n_j)$. Note that the context $C_{\mathrm{SOMSD}}$ is an element of a low-dimensional vector space, usually only $\mathbb{R}^2$. The distance between contexts is given by the Euclidean metric within this vector space. The learning dynamic of SOMSD is very similar to the dynamic of RecSOM: the current distance is defined as a mixture of two terms, the match of the neuron's weight and the current sequence entry, and the match of the neuron's context weight and the context currently computed in the model. Thereby, the current context is represented by the location of the winning neuron of the map in the previous time step. This dynamic imposes a temporal bias towards those neurons whose context vector matches the winner location of the previous time step. It relies on the fact that a lattice structure of neurons is defined and that a distance measure for locations within the map is available.

Due to the compressed context information, this approach is very efficient in comparison to RecSOM, and also very large maps can be trained. In addition, noise is suppressed in this compact representation. However, more complex context information is used than for TKM and RSOM, namely the location of the previous winner in the map. As for RecSOM, Hebbian learning takes place for SOMSD, because weight vectors and contexts are adapted in the well-known correction manner, here by the formulas

$$\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$$


and

$$\Delta c_j = \gamma' \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (C_{\mathrm{SOMSD}}(s_{i+1}, \ldots, s_t) - c_j)$$

with learning rates $\gamma, \gamma' \in (0, 1)$; $n_{j_0}$ denotes the winner for sequence entry $s_i$. As demonstrated in [11], a generalization of this approach to tree structures can reliably model structured objects and their respective topological ordering.
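Compared with the RecSOM sketch above, only the form of the context changes: instead of the full activation profile, the previous winner's grid location is carried along. A minimal sketch under the same assumptions (`coords` holding the (N, 2) grid coordinates of the neurons, `C` the stored (N, 2) context locations):

```python
import numpy as np

def somsd_distances(seq, W, C, coords, a1=1.0, a2=1.0):
    """d_SOMSD of every neuron for seq (seq[0] most recent)."""
    ctx = coords.mean(axis=0)                     # neutral starting context
    for s in seq[::-1]:                           # oldest entry first
        d = a1 * np.sum((W - s) ** 2, axis=1) + a2 * np.sum((C - ctx) ** 2, axis=1)
        ctx = coords[np.argmin(d)]                # location of the current winner
    return d
```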

We would like to point out that, although these approaches seem different, they constitute instances of the same recursive computation scheme. As proved in [14], the underlying recursive update dynamics comply with

$$d((s_1, \ldots, s_t), n_j) = \alpha_1 \, \|s_1 - w_j\|^2 + \alpha_2 \, \|C(s_2, \ldots, s_t) - c_j\|^2$$

in all the cases; the specific similarity measures for weights and contexts are denoted by the generic expression $\|\cdot\|$. The approaches differ with respect to the concrete choice of the context $C$: TKM and RSOM refer to only the neuron itself and are therefore restricted to local fractal codes within the weight space; RecSOM uses the whole map activation, which is powerful but also expensive and subject to random neuron activations; SOMSD relies on compressed information, the location of the winner. Note that also standard supervised recurrent networks can be put into this generic dynamic framework by choosing the context as the output of the sigmoidal transfer function [14]. In addition, alternative compression schemes, such as a representation of the context by the winner content, are possible [37].

To summarize this section, essentially four different models have been proposed for processing temporal information. The models are characterized by the way in which context is taken into account within the map. The models are:

Standard SOM: no context representation; standard distance computation; standard competitive learning.

TKM and RSOM: no explicit context representation; the distance computation recursively refers to the distance of the previous time step; competitive learning for the weight, whereby (for RSOM) the averaged signal is used.

RecSOM: explicit context representation as the N-dimensional activity profile of the previous time step; the distance computation is given as a mixture of the current match and the match between the context stored at the neuron and the (recursively computed) current context given by the processed time series; competitive learning adapts the weight and context vectors.

SOMSD: explicit context representation as a low-dimensional vector, the location of the previously winning neuron in the map; the distance is computed recursively in the same way as for RecSOM, whereby a distance measure for locations in the map has to be provided; so far, the model is only available for standard rectangular Euclidean lattices; competitive learning adapts the weight and context vectors, whereby the context vectors are embedded in the Euclidean space.


In the following, we focus on the context representation by the winner index, as proposed in SOMSD. This scheme offers a compact and efficient context representation. However, it relies heavily on the neighborhood structure of the neurons, and faithful topological ordering is essential for appropriate processing. Since for sequential data, like for words in $\Sigma^*$, the number of possible strings is an exponential function of their length, a Euclidean target grid with its inherent power-law neighborhood growth is not suited for a topology preserving representation. The reason for this is that the storage of temporal data is related to the representation of trajectories on the neural grid. String processing means beginning at a node that represents the start symbol; then, how many nodes $n_s$ can in the ideal case uniquely be reached in a fixed number $s$ of steps? In grids with 6 neighbors per neuron, the triangular tessellation of the Euclidean plane leads to a hexagonal superstructure, inducing the surprising answer of $n_s = 6$ for any choice of $s > 0$. Providing 7 neighbors per neuron yields the exponential branching $n_s = 7 \cdot 2^{s-1}$ of paths.

In this respect, it is interesting to note that RecSOM can also be combined with alternative lattice structures; in [41] a comparison is presented of RecSOM with a standard rectangular topology and a data-optimum topology provided by neural gas (NG) [27,28]. The latter clearly leads to superior results. Unfortunately, it is not possible to combine the optimum topology of NG with SOMSD: for NG, no grid with straightforward neuron indexing exists. Therefore, context cannot be defined easily by referring back to the previous winner, because no similarity measure is available for indices of neurons within a grid topology.

Here, we extend SOMSD to grid structures with triangular grid connectivity in order to obtain a larger flexibility for the lattice design. Apart from the standard Euclidean plane, the sphere and the hyperbolic plane are alternative popular two-dimensional manifolds. They differ from the Euclidean plane with respect to their curvature: the Euclidean plane is flat, whereas the hyperbolic space has negative curvature, and the sphere is curved positively. By computing the Euler characteristics of all compact connected surfaces, it can be shown that only seven have non-negative curvature, implying that all but seven are locally isometric to the hyperbolic plane, which makes the study of hyperbolic spaces particularly interesting.³

The curvature has consequences on regular tessellations of the referred manifolds, as pointed out in [30]: the number of neighbors of a grid point in a regular tessellation of the Euclidean plane follows a power law, whereas the hyperbolic plane allows an exponential increase of the number of neighbors. The sphere yields compact lattices with vanishing neighborhoods, whereby a regular tessellation for which all vertices have the same number of neighbors is impossible (with the uninteresting exception of an approximation by one of the 5 Platonic solids). Since all these surfaces constitute two-dimensional manifolds, they can be approximated locally within a cell of the tessellation by a subset of the standard Euclidean plane without too much contortion.

³ For an excellent tool box and introduction to hyperbolic geometry see e.g. http://www.geom.uiuc.edu/docs/forum/hype/hype.html


A global isometric embedding, however, is not possible in general. Interestingly, for all such tessellations a data similarity measure is defined, and a possibly non-isometric visualization in the 2D plane can be achieved. While 6 neighbors per neuron lead to standard Euclidean triangular meshes, for a grid with 7 neighbors or more, the graph becomes part of the 2-dimensional hyperbolic plane. As already mentioned, exponential neighborhood growth is possible, and hence an adequate data representation can be expected for the visualization of domains with a high connectivity of the involved objects. The SOM with hyperbolic neighborhood (HSOM) has already proved well-suited for text representation, as demonstrated for a non-recursive model in [29].

    3 SOM for sequences (SOM-S)

In the following, we introduce the adaptation of SOMSD for sequences and the general triangular grid structure, the SOM for sequences (SOM-S). Standard SOMs operate on a rectangular neuron grid embedded in a real-valued vector space. More flexibility for the topological setup can be obtained by describing the grid in terms of a graph: neural connections are realized by assigning each neuron a set of direct neighbors. The distance of two neurons is given by the length of a shortest path within the lattice of neurons; each edge is assigned the unit length 1. The number of neighbors might vary (also within a single map). Less than 6 neighbors per neuron lead to a subsiding neighborhood, resulting in graphs with small numbers of nodes. Choosing more than 6 neighbors per neuron yields, as argued above, an exponential increase of the neighborhood size, which is convenient for representing sequences with potentially exponential context diversification.
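Since the lattice is fixed, these shortest-path distances can be computed once in advance; a minimal sketch (hypothetical adjacency-list representation `neighbors`, one breadth-first search per neuron):

```python
from collections import deque

def grid_distances(neighbors):
    """All-pairs shortest-path lengths on the neuron graph; neighbors[i] lists
    the direct lattice neighbors of neuron i, every edge has unit length."""
    N = len(neighbors)
    dist = [[None] * N for _ in range(N)]
    for start in range(N):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist
```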

Unlike the standard SOM or HSOM, we do not assume that a distance preserving embedding of the lattice into the two-dimensional plane, or into another globally parameterized two-dimensional manifold with global metric structure such as the hyperbolic plane, exists. Rather, we assume that the distance of neurons within the grid is computed directly on the neighborhood graph, which might be obtained by any non-overlapping triangulation of the topological two-dimensional plane.⁴ For our experiments, we have implemented a grid generator for a circular triangle meshing around a center neuron, which requires the desired number of neurons and the neighborhood degree n as parameters. Neurons at the lattice edge possess less than n neighbors, and if the chosen total number of neurons does not lead to filling up the outer neuron circle, neurons there are connected to others in a maximally symmetric way. Figure 1 shows a small map with 7 neighbors for the inner neurons, and a total of 29 neurons perfectly filling up the outer edge. For 7 neighbors, the exponential neighborhood increase can be observed, for which an embedding into the Euclidean plane is not possible without contortions; however, local projections in terms of a fish-eye magnification focus can be obtained (cf. [29]).

⁴ Since the lattice is fixed during training, these values have to be computed only once.


Fig. 1. Hyperbolic self-organizing map with context. Neuron n refers to the context given by the winner location in the map, indicated by the triangle of neurons N1, N2, and N3, and the precise coordinates α12, α13. If the previous winner has been D2, adaptation of the context along the dotted line takes place.


SOMSD adapts the location of the expected previous winner during training. For this purpose, we have to embed the triangular mesh structure into a continuous space. We achieve this by computing lattice distances beforehand, and then approximating the distance of points within a triangle-shaped map patch by the standard Euclidean distance. Thus, positions in the lattice are represented by three neuron indices, which identify the selected triangle of adjacent neurons, and two real numbers, which represent the position within the triangle. The recursive nature of the map is illustrated exemplarily in figure 1 for neuron $n$. This neuron $n$ is equipped with a weight $w \in \mathbb{R}^n$ and a context $c$ that is given by a location within the triangle of neurons N1, N2, and N3, expressing corner affinities by means of the linear combination parameters $\alpha_{12}$ and $\alpha_{13}$. The distance of a sequence $s$ from neuron $n$ is recursively computed by

$$d_{\text{SOM-S}}((s_1, \ldots, s_t), n) = \alpha \, \|s_1 - w\|^2 + (1 - \alpha) \, g(C_{\text{SOM-S}}(s_2, \ldots, s_t), c).$$

$C_{\text{SOM-S}}(s)$ is the index of the neuron $n_j$ in the grid with smallest distance $d_{\text{SOM-S}}(s, n_j)$. The function $g$ measures the grid distance of the triangular position $c = (N1, N2, N3, \alpha_{12}, \alpha_{13})$ to the winner as the shortest possible path in the mesh structure. Grid distances between neighboring neurons possess unit length, and the metric structure within the triangle N1, N2, N3 is approximated by the Euclidean metric. The range of $g$ is normalized by scaling with the inverse maximum grid distance. This mixture of hyperbolic grid distance and Euclidean distance is valid, because the hyperbolic space can locally be approximated by Euclidean space, which is exploited for computational convenience in both the distance calculation and the update.


Training is carried out by presenting a pattern $s = (s_1, \ldots, s_t)$, determining the winner $n_{j_0}$, and updating the weight and the context. Adaptation affects all neurons on the breadth-first search graph around the winning neuron according to their grid distances, in a Hebbian style. Hence, for the sequence entry $s_i$, weight $w_j$ is updated by $\Delta w_j = \gamma \cdot h_\sigma(\mathrm{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$. The learning rate $\gamma$ is typically exponentially decreased during training; as above, $h_\sigma(\mathrm{nhd}(n_{j_0}, n_j))$ describes the influence of the winner $n_{j_0}$ on the current neuron $n_j$ as a decreasing function of the grid distance. The context update is analogous: the current context, expressed in terms of neuron triangle corners and coordinates, is moved towards the previous winner along a shortest path. This adaptation yields positions on the grid only. Intermediate positions can be achieved by interpolation: if two neurons $N_i$ and $N_j$ exist in the triangle with the same distance, the midway is taken for the flat grids obtained by our grid generator. This explains why the update path for the current context towards D2, depicted as the dotted line in figure 1, is via D1. Since the grid distances are stored in a static matrix, a fast calculation of shortest path lengths is possible. The parameter $\alpha$ in the recursive distance calculation controls the balance between pattern and context influence; since initially nothing is known about the temporal structure, this parameter starts at 1, thus indicating the absence of context and resulting in the standard SOM. During training it is decreased to an application dependent value that mediates the balance between the externally presented pattern and the internally gained model of historic contexts.
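A much simplified sketch of one such training step is given below; it is not the authors' implementation: the within-triangle interpolation is ignored and the stored context of each neuron is reduced to a single neuron index, so the grid distance matrix from the earlier sketch can be reused directly. All names and the crude context update are assumptions.

```python
import numpy as np

def som_s_step(s_i, prev_winner, W, C_idx, grid_dist, alpha=0.97, gamma=0.01, sigma=2.0):
    """One simplified SOM-S step. W: (N, n) weights, C_idx: (N,) context neuron
    indices, grid_dist: (N, N) precomputed lattice distances, prev_winner: index
    of the winner of the previous time step."""
    g = grid_dist[C_idx, prev_winner] / grid_dist.max()          # normalized context mismatch
    d = alpha * np.sum((W - s_i) ** 2, axis=1) + (1 - alpha) * g
    winner = int(np.argmin(d))
    h = np.exp(-grid_dist[winner] ** 2 / sigma ** 2)             # lattice neighborhood of the winner
    W += gamma * h[:, None] * (s_i - W)                          # Hebbian weight update
    C_idx[h > 0.5] = prev_winner                                 # crude context update: snap strongly
    return winner                                                #   adapted neurons to the previous winner
```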

    Thus, we can combine the flexibility of general triangular and possibly hyperbolic

    lattice structures with the efficient context representation as proposed in [11].

    4 Evaluation measures of SOM

Popular methods to evaluate the standard SOM are the visual inspection, the identification of meaningful clusters, the quantization error, and measures for the topological ordering of the map. For recursive self-organizing maps, an additional dimension arises: the temporal dynamic stored in the context representations of the map.

    4.1 Temporal quantization error

Using ideas of Voegtlin [41], we introduce a method to assess the implicit representation of temporal dependencies in the map, and to evaluate to which extent a faithful representation of the temporal data takes place. The general quantization error refers to the distortion of each map unit with respect to its receptive field, which measures the extent of data space coverage by the units. If temporal data are considered, the distortion needs to be assessed back in time. For a formal definition, assume that a time series $(s_1, s_2, \ldots, s_t, \ldots)$ is presented to the network, again


with reverse indexing notation, i.e. $s_1$ is the most recent entry of the time series. Let $\mathrm{win}_i$ denote all time steps for which neuron $i$ becomes the winner in the considered recursive map model. The mean activation of neuron $i$ for time step $t$ in the past is the value

$$A_i(t) = \sum_{j \in \mathrm{win}_i} s_{j+t} \Big/ |\mathrm{win}_i| .$$

Assume that neuron $i$ becomes winner for a sequence entry $s_j$. It can then be expected that, as for the standard SOM, $s_j$ is close to the average $A_i(0)$, because the map is trained with Hebbian learning. Temporal specification takes place if, in addition, $s_{j+t}$ is close to the average $A_i(t)$ for $t > 0$. The temporal quantization error of neuron $i$ at time step $t$ back in the past is defined by

$$E_i(t) = \bigg( \sum_{j \in \mathrm{win}_i} \| s_{j+t} - A_i(t) \|^2 \bigg)^{1/2} .$$

This measures the extent up to which the values observed $t$ time steps back in the past coincide for a given winning neuron. Temporal specialization of neuron $i$ takes place if $E_i(t)$ is small for $t > 0$. Since no temporal context is learned for the standard SOM, the temporal quantization error will be large for $t > 0$, just reflecting specifics of the underlying time series such as smoothness or periodicity. For recursive models, this quantity allows us to assess the amount of temporal specification. The temporal quantization error of the entire map for $t$ time steps back into the past is defined as the average

$$E(t) = \sum_{i=1}^{N} E_i(t) \Big/ N .$$

This method allows us to evaluate whether the temporal dynamic in the recent past is faithfully represented.
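A minimal sketch of this measure, following the formulas as reconstructed above, is given below; forward time indexing is used, so the entry "t steps in the past" relative to step j is series[j - t], and neurons that never win contribute zero (a choice made here, not specified in the text). All names are hypothetical.

```python
import numpy as np

def temporal_quantization_error(series, winners, N, max_lag=29):
    """E(t) for t = 0..max_lag. series: (T, n) array of sequence entries,
    winners: length-T list with winners[j] = winning neuron at step j."""
    T = len(series)
    E = np.zeros(max_lag + 1)
    for t in range(max_lag + 1):
        errs = []
        for i in range(N):
            wins = [j for j in range(t, T) if winners[j] == i]
            if not wins:
                errs.append(0.0)                       # neuron never wins: no contribution
                continue
            past = series[np.array(wins) - t]          # entries t steps before each win of neuron i
            A_it = past.mean(axis=0)                   # mean activation A_i(t)
            errs.append(np.sqrt(np.sum((past - A_it) ** 2)))
        E[t] = np.mean(errs)                           # average over all N neurons
    return E
```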

    4.2 Temporal models

After the training of a recursive map, it can be used to obtain an explicit, possibly approximate description of the underlying global temporal dynamics. This offers another possibility to evaluate the dynamics of the SOM, because we can compare the extracted temporal model to the original one, if available, or to a temporal model extracted directly from the data. In addition, a compressed description of the global dynamics extracted from a trained SOM is interesting for data mining tasks. In particular, it can be tested whether the clustering properties of the SOM, referred to by U-matrix methods, transfer to the temporal domain.


Markov models constitute simple, though powerful, techniques for sequence processing and analysis [6,32]. Assume that $\Sigma = \{a_1, \ldots, a_d\}$ is a finite alphabet. The prediction of the next symbol refers to the task of anticipating the probability of $a_i$, having observed a sequence $s = (s_1, \ldots, s_t) \in \Sigma^*$ before. This is just the conditional probability $P(a_i \mid s)$. For finite Markov models, a finite memory length $l$ is sufficient to determine this probability, i.e. the probability

$$P(a_i \mid (s_1, \ldots, s_l, \ldots, s_t)) = P(a_i \mid (s_1, \ldots, s_l)), \qquad (t \ge l)$$

depends only on the past $l$ symbols instead of the whole context $(s_1, \ldots, s_t)$. Markov models can be estimated from given data if the order $l$ is fixed. It holds that

$$P(a_i \mid (s_1, \ldots, s_l)) = \frac{P((a_i, s_1, \ldots, s_l))}{\sum_j P((a_j, s_1, \ldots, s_l))} \qquad (1)$$

which means that the next symbol probability can be estimated from the frequencies of $(l + 1)$-grams.
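Equation (1) amounts to counting $(l+1)$-grams; a minimal sketch (hypothetical helper, forward text order, so the context is the $l$ symbols immediately preceding the predicted symbol):

```python
from collections import Counter

def next_symbol_probs(sequence, l):
    """Estimate P(a | context of length l) from (l+1)-gram frequencies, Eq. (1)."""
    ngrams = Counter(tuple(sequence[k:k + l + 1]) for k in range(len(sequence) - l))
    probs = {}
    for gram, count in ngrams.items():
        context, symbol = gram[:l], gram[l]
        total = sum(c for g, c in ngrams.items() if g[:l] == context)
        probs[(context, symbol)] = count / total
    return probs

# toy example: first-order transition probabilities of a short 0/1 string
print(next_symbol_probs("0010110100", 1))
```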

We are interested in the question whether a trained SOM-S can capture the essential probabilities for predicting the next symbol, generated by simple Markov models. For this purpose, we train maps on Markov models and afterwards extract the transition probabilities entirely from the obtained maps. This extraction can be done because of the specific form of context for SOM-S. Given a finite alphabet $\Sigma = \{a_1, \ldots, a_d\}$ for training, most neurons specialize during training and become winner for at least one or some stimuli. Winner neurons represent the input sequence entries by their trained weight vectors. Usually, the weight $w_i$ of neuron $n_i$ is very close to a symbol $a_j$ of $\Sigma$ and can thus be identified with the symbol. In addition, the neurons represent their context by an explicit reference to the location of the winner in the previous time step. The context vectors stored in the neurons define an intermediate winning position in the map, encoded by the parameters $(N1, N2, N3, \alpha_{12}, \alpha_{13})$ for the closest three neurons and the exact position within the triangle. We take this into account by extracting sequences corresponding to the averaged weights of all three potential winners of the previous time step. For the averaging, the contribution of each neuron to the interpolated position is considered. Repeating this back-referencing procedure recursively for each winner, weighted by its influence, yields an exponentially spreading number of potentially infinite time series for each neuron. This way, we obtain a probability distribution over time series that is representative for the history of each map neuron.⁵

⁵ Interestingly, one can formally prove that every finite length Markov model can be approximated by some map in this way in principle, i.e. for every Markov model of length $l$ a map exists such that the above extraction procedure yields the original model up to small deviations. Assume a fixed length $l$ and rational $P(a_i \mid (s_1, \ldots, s_l))$, and denote by $q$ the smallest common denominator of the transition probabilities. Consider a map in which for each symbol $a_i$ a cluster of neurons with weights $w_j = a_i$ exists. These main clusters are divided into subclusters enumerated by $s = (s_1, \ldots, s_l) \in \Sigma^l$ with $q \cdot P(a_i \mid s)$ neurons for each possible $s$. The context of each such neuron refers to another neuron within a cluster belonging to $s_1$ and to a subcluster belonging to $(s_2, \ldots, s_l, s_{l+1})$ for some arbitrary $s_{l+1}$. Note that the clusters can thereby be chosen contiguous on the map, respecting the topological ordering of the neurons. The extraction mechanism leads to the original Markov model (with rational probabilities) based on this map.


The number of specialized neurons for each time series is correlated to the probability of these stimuli in the original data source. Therefore, we can simply take the mean of the probabilities for all neurons and obtain a global distribution over all histories which are represented in the map. Since the standard SOM has a magnification factor different from 1, the number of neurons which represent a symbol $a_i$ deviates from the probability of $a_i$ in the given data [31]. This leads to a slightly biased estimation of the sequence probabilities represented by the map. Nevertheless, we will use the above extraction procedure as a sufficiently close approximation to the true underlying distribution. This compromise is taken because the magnification factor for recurrent SOMs is not known, and the techniques from [31] for its computation cannot be transferred to recurrent models. Our experiments confirm that the global trend is still correct. We have extracted, for every finite memory length $l$, the probability distribution for words in $\Sigma^{l+1}$ as they are represented in the map and determined the transition probabilities of equation (1).

The method as described above is a valuable tool to evaluate the representation capacity of the SOM for temporal structures. Obviously, fixed order Markov models can be better extracted directly from the given data, avoiding problems such as the magnification factor of the SOM. Hence, this method just serves as an alternative for the evaluation of temporal self-organizing maps and their capability of representing temporal dynamics. The situation is different if real-valued elements are processed, like in the case of obtaining symbolic structure from noisy sequences. Then, a reasonable quantization of the sequence entries must be found before a Markov model can be extracted from the data. The standard SOM together with U-matrix methods provides a valuable tool to find meaningful clusters in a given set of continuous data. It is an interesting question whether this property transfers to the temporal domain, i.e. whether meaningful clusters of real-valued sequence entries can also be extracted from a trained recursive model. SOM-S allows to combine both a reliable quantization of the sequence entries and the extraction mechanism for Markov models to take into account the temporal structure of the data.

For the extraction we extend U-Matrix methods to recursive models as follows [38]: the standard U-Matrix assigns to each neuron the averaged distance of its weight vector compared to its direct lattice neighbors:

$$U(n_i) = \sum_{\mathrm{nhd}(n_i, n_j) = 1} \| w_i - w_j \|$$
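With the neighbor lists and weights from the earlier sketches, the U-values can be computed in a few lines (a sketch; here the plain sum over direct neighbors is used, as in the formula above):

```python
import numpy as np

def u_matrix(W, neighbors):
    """U-value per neuron: distance of its weight to the weights of its direct
    lattice neighbors; large values mark borders between clusters."""
    return np.array([sum(np.linalg.norm(W[i] - W[j]) for j in neighbors[i])
                     for i in range(len(W))])
```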


In a trained map, neurons spread in regions of the data space where a high sample density can be observed, resulting in large U-values at the borders between clusters. Consequently, the U-Matrix forms a 3D landscape on the lattice of neurons, with valleys corresponding to meaningful clusters and hills at the cluster borders. The U-Matrix of weight vectors can be constructed also for SOM-S. Based on this matrix, the sequence entries can be clustered into meaningful categories, based on which the extraction of Markov models as described above is possible. Note that the U-Matrix is built by using only the weights assigned to the neurons, while the context information of SOM-S is ignored for now.⁶ However, since context information is used for training, clusters emerge which are meaningful with respect to the temporal structure, and this way they contribute implicitly to the topological ordering of the map and to the U-Matrix. Partially overlapping, noisy, and ambiguous input elements are separated during the training, because the different temporal contexts contain enough information to activate and produce characteristic clusters on the map. Thus, the temporal structure captured by the training allows a reliable reconstruction of the input sequences, which could not have been achieved by the standard SOM architecture.

    5 Experiments

    5.1 Mackey-Glass time series

The first task is to learn the dynamic of the real-valued chaotic Mackey-Glass time series

$$\frac{dx}{d\tau} = -b \, x(\tau) + \frac{a \, x(\tau - d)}{1 + x(\tau - d)^{10}}$$

using $a = 0.2$, $b = 0.1$, $d = 17$. This is the same setup as given in [41], making a comparison of the results possible.⁷ Three types of maps with 100 neurons have been trained: a 6-neighbor map without context, giving the standard SOM, a map with 6 neighbors and with context (SOM-S), and a 7-neighbor map providing a hyperbolic grid with context utilization (H-SOM-S). Each run has been computed with $1.5 \cdot 10^5$ presentations starting at random positions within the Mackey-Glass series, using a sample period of $\Delta t = 3$; the neuron weights have been initialized white within $[0.6, 1.4]$. The context has been taken into account by decreasing the parameter $\alpha$ from $\alpha = 1$ to $\alpha = 0.97$. The learning rate is exponentially decreased from 0.1 to 0.005 for the weight and context updates. The initial neighborhood cooperativity is 10, which is annealed to 1 during training.
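For reference, such a series can be generated by simple Euler integration of the delay differential equation above; this is a rough sketch (hypothetical helper; a smaller step size or a better integrator would give a more accurate trajectory):

```python
import numpy as np

def mackey_glass(T, a=0.2, b=0.1, d=17, dt=1.0, x0=1.2):
    """Euler integration of dx/dt = -b*x(t) + a*x(t-d) / (1 + x(t-d)**10)."""
    steps, delay = int(T / dt), int(d / dt)
    x = np.full(steps + delay, x0)
    for k in range(delay, steps + delay - 1):
        x_d = x[k - delay]
        x[k + 1] = x[k] + dt * (-b * x[k] + a * x_d / (1 + x_d ** 10))
    return x[delay:]

series = mackey_glass(3000)[::3]   # subsample with period 3, as in the setup above
```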

Figure 2 shows the temporal quantization error for the above setups: the temporal quantization error is expressed by the average standard deviation of the given sequence and the mean unit receptive field for 29 time steps into the past.

⁶ Preliminary experiments indicate that the context also orders topologically and yields meaningful clusters. The number of neurons in context clusters is thereby small compared to the number of neurons, and statistically significant results could not be obtained.
⁷ We would like to thank T. Voegtlin for providing data for comparison.


Similar to Voegtlin's results, we observe large cyclic oscillations for the standard SOM, driven by the periodicity of the training series. Since the SOM does not take contextual information into account, this quantization result can be seen as an upper bound for temporal models, at least for the indices > 0 reaching into the past (trivially, the SOM is a very good quantizer of scalar elements without history); the oscillating shape of the curve is explained by the continuity of the series and its quasi-periodic dynamic, and extrema exist rather by the nature of the series than by special model properties. Obviously, the very restricted context of RSOM does not yield a long term improvement of the temporal quantization error. However, the displayed error periodicity is anti-cyclic compared to the original series. Interestingly, the data-optimum topology of neural gas (NG), which also does not take contextual information into account, allows a reduction of the overall quantization error; however, the main characteristics, such as the periodicity, remain the same as for the standard SOM. RecSOM leads to a much better quantization error than RSOM and also NG. Thereby, the error is minimal for the immediate past (left side of the diagram) and increases when going back in time, which is reasonable because of the weighting of the context influence by $(1 - \alpha)$. The increase of the quantization error is smooth, and the final value after 29 time steps is better than the default given by the standard SOM. In addition, almost no periodicity can be observed for RecSOM. SOM-S and H-SOM-S further improve the results: only some periodicity can be observed, and the overall quantization error increases smoothly for the past values. Note that these models are superior to RecSOM in this task while requiring less computational power. H-SOM-S allows a slightly better representation of the immediate past compared to SOM-S due to the hyperbolic topology of the lattice structure, which better matches the characteristics of the input data.

Fig. 2. Temporal quantization errors of different model setups for the Mackey-Glass series, plotted against the index of past inputs (index 0: present); curves for SOM, RSOM, NG, RecSOM, SOM-S, and H-SOM-S. Results indicated by * are taken from [41].


    5.2 Binary automata

The second experiment is also inspired by Voegtlin. A discrete 0/1-sequence generated by a binary automaton with P(0|1) = 0.4 and P(1|0) = 0.3 shall be learned. For discrete data, the specialization of a neuron can be defined as the longest sequence that still leads to unambiguous winner selection. A high percentage of specialized neurons indicates that temporal context has been learned by the map. In addition, one can compare the distribution of specializations with the original distribution of strings as generated by the underlying probability. Figure 3 shows the specialization of a trained H-SOM-S. Training has been carried out with $3 \cdot 10^6$ presentations, increasing the context influence $(1 - \alpha)$ exponentially from 0 to 0.06. The remaining parameters have been chosen as in the first experiment. Finally, the receptive field has been computed by providing an additional number of $10^6$ test iterations. Putting more emphasis on the context results in a smaller number of active neurons representing rather long strings that cover only a small part of the total input space. If a Euclidean lattice is used instead of a hyperbolic neighborhood, the resulting quantizers differ only slightly, which indicates that the representation of binary symbols and their contexts in the 2-dimensional output space barely benefits from exponential branching. In the depicted run, 64 of the neurons express a clear profile, whereas the other neurons are located at sparse locations of the input data topology, between cluster boundaries, and thus do not win for the presented stimuli. The distribution corresponds nicely to the 100 most characteristic sequences of the probabilistic automaton, as indicated by the graph. Unlike for RecSOM (presented in [41]), neurons at interior nodes of the tree are also expressed for H-SOM-S. These nodes refer to transient states, which are represented by corresponding winners in the network. RecSOM, in contrast to SOM-S, does not rely on the winner index only, but uses a more complex representation: since the transient states are spared, longer sequences can be expressed by RecSOM.

(Figure 3 legend: 100 most likely sequences; H-SOM-S, 100 neurons; 64 specialized neurons; depth scale 0-11.)

Fig. 3. Receptive fields of a H-SOM-S compared to the most probable sub-sequences of the binary automaton. Left hand branches denote 0, right is 1.


Type           P(0)          P(1)          P(0|0)   P(1|0)   P(0|1)   P(1|1)
Automaton 1    4/7 ≈ 0.571   3/7 ≈ 0.429   0.7      0.3      0.4      0.6
Map (98/100)   0.571         0.429         0.732    0.268    0.366    0.634
Automaton 2    2/7 ≈ 0.286   5/7 ≈ 0.714   0.8      0.2      0.08     0.92
Map (138/141)  0.297         0.703         0.75     0.25     0.12     0.88
Automaton 3    0.5           0.5           0.5      0.5      0.5      0.5
Map (138/141)  0.507         0.493         0.508    0.492    0.529    0.471

Table 1. Results for binary automata extraction with different transition probabilities. The extracted probabilities clearly follow the original ones.

In addition to the examination of neuron specialization, the whole map representation can be characterized by comparing the input symbol transition statistics with the learned context-neuron relations. While the current symbol is coded by the winning neuron's weight, the previous symbol is represented by the average of the weights of the winner's context triangle neurons. The two obtained values, the neuron's state and the average state of the neuron's context, are clearly expressed in the trained map: only few neurons contain values in the indeterminate interval [1/3, 2/3], but most neurons specialize on values very close to 0 or 1. Results for the reconstruction of three automata can be found in table 1. For the reconstruction we have used the algorithm described in section 4.2 with memory length 1. The left column indicates the number of expressed neurons and the total number of neurons in the map. Note that the automata can be well reobtained from the trained maps. Again, the temporal dependencies are clearly captured by the maps.

    5.3 Reber grammar

In a third experiment we have used more structured symbolic sequences as generated by the Reber grammar illustrated in figure 4. The 7 symbols have been coded in a 6-dimensional Euclidean space by points that denote the same as a tetrahedron does with its four corners in three dimensions: all points have the same distance from each other.


    Fig. 4. State graph of the Reber grammar.


For training and testing we have taken the concatenation of randomly generated words, thus preparing sequences of $3 \cdot 10^6$ and $10^6$ input vectors, respectively. The map has a radius of 5 and contains 617 neurons on a hyperbolic grid. For the initialization and the training, the same parameters as in the previous experiment were used, except for an initially larger neighborhood range of 14, corresponding to the larger map. Context influence was taken into account by decreasing $\alpha$ from 1 to 0.8 during training. A number of 338 neurons developed a specialization for Reber strings with an average length of 7.23 characters. Figure 5 shows that the neuron specializations produce strict clusters on the circular grid, ordered in a topological way by the last character. In agreement with the grammar, the letter T takes the largest sector on the map. The underlying hyperbolic lattice gives rise to sectors, because they clearly minimize the boundary between the 7 classes. The symbol separation is further emphasized by the existence of idle neurons between the boundaries, which can be seen analogously to large values in a U-Matrix. Since neuron specialization proceeds from the most common states, which are the 7 root symbols, to the increasingly special cases, the central nodes have fallen idle after having served as signposts during training; finally, the most specialized nodes with their associated strings are found at the lattice edge on the outer ring. Much in contrast to this ordered hyperbolic target lattice, the result for the Euclidean grid in figure 7 shows a neuron arrangement in the form of polymorphic coherent patches.

Similar to the binary automata learning tasks, we have analyzed the map representation by reconstructing the trained data, backtracking all possible context sequences of each neuron up to length 3. Only 118 of all 343 combinatorially possible trigrams are realized. In a ranked table, the most likely 33 strings cover all attainable Reber trigrams. In the log-probability plot of figure 6 there is a leap between entry number 33 (TSS, valid) and 34 (XSX, invalid), emphasizing that the Reber characteristic has been captured. The correlation between the probabilities of Reber trigrams and their relative frequencies found in the map is 0.75. An explicit comparison of the probabilities of valid Reber strings can be found in figure 8. The values deviate from the true probabilities, in particular for cycles of the Reber graph, such as consecutive letters T and S, or the VPX-circle. This effect is due to the magnification factor of the SOM, which differs from 1 and is further amplified when sequences are processed in the proposed recursive manner.
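The reported ranking and the correlation of 0.75 refer to a comparison of the following kind; the dictionaries holding extracted frequencies and grammar probabilities are placeholders, not data from the paper.

    import numpy as np

    def rank_trigrams(extracted_freq):
        """Sort extracted trigram frequencies in descending order; the position
        where the first invalid trigram appears marks the 'leap' discussed above."""
        return sorted(extracted_freq.items(), key=lambda kv: -kv[1])

    def trigram_correlation(extracted_freq, grammar_prob):
        """Pearson correlation between trigram frequencies found in the map and
        the probabilities of the valid Reber trigrams (0.75 in the experiment)."""
        trigrams = sorted(grammar_prob)
        freq = np.array([extracted_freq.get(t, 0.0) for t in trigrams])
        prob = np.array([grammar_prob[t] for t in trigrams])
        return np.corrcoef(freq, prob)[0, 1]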

    5.4 Finite memory models

In a final series of experiments, we examine a SOM-S trained on Markov models with noisy input sequence entries. We investigate the possibility of extracting temporal dependencies on real-valued sequences from a trained map. The Markov model possesses a memory length of 2, as depicted in figure 9. The basic symbols are denoted by a, b, and c. These are embedded in two dimensions, disrupted by noise, as


Fig. 7. Arrangement of Reber words on a Euclidean lattice structure. The words are arranged according to their most recent symbols (shown on the right of the sequences). Patches emerge according to the most recent symbol. Within the patches, an ordering according to the preceding symbols can be observed.


    Fig. 8. Frequency reconstruction of trigrams from the Reber grammar.


follows: a stands for (0, 0) plus noise, b for (1, 0) plus noise, and c for (0, 1) plus noise, where the noise is independent and Gaussian with standard deviation g, a variable to be tested in the experiments. The symbols are denoted right to left, i.e. ab indicates that the currently emitted symbol is a, after having observed symbol b in the previous step. Thus, b and c are always succeeded by a, whereas a is succeeded with probability x by b and with probability (1 - x) by c if the past symbol was b, and vice versa if the last symbol was c. The transition probability x is varied between the experiments. We train a SOM-S with a regular rectangular two-dimensional lattice structure and 100 neurons on a generated Markov series. The context parameter was decreased from 0.97 to 0.93, the neighborhood radius was decreased from 5 to 0.5, and the learning rate was annealed from 0.02 to 0.005. A number of 1000 patterns are presented in 15000 cycles. U-Matrix clustering has been calculated with a level of the landscape chosen such that half the neurons are contained in valleys. The neurons in the same valley are assigned to the same cluster, and the number of different clusters is determined. Afterwards, all remaining neurons are assigned to their closest cluster.
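The valley-based clustering can be sketched as follows for a map on a regular rectangular grid; the U-Matrix array u and the use of connected-component labelling from scipy are our assumptions, while the criterion that half of the neurons lie below the chosen level follows the description above.

    import numpy as np
    from scipy.ndimage import label, distance_transform_edt

    def umatrix_valley_clustering(u):
        """Cluster neurons of a rectangular map from their U-Matrix values u (2D array)."""
        level = np.median(u)                    # half of the neurons lie in valleys
        valleys = u <= level
        clusters, n_clusters = label(valleys)   # connected valley regions = clusters
        # assign every remaining neuron to the cluster of its nearest valley neuron
        nearest = distance_transform_edt(~valleys, return_distances=False,
                                         return_indices=True)
        return clusters[tuple(nearest)], n_clusters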

First, we choose a noise level of g = 0.1, such that almost no overlap can be observed, and we investigate this setup for different x between 0 and 0.8. In all the results, three distinct clusters, corresponding to the three symbols, are found with the U-Matrix method. The extraction of the order 2 Markov models indicates that the global transition probabilities are correctly represented in the maps. Table 2 shows the corresponding extracted probabilities. The exact probabilities cannot be recovered because the magnification factor of the SOM differs from 1; however, the global trend is clearly found, and the extracted probabilities are in good agreement with the chosen values.

In a second experiment, the transition probability is fixed to x = 0.4, but the noise level is modified, choosing g between 0.1 and 0.5. All training parameters are chosen as in the previous experiment. Note that a noise level of g = 0.3 already yields much overlap of the classes, as depicted in figure 10. Nevertheless, three clusters can be detected in all cases and the transition probabilities can be recovered, except for a noise level of 0.5, for which the training scenario degenerates to an almost deterministic case, making a the most dominant state. Table 3 summarizes the extracted probabilities.

Fig. 9. Markov automaton with 3 basic states and a finite order of 2 used to train the map.
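For reference, the automaton of figure 9 can be turned into a data generator along the following lines (a sketch with our own naming; the start state is chosen arbitrarily).

    import numpy as np

    def generate_sequence(n, x, g, seed=0):
        """Generate n noisy 2D vectors from the order-2 Markov model of figure 9.

        a = (0, 0), b = (1, 0), c = (0, 1), each disturbed by Gaussian noise with
        standard deviation g; b and c are always followed by a, and a is followed
        by b with probability x if the symbol before it was b (by c otherwise),
        and vice versa if the symbol before it was c.
        """
        rng = np.random.default_rng(seed)
        proto = {'a': (0.0, 0.0), 'b': (1.0, 0.0), 'c': (0.0, 1.0)}
        prev, cur = 'b', 'a'                 # arbitrary valid start: ... b a
        symbols = []
        for _ in range(n):
            symbols.append(cur)
            if cur in ('b', 'c'):
                prev, cur = cur, 'a'         # b and c deterministically lead to a
            else:                            # cur == 'a': look at the symbol before it
                repeat = rng.random() < x
                nxt = ('b' if repeat else 'c') if prev == 'b' else ('c' if repeat else 'b')
                prev, cur = cur, nxt
        noise = rng.normal(0.0, g, size=(n, 2))
        return symbols, np.array([proto[s] for s in symbols]) + noise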


Fig. 10. Symbols a, b, c embedded in R^2 as a = (0, 0), b = (1, 0), and c = (0, 1), each subject to additive Gaussian noise of different levels: the noise levels shown are 0.1, 0.3, and 0.4. The latter two noise levels show considerable overlap of the classes which represent the symbols.

x        0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8
P(a|ab)  0     0.01  0     0.01  0     0.04  0     0.04  0.01
P(b|ab)  0     0.08  0.3   0.31  0.38  0.55  0.68  0.66  0.78
P(c|ab)  1     0.91  0.7   0.68  0.62  0.41  0.32  0.3   0.21
P(a|ac)  0     0     0     0     0     0.01  0.01  0     0.01
P(b|ac)  1     0.81  0.8   0.66  0.52  0.55  0.32  0.31  0.24
P(c|ac)  0     0.19  0.2   0.34  0.48  0.44  0.67  0.69  0.75

Table 2
Transition probabilities extracted from the trained map. The noise level was fixed to 0.1 and different generating transition probabilities x were used.

noise    0.1   0.2   0.3   0.4   0.5   true
P(a|ab)  0.01  0     0     0.1   0.98  0
P(b|ab)  0.42  0.49  0.4   0.24  0.02  0.4
P(c|ab)  0.57  0.51  0.6   0.66  0.02  0.6
P(a|ac)  0.01  0     0     0.09  0     0
P(b|ac)  0.59  0.6   0.44  0.39  0     0.6
P(c|ac)  0.4   0.4   0.56  0.52  0     0.4

Table 3
Probabilities extracted from the trained map with fixed input transition probabilities and different noise levels. For a noise level of 0.5, the extraction mechanism breaks down and the symbol a becomes most dominant. For smaller noise levels, extraction of the symbols can still be done even for overlapping clusters, because of the temporal differentiation of the clusters in recursive models.


    6 Conclusions

We have presented a self organizing map with a neural back-reference to the previously active sites and with a flexible topological structure of the neuron grid. For context representation, the compact and powerful SOMSD model as proposed in [11] has been used. Compared to TKM and RSOM, much more flexibility and expressiveness is obtained, because the context is represented in the space spanned by the neurons, and not only in the domain of the weight space. Compared to RecSOM, which is based on very extensive contexts, the SOMSD model is much more efficient. However, SOMSD requires an appropriate topological representation of the symbols, since distances of contexts are measured in the grid space. We have therefore extended the map configuration to more general triangular lattices, thus also making hyperbolic models possible as introduced in [30]. Our SOM-S approach has been evaluated on several data series including discrete and real-valued entries. Two experimental setups have been taken from [41] to allow a direct comparison with different models. As pointed out, the compact model introduced here improves on the capacity of simple leaky integrator networks like TKM and RSOM and shows results competitive with the more complex RecSOM.

Since the context of SOM-S directly refers to the previous winner, temporal contexts can be extracted from a trained map. An extraction scheme to obtain Markov models of fixed order has been presented, and its reliability has been confirmed in three experiments. As demonstrated, this mechanism can be applied to real-valued sequences, extending U-Matrix methods to the recursive case.

So far, the topological structure of context formation has not been taken into account during the extraction. Context clusters, in addition to weight clusters, provide more information, which might be used for the determination of appropriate orders of the models, or for the extraction of more complex settings like hidden Markov models. We are currently investigating experiments aiming at these issues. However, preliminary results indicate that Hebbian training, as introduced in this article, allows the reliable extraction of finite memory models only. More sophisticated training algorithms should be developed for more complex temporal dependencies.

Interestingly, the proposed context model can be interpreted as the development of long range synaptic connections, leading to more specialized map regions. Statistical counterparts to unsupervised sequence processing, like the Generative Topographic Mapping Through Time (GTMTT) [5], incorporate similar ideas by describing temporal data dependencies with hidden Markov latent space models. Such a context affects the prior distribution on the space of neurons. Due to computational restrictions, the transition probabilities of GTMTT are usually limited to local connections only. Thus, long range connections like those in the presented context model do not emerge; rather, visualizations similar to (though more powerful than) those of TKM and RSOM arise. It would be interesting to develop more efficient statistical counterparts which also allow the emergence of interpretable long range connections such as those of the deterministic SOM-S.


    References

[1] G. Barreto and A. Araujo. Time in self-organizing maps: An overview of models. Int. Journ. of Computer Research, 10(2):139-179, 2001.

[2] G. de A. Barreto, A. F. R. Araujo, and S. C. Kremer. A taxonomy for spatiotemporal connectionist networks revisited: the unsupervised case. Neural Computation, 15(6):1255-1320, 2003.

[3] H.-U. Bauer and T. Villmann. Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks, 8(2):218-226, 1997.

[4] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10(1):215-235, 1998.

[5] C. M. Bishop, G. E. Hinton, and C. K. I. Williams. GTM through time. Proceedings IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K., pages 111-116, 1997.

[6] P. Bühlmann and A. J. Wyner. Variable length Markov chains. Annals of Statistics, 27:480-513, 1999.

[7] O. A. Carpinteiro. A hierarchical self-organizing map for sequence recognition. Neural Processing Letters, 9(3):209-220, 1999.

[8] G. Chappell and J. Taylor. The temporal Kohonen map. Neural Networks, 6:441-445, 1993.

[9] I. Farkas and R. Miikkulainen. Modeling the self-organization of directional selectivity in the primary visual cortex. Proceedings of ICANN'99, Edinburgh, Scotland, pages 251-256, 1999.

[10] M. Hagenbuchner, A. C. Tsoi, and A. Sperduti. A supervised self-organising map for structured data. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organising Maps, pages 21-28. Springer, 2001.

[11] M. Hagenbuchner, A. Sperduti, and A. C. Tsoi. A Self-Organizing Map for Adaptive Processing of Structured Data. IEEE Transactions on Neural Networks, 14(3):491-505, 2003.

[12] B. Hammer. On the learnability of recursive data. Mathematics of Control, Signals, and Systems, 12:62-79, 1999.

[13] B. Hammer, A. Micheli, and A. Sperduti. A general framework for unsupervised processing of structured data. In M. Verleysen, editor, European Symposium on Artificial Neural Networks 2002, pages 389-394. D Facto, 2002.

[14] B. Hammer, A. Micheli, M. Strickert, and A. Sperduti. A general framework for unsupervised processing of structured data. To appear in: Neurocomputing.

[15] B. Hammer, A. Micheli, and A. Sperduti. A general framework for self-organizing structure processing neural networks. Technical report TR-03-04, Università di Pisa, 2003.

[16] J. Joutsensalo and A. Miettinen. Self-organizing operator map for nonlinear dimension reduction. Proceedings ICNN'95, 1:111-114, IEEE, 1995.

[17] J. Kangas. On the analysis of pattern sequences by self-organizing maps. PhD thesis, Helsinki University of Technology, Espoo, Finland, 1994.


[18] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM - self-organizing maps of document collections. Neurocomputing, 21(1):101-117, 1998.

[19] S. Kaski and J. Sinkkonen. A topography-preserving latent variable model with learning metrics. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organizing Maps, pages 224-229, Springer, 2001.

[20] T. Kohonen. The neural phonetic typewriter. Computer, 21(3):11-22, 1988.

[21] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 2001.

[22] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Recurrent SOM with local linear models in time series prediction. In M. Verleysen, editor, 6th European Symposium on Artificial Neural Networks, pages 167-172, De Facto, 1998.

[23] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Time series prediction using recurrent SOM with local linear models. International Journal of Knowledge-based Intelligent Engineering Systems, 2(1):60-68, 1998.

[24] S. C. Kremer. Spatio-temporal connectionist networks: A taxonomy and review. Neural Computation, 13(2):249-306, 2001.

[25] J. Laaksonen, M. Koskela, S. Laakso, and E. Oja. PicSOM - content-based image retrieval with self-organizing maps. Pattern Recognition Letters, 21(13-14):1199-1207, 2000.

[26] J. Lampinen and E. Oja. Self-organizing maps for spatial and temporal AR models. In M. Pietikäinen and J. Röning, editors, Proceedings of the 6th SCIA, pages 120-127, Helsinki, Finland, 1989.

[27] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1994.

[28] T. Martinetz, S. G. Berkovich, and K. J. Schulten. Neural-gas networks for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558-569, 1993.

[29] J. Ontrup and H. Ritter. Text categorization and semantic browsing with self-organizing maps on non-euclidean spaces. In L. D. Raedt and A. Siebes, editors, Proceedings of PKDD-01, pages 338-349. Springer, 2001.

[30] H. Ritter. Self-organizing maps on non-Euclidian spaces. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 97-110. Elsevier, 1999.

[31] H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, 1992.

[32] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. Machine Learning, 25:117-150, 1996.

[33] J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217-239, 2002.

[34] P. Sommervuo. Self-organizing maps for signal and symbol sequences. PhD thesis, Helsinki University of Technology, 2000.

[35] A. Sperduti. Neural networks for adaptive processing of structured data. In Proc. ICANN 2001, pages 5-12. Springer, 2001.

[36] M. Strickert, T. Bojer, and B. Hammer. Generalized relevance LVQ for time series. In Proc. ICANN 2001, pages 677-683. Springer, 2001.


[37] M. Strickert and B. Hammer. Neural Gas for Sequences. In Proc. WSOM'03, pages 53-57, 2003.

[38] A. Ultsch and C. Vetter. Selforganizing Feature Maps versus Statistical Clustering: A Benchmark. Research Report No. 9, Dep. of Mathematics, University of Marburg, 1994.

[39] M. Varsta, J. del R. Milan, and J. Heikkonen. A recurrent self-organizing map for temporal sequence processing. In Proc. ICANN'97, pages 421-426. Springer, 1997.

[40] M. Varsta, J. Heikkonen, and J. Lampinen. Analytical comparison of the temporal Kohonen map and the recurrent self organizing map. In M. Verleysen, editor, ESANN 2000, pages 273-280, De Facto, 2000.

[41] T. Voegtlin. Recursive self-organizing maps. Neural Networks, 15(8-9):979-991, 2002.
