Intelligent Database Systems Lab, N.Y.U.S.T. I.M.

Externally growing self-organizing maps and its application to e-mail database visualization and exploration

Presenter: Wu, Jia-Hao

Authors: Andreas Nurnberger, Marcin Detyniecki

ASC (2005)

Outline

Motivation

Objective

Methodology

Experiments

Conclusion

Personal Comments

Motivation

The increasing number of e-mails has to be handled: LISTSERV sends 30 million messages per day in approximately 190,000 mailing lists.

The total number of mailing list messages is estimated at 36.5 billion per year.

Classifying e-mails is particularly difficult:

E-mails contain irrelevant information in the form of signatures.

E-mails are very rich in made-up words, slang and abbreviations, for instance e-mail smileys (e.g. <g>, lol).

E-mails quote parts of preceding e-mails, which may partly cover different topics.

Objective

Provide an intuitive visual profile of the considered mailing lists. The user can easily scan for e-mails similar in content.

Offer an intuitive navigation tool, where similar e-mails are located close to each other. The tool imports messages from a mailing list and arranges these e-mails in groups based on a similarity measure.

Methodology 1 - SOM

A neural network that clusters high-dimensional data vectors according to a similarity measure.

The clusters are arranged in a low-dimensional topology, a grid structure, usually a two-dimensional arrangement of squares or hexagons.

Objects assigned to one cluster are similar to each other, as in every cluster analysis.

Objects of nearby clusters are expected to be more similar than objects in more distant clusters.
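The behaviour described above can be illustrated with a minimal sketch of SOM training. It assumes a rectangular grid, Euclidean distance as the similarity measure, a Gaussian neighbourhood and linearly decaying rates; none of these choices, nor the names train_som and best_matching_unit, are taken from the slides.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def train_som(data, rows, cols, epochs=20, lr0=0.5, radius0=1.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius shrink over time (assumed schedule).
        frac = 1.0 - epoch / epochs
        lr, radius = lr0 * frac, max(radius0 * frac, 0.5)
        for x in data:
            winner = best_matching_unit(weights, x)
            for pos, w in weights.items():
                grid_dist = math.dist(pos, winner)
                # Gaussian neighbourhood: units close to the winner move more.
                h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights

# Toy input: two well-separated 2-D blobs of 30 points each.
rng = random.Random(1)
data = [[rng.gauss(0.2, 0.03), rng.gauss(0.2, 0.03)] for _ in range(30)] + \
       [[rng.gauss(0.8, 0.03), rng.gauss(0.8, 0.03)] for _ in range(30)]
som = train_som(data, rows=1, cols=4)
cells = [best_matching_unit(som, x) for x in data]
```

Each entry of cells is the map cell a data point is assigned to; points from the same blob should land in the same or neighbouring cells.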

Methodology 1 - SOM advantage & disadvantage

Advantages:

Intuitive visualization.

Good exploration possibilities.

Disadvantage:

The size and shape of the map have to be defined in advance.

Solution:

The map usually has to be trained several times.

Compute the classification error.

Empty cells, to which no document is assigned, indicate that the growing process can be stopped.

Use the growing SOM method.

Methodology 2 - Externally Growing SOM

The main alterations:

Use a hexagonal map structure.

Restrict the algorithm to adding new units around the external units of the map.

Add a new unit close to the external unit that achieved the highest error.

When the accumulated error of one unit of the map exceeds a specified threshold value, the algorithm can resolve this state by adding a unit.
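A minimal sketch of one such growth step on a hexagonal grid follows. It assumes axial coordinates (q, r) for the cells and simply takes the first free neighbouring position; the function names, the coordinate scheme and the choice of position are illustrative assumptions, and the initialization of the new unit's weight vector is left out.

```python
# Axial coordinates (q, r) for a hexagonal grid; each cell has six neighbours.
HEX_DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def free_neighbours(cell, occupied):
    """Grid positions next to `cell` that hold no unit yet."""
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in HEX_DIRECTIONS
            if (q + dq, r + dr) not in occupied]

def grow_once(errors, threshold):
    """Add one unit next to the external unit with the highest error.

    `errors` maps each occupied cell to its accumulated error.
    Returns the new cell, or None if no unit exceeds the threshold.
    """
    if max(errors.values()) <= threshold:
        return None
    # External units are those with at least one free neighbouring position.
    external = [c for c in errors if free_neighbours(c, errors)]
    worst = max(external, key=lambda c: errors[c])
    new_cell = free_neighbours(worst, errors)[0]  # assumption: first free slot
    errors[new_cell] = 0.0
    return new_cell

# A centre cell surrounded by six neighbours: the centre is internal.
errors = {(0, 0): 9.0}
errors.update({d: 1.0 for d in HEX_DIRECTIONS})
errors[(1, 0)] = 5.0
added = grow_once(errors, threshold=4.0)
print(added)  # (2, 0)
```

In this toy map the centre unit has the largest error overall, but it has no free neighbouring position, so the new unit is attached to the external unit with the largest error instead.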

Methodology 2 - A learning method for GSOM

1: Predefine the initial grid size (usually 2 x 2 units are used).

2: Initialize the assigned vectors with randomly selected values. Reset the error values ei for every unit i.

3: Train the map using all input patterns for a fixed number of iterations.

4: Identify the unit with the largest accumulated error.

5: If this error does not exceed a threshold value, stop training.

6: Identify the external unit k with the largest accumulated error.

7: Add a new unit next to the unit k.

8: Continue with step 3.

9: Continue training of the map for a fixed number of iterations. Reduce the learning rate during training.
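The control flow of these steps can be sketched on a deliberately simplified map. The sketch assumes a 1-D chain of scalar units instead of a hexagonal grid, winner-only updates without a neighbourhood function, and a cap on the number of units; the two ends of the chain play the role of the external units. All names and parameter values are illustrative, not the authors' implementation.

```python
import random

def train(units, data, lr, passes=5):
    """Steps 3 and 9: winner-only updates; returns each unit's accumulated
    error from the final pass over the data."""
    errors = [0.0] * len(units)
    for _ in range(passes):
        errors = [0.0] * len(units)             # reset the error values
        for x in data:
            s = min(range(len(units)), key=lambda i: abs(units[i] - x))
            errors[s] += abs(units[s] - x)       # accumulate the winner's error
            units[s] += lr * (x - units[s])
    return errors

def grow_chain(data, threshold, max_units=20, seed=0):
    rng = random.Random(seed)
    units = sorted([rng.random(), rng.random()])  # steps 1-2: tiny initial map
    while len(units) < max_units:
        errors = train(units, data, lr=0.1)       # step 3
        if max(errors) <= threshold:              # steps 4-5
            break
        # Step 6: on a chain only the two end units are external.
        if errors[0] >= errors[-1]:
            units.insert(0, units[0] - 1e-3)      # step 7: new unit beside unit k
        else:
            units.append(units[-1] + 1e-3)
        # Step 8: the loop continues with training.
    train(units, data, lr=0.01)                   # step 9: reduced learning rate
    return units

rng = random.Random(42)
data = [rng.random() for _ in range(200)]
units = grow_chain(data, threshold=2.0)
print(len(units))
```

The max_units cap is an added safeguard so that the loop always terminates, even when the unit with the largest error is not an external one.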

Experiments

Example: a data set consisting of 1000 feature vectors defining 6 randomly generated clusters in the 3-dimensional data space was used (A).

An initial map was trained; it is depicted in (D).

Experiments (Cont.)

150 randomly generated data points of class 4 were added (B).

The resulting map is depicted in (E).

Experiments (Cont.)

A new class of 150 data points was created at the center, in between the 6 classes (C).

The resulting map is depicted in (F).

Experiments (Cont.)

Each document i is described by a numerical feature vector Di = {x1, …, xt}.

The query vector can be compared to each document. A result list can be obtained by ordering the documents according to the computed similarity.

It is possible to use binary term vectors:

1 → the corresponding word is used in the document.

0 → the corresponding word is not used in the document.

Term weighting schemes improve the performance.

Large weights → terms that are used frequently.
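A minimal sketch of binary term vectors and a ranked result list follows. It assumes whitespace tokenization and uses the number of shared terms as the similarity; the vocabulary, the documents and the function names are made up for illustration.

```python
def binary_vector(text, vocabulary):
    """1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def rank(query, documents, vocabulary):
    """Order document indices by the number of terms shared with the query."""
    q = binary_vector(query, vocabulary)
    scores = []
    for i, doc in enumerate(documents):
        d = binary_vector(doc, vocabulary)
        scores.append((sum(a * b for a, b in zip(q, d)), i))
    # Highest score first; ties are broken by document index.
    return [i for score, i in sorted(scores, key=lambda s: (-s[0], s[1]))]

vocabulary = ["fuzzy", "neuro", "genetic", "map"]
documents = ["genetic algorithms", "fuzzy neuro systems", "neuro map"]
print(rank("fuzzy neuro", documents, vocabulary))  # [1, 2, 0]
```

With a term weighting scheme the 0/1 entries would be replaced by real-valued weights, while the ranking step stays the same.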

Experiments (Cont.)

The importance of a word in a specific document of the considered collection:

The similarity S of two documents:

For each word k in the vocabulary, the entropy was computed as defined:
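One commonly used form of these three quantities is given here as an assumption, since the text above does not state them: tf-idf weighting with length normalization, the scalar product of the weight vectors as similarity, and a normalized entropy per word. The symbols are illustrative: tf_ik is the frequency of word k in document i, n_k the number of documents containing word k, N the number of documents, t the vocabulary size, and p_jk the share of the occurrences of word k that fall in document j.

```latex
\[
w_{ik} = \frac{tf_{ik}\,\log(N/n_k)}
              {\sqrt{\sum_{j=1}^{t}\bigl(tf_{ij}\,\log(N/n_j)\bigr)^{2}}}
\]
\[
S(D_i, D_j) = \sum_{k=1}^{t} w_{ik}\, w_{jk}
\]
\[
W_k = 1 + \frac{1}{\log_2 N}\sum_{j=1}^{N} p_{jk}\,\log_2 p_{jk},
\qquad
p_{jk} = \frac{tf_{jk}}{\sum_{l=1}^{N} tf_{lk}}
\]
```

Under this form W_k lies between 0 and 1: a word that occurs in only one document gets W_k = 1, and a word spread evenly over all documents gets W_k = 0.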

Experiments (Cont.)

A document is described based on a ‘statistical fingerprint’

Experiments (Cont.)

The authors created two simple artificial data sets. They define documents dealing with the topics fuzzy, neuro and genetic, plus five arbitrary keywords.

Each data set consists of 500 feature vectors.

Data set (A): each document contains just one of the topic keywords; the five remaining keywords are arbitrarily chosen.

Data set (B): each document contains exactly two of the topic keywords.
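Data sets of this kind could be generated as in the following sketch. It assumes binary feature vectors with three topic positions followed by five arbitrary-keyword positions, each arbitrary keyword switched on with probability 0.5; the encoding, the seeds and the function name are assumptions made for illustration.

```python
import random

TOPICS = ["fuzzy", "neuro", "genetic"]
N_ARBITRARY = 5  # five further keywords, switched on at random (assumption)

def make_documents(n_docs, topics_per_doc, seed):
    """Build binary feature vectors: 3 topic bits followed by 5 arbitrary bits."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        chosen = set(rng.sample(range(len(TOPICS)), topics_per_doc))
        topic_bits = [1 if i in chosen else 0 for i in range(len(TOPICS))]
        arbitrary_bits = [rng.randint(0, 1) for _ in range(N_ARBITRARY)]
        docs.append(topic_bits + arbitrary_bits)
    return docs

data_a = make_documents(500, topics_per_doc=1, seed=0)  # exactly one topic keyword
data_b = make_documents(500, topics_per_doc=2, seed=1)  # exactly two topic keywords
```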

Experiments (Cont.)

A SOM was learned using the data set (A).

In a second run, data set (B) was added to the training data set (A).

Experiments (Cont.)

The capabilities of the tool:

Keyword search and visualization using the maps: the distribution of keyword search results can be visualized by coloring.

Navigating in the document map: the neighbouring cells of a node can be identified.

Content based searching: the document map is reused.

Global and user profile visualization: the user can decide which cells to search.

Visualizing changes: the document map is recorded.

Experiments (Cont.)

Content based searching

Experiments (Cont.)

The application interface

Conclusion

Advantages:

The approach can be applied to text documents, especially in combination with iterative keyword search.

GSOMs are able to adapt their size and structure to the data.

Adding new cells only around the growing map does not affect the ability of the map to learn any type of data.

Problem:

Very short e-mails are often not correctly classified.

Future work:

The use of non-text documents (e.g. images) and the integration of user feedback.

Personal Comments

Advantage: a faster exploration method.

Drawback: …

Application: information retrieval.