A New Evolving Tree for Text Document Clustering and Visualization 1 Wui Lee Chang, 1* Kai Meng Tay, 2 Chee Peng Lim 1 Faculty of Engineering, Universiti Malaysia Sarawak, Malaysia 2 Centre for Intelligent Systems Research, Deakin University, Australia. 1* [email protected] WSC 17 ( 2012)


A New Evolving Tree for Text Document Clustering and Visualization

1 Wui Lee Chang, 1* Kai Meng Tay, 2 Chee Peng Lim

1 Faculty of Engineering, Universiti Malaysia Sarawak, Malaysia

2 Centre for Intelligent Systems Research, Deakin University, Australia

1* [email protected]

WSC 17 ( 2012)


Presentation Outline

• Introduction
• Problem Statements
• Motivations and Objectives
• Preliminary: Evolving Tree
• A General Application Framework for Evolving Systems
• The Proposed Procedure
• Experimental Results
• Concluding Remarks


Introduction: Clustering

• To group data into clusters based on their similarity levels.

• Examples include the Self-Organizing Map (SOM), K-means, and Fuzzy C-means.


Introduction: Textual Document Clustering

• To cluster/group sets of textual documents based on their similarity levels.
• To ease information retrieval.
• Examples:
– the naive Bayes-based document clustering model [21],
– WEBSOM [22], and
– a support vector machine-based method for imbalanced text document classification [23].

[21] Lewis, D.: Naïve Bayes at forty: The independence assumption in information retrieval. In: ECML (1998)

[22] Azcarraga, A.P., Yap, T.J., Tan, J., Chua, T.S.: Evaluating keyword selection methods for WEBSOM text archives. In: IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 3, pp. 380-383 (2004)

[23] Liu, T., Loh, H.T., Sun, A.: Imbalanced text classification: A term weighting approach. In: Expert Systems with Applications, vol. 36, pp. 690-701 (2009)


Problem Statement: 1

• Traditional textual document clustering uses off-line learning.
– Weakness: the model needs to be re-learned whenever a new document is fed.
– An adaptive or evolving feature model can be an alternative to traditional methods.
– Evolving increases the learning flexibility.
– WEBSOM focuses on off-line learning.


Problem Statement: 2

• For SOM (or WEBSOM):
– The map size is difficult to determine before learning [19].
– The map size also affects the learning time [19].

[19] Pakkanen, J., Iivarinen, J., Oja, E.: The Evolving Tree – Analysis and Applications. In: IEEE Transactions on Neural Networks, vol. 17, no.3, pp.591-603 (2006)


Motivations and Objectives

• To construct an adaptive textual document clustering tool based on Evolving Tree (ETree).

• To apply a general application framework for Evolving Systems [24].

• To analyze the adaptive activity of the proposed method with UNIMAS ENCON 2008 articles.

[24] Lughofer, E.: Evolving Fuzzy Systems – Methodologies, Advanced Concepts and Applications. Ed.1, Springer (2011)


Preliminary: Evolving Tree (ETree)

• Forms a tree structure that contains a root node, trunk nodes, and leaf nodes.
• The root node is the first node created in the tree.
• Trunk nodes connect the leaf nodes.
• Leaf nodes are the clusters formed.
• Able to expand hierarchically (forming a tree structure) to scale with the data.
• The hierarchical structure helps to control the complexity.


Preliminary: Evolving Tree (ETree)

• A node is indexed by $l$ and is written $N_{l,j}$ in the figures.
• Each node has a best matching unit (BMU) hit counter.
• The splitting threshold, $b_{splitting}$, is predetermined.
• The number of children nodes created at a split, $n_{childnode}$, is predetermined.
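
As a rough illustration of the bookkeeping listed on this slide, a node could be stored as shown below. This is a minimal sketch, not the authors' implementation: the class name `Node`, its field names, and the `weight` prototype vector are assumptions. Only the hit counter, the splitting threshold, and the number of children per split come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One ETree node (hypothetical layout, for illustration only)."""
    index: int                                   # node index l
    parent: Optional["Node"] = None              # None for the root node
    weight: List[float] = field(default_factory=list)   # prototype vector (assumed)
    hits: int = 0                                # BMU hit counter
    children: List["Node"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        # Leaf nodes have no children; they are the clusters.
        return not self.children


# Both parameters are fixed before learning starts.
B_SPLITTING = 10   # splitting threshold (value used in the slides' experiment)
N_CHILDNODE = 2    # children created per split (value shown in the slides)

root = Node(index=1, weight=[0.0, 0.0])
```

A freshly created tree has only the root, which is also a leaf until its first split.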


Preliminary: Evolving Tree (ETree)--The Learning Algorithm

• Finding the BMU.
• Updating leaf nodes.
• Expanding the tree.


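Preliminary: Evolving Tree (ETree)--Finding BMU

[Figure: example tree with root $N_{1,0}$ (Layer 1); $N_{2,1}$ and $N_{3,1}$ (Layer 2); $N_{4,2}$, $N_{5,2}$, $N_{6,3}$ and $N_{7,3}$ (Layer 3); $N_{8,5}$ and $N_{9,5}$ (Layer 4). The layers indicate the tree depth, and one leaf node is marked as the BMU.]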
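
The BMU search can be sketched as a top-down walk: starting at the root, move to the closest child until a leaf is reached. This follows the usual description of the Evolving Tree in [19], but the code is only an illustrative sketch. The `Node` class, `find_bmu`, and the use of Euclidean distance between a node's prototype and the input are assumptions.

```python
import math


class Node:
    """Minimal tree node with a prototype vector (hypothetical helper)."""
    def __init__(self, weight, children=None):
        self.weight = weight
        self.children = children or []


def distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_bmu(root, x):
    # Descend from the root, always following the closest child,
    # and stop at the first node that has no children (a leaf).
    node = root
    while node.children:
        node = min(node.children, key=lambda c: distance(c.weight, x))
    return node


leaf_a = Node([0.0, 0.0])
leaf_b = Node([1.0, 1.0])
trunk = Node([0.5, 0.5], [leaf_a, leaf_b])
leaf_c = Node([5.0, 5.0])
root = Node([2.0, 2.0], [trunk, leaf_c])

bmu = find_bmu(root, [0.9, 0.8])   # descends root -> trunk -> leaf_b
```

Only the children of one node are compared at each level, which is why the search cost grows with the tree depth rather than with the total number of leaves.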


Preliminary: Evolving Tree (ETree)--Updating Leaf Nodes

• Leaf nodes are updated with the Kohonen learning rule.

• The update is scaled by a neighbourhood function.
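
For orientation, the Kohonen rule is commonly written in the SOM literature as follows. The symbols $w_l$ (prototype of leaf node $l$), $x(t)$ (input vector), $\alpha(t)$ (learning rate) and $\sigma(t)$ (neighbourhood width), as well as the Gaussian shape of $h$, are assumptions and not necessarily the notation used on the original slide; $d(n_{BMU}, n_{l,j})$ is the tree distance between the BMU and node $n_{l,j}$.

```latex
w_{l}(t+1) = w_{l}(t) + \alpha(t)\, h_{BMU,l}(t)\, \bigl[x(t) - w_{l}(t)\bigr],
\qquad
h_{BMU,l}(t) = \exp\!\left(-\frac{d(n_{BMU}, n_{l,j})^{2}}{2\,\sigma(t)^{2}}\right)
```

With this form, the BMU itself ($d = 0$) receives the full update, and leaf nodes further away in the tree are moved less.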

[Figure: the same example tree ($N_{1,0}$; $N_{2,1}$, $N_{3,1}$; $N_{4,2}$, $N_{5,2}$, $N_{6,3}$, $N_{7,3}$; $N_{8,5}$, $N_{9,5}$), illustrating the tree distance $d(n_{BMU}, n_{l,j})$ between the BMU and another node, counted in steps 1, 2, 3 along the tree.]
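
One plausible way to compute a tree distance such as $d(n_{BMU}, n_{l,j})$ is to count the edges on the path between the two nodes through their closest common ancestor. The sketch below rebuilds the example tree from the figure under that assumption. The helper names are hypothetical, and the counting convention used by the authors may differ (for example, by a constant offset).

```python
class Node:
    """Tree node that only knows its parent (hypothetical helper)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def path_to_root(node):
    # List of nodes from `node` up to the root, inclusive.
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def tree_distance(a, b):
    # Number of edges between a and b via their closest common ancestor.
    steps_from_a = {id(n): i for i, n in enumerate(path_to_root(a))}
    for steps_from_b, n in enumerate(path_to_root(b)):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + steps_from_b
    raise ValueError("nodes are not in the same tree")


# Example tree from the figure: N_{l,j} has index l and parent index j.
n1 = Node("N1,0")
n2, n3 = Node("N2,1", n1), Node("N3,1", n1)
n4, n5 = Node("N4,2", n2), Node("N5,2", n2)
n6, n7 = Node("N6,3", n3), Node("N7,3", n3)
n8, n9 = Node("N8,5", n5), Node("N9,5", n5)

d_siblings = tree_distance(n8, n9)   # 2 edges: n8 -> n5 -> n9
d_cousins = tree_distance(n8, n4)    # 3 edges: n8 -> n5 -> n2 -> n4
```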


A General Application framework for Evolving Systems [24]

[Figure: block diagram of the framework. Blocks: Initial Models (from batch off-line or former on-line training cycle); Internal Algorithm (Refine, Expand); Evolved Models; Pool of Evolved Models; New Data; Response (predictions, classifications, …) of models for new data; Feedback on Quality; Response; Operator. The blocks are connected by an Incremental Feedback Loop in which only new data is processed.]

[24] Lughofer, E.: Evolving Fuzzy Systems – Methodologies, Advanced Concepts and Applications. Ed.1, Springer (2011)


The Proposed Procedure

[Figure: flow diagram with the blocks "Fetching on-line article", "Updating terms of articles", "ETree", and "Refining trained model".]


The Proposed Procedure: Preprocessing Text

• A new article is fed and given a label.
• The abstract of the article is extracted.
• Stop words (119 words) are removed from the abstract.
• Numerals and symbols are also removed.
• A corpus is maintained, indexed by the article id.
• Each entry in the corpus has a term symbol.
• Each term is further associated with several attributes.
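
The preprocessing steps above can be sketched in a few lines. This is an illustrative sketch only: the function name `preprocess` is hypothetical, and the stop-word set is a tiny stand-in, since the 119-word list used by the authors is not given in the slides.

```python
import re

# Tiny illustrative stop-word set; NOT the 119-word list from the slides.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "are",
              "to", "in", "for", "on", "with"}


def preprocess(abstract):
    # Lower-case the text and keep alphabetic runs only,
    # which drops numerals and symbols in one pass.
    tokens = re.findall(r"[a-z]+", abstract.lower())
    # Remove stop words from what is left.
    return [t for t in tokens if t not in STOP_WORDS]


terms = preprocess("The 2 proposed methods are tested on 100% of the data!")
# terms == ["proposed", "methods", "tested", "data"]
```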


The Proposed Procedure: Term Weighting

• Term weighting is applied to the new article only.
• Inverse document frequency (idf) measures the importance of a word/term based on its occurrence in the corpus.
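
A common textbook definition of idf is $\log(N / df)$, where $N$ is the number of documents and $df$ is the number of documents containing the term. The sketch below uses that definition as an assumption; the slides do not state which idf variant the authors use, and the function name and toy corpus are hypothetical.

```python
import math


def idf(term, corpus):
    # corpus: list of documents, each given as a set of terms.
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0          # unseen term: no weight (a design choice of this sketch)
    return math.log(len(corpus) / df)


corpus = [
    {"fuzzy", "tree"},
    {"tree", "cluster"},
    {"neural", "tree"},
    {"fuzzy", "control"},
]

idf_tree = idf("tree", corpus)      # in 3 of 4 documents: low weight
idf_fuzzy = idf("fuzzy", corpus)    # in 2 of 4 documents
idf_neural = idf("neural", corpus)  # in 1 of 4 documents: high weight
```

Terms that occur in fewer documents receive a larger weight, so they contribute more when articles are compared.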


The Proposed Procedure: Similarity Match Histogram

• Training vectors are formed through binary descriptions of the new article's terms.

• The dimension of the training vector is then normalized to that of the trained article.


The Proposed Procedure: Similarity Match Histogram

• Compute the Euclidean distance between the two vectors.

• Compute the overall distance from these distances.

• Find the BMU.
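
The vector construction and distance step can be illustrated as follows. This is a sketch under assumptions, not the authors' exact similarity match histogram: the shared vocabulary, the helper names, and the choice of the closest vector as the best match are all illustrative.

```python
import math


def binary_vector(terms, vocabulary):
    # 1 if the vocabulary term occurs in the article, 0 otherwise.
    present = set(terms)
    return [1 if t in present else 0 for t in vocabulary]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Using one shared vocabulary gives every vector the same dimension.
vocabulary = ["cluster", "fuzzy", "neural", "tree"]

new_vec = binary_vector(["tree", "cluster"], vocabulary)          # [1, 0, 0, 1]
trained = {
    "A": binary_vector(["tree", "cluster", "fuzzy"], vocabulary),  # [1, 1, 0, 1]
    "B": binary_vector(["neural"], vocabulary),                    # [0, 0, 1, 0]
}

distances = {name: euclidean(new_vec, vec) for name, vec in trained.items()}
closest = min(distances, key=distances.get)   # smallest distance is the best match
```

Here the new article differs from "A" in one term and from "B" in three, so "A" is the closer match.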


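The Proposed Procedure: Expanding Tree

• If the BMU hit counter of $N_{BMU}$ equals $b_{splitting}$, then $N_{BMU}$ is split into $n_{childnode}$ children nodes.

[Figure: $N_{BMU}$ split into children nodes, with $n_{childnode} = 2$.]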
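
The splitting rule can be sketched as a hit counter that triggers a split when it reaches the threshold. The names `Node` and `record_hit` are hypothetical, and initialising the children as copies of the parent's prototype is an assumption of this sketch; the default values 10 and 2 are the ones shown in the slides.

```python
class Node:
    """Minimal node with a prototype and a BMU hit counter (hypothetical)."""
    def __init__(self, weight, parent=None):
        self.weight = list(weight)
        self.parent = parent
        self.children = []
        self.hits = 0


def record_hit(node, b_splitting=10, n_childnode=2):
    # Count one BMU hit; split the node when the counter reaches the threshold.
    node.hits += 1
    if node.hits == b_splitting and not node.children:
        # Assumption: each child starts as a copy of the parent's prototype.
        node.children = [Node(node.weight, parent=node)
                         for _ in range(n_childnode)]
        return True
    return False


leaf = Node([0.2, 0.7])
splits = [record_hit(leaf) for _ in range(10)]
# The first nine hits do nothing; the tenth hit splits the node in two.
```

After the split, the former leaf acts as a trunk node and its children become the new leaf nodes.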


Experimental results: Observation

• and are more similar to each other, as compared to .

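[Figure: two trees. The first has root $N_{1,0}$ with children $N_{2,1}$ and $N_{3,1}$. The second has root $N_{1,0}$ with children $N_{2,1}$ and $N_{3,1}$, where $N_{2,1}$ has children $N_{4,2}$ and $N_{5,2}$.]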


Experimental results: Observation

[Figure: the resulting tree, with a legend marking the root node, trunk nodes, and leaf nodes.]


Experimental results: Complexity Control

[Figure: plot of Time (s) against the label for articles, with $b_{splitting} = 10$.]


Experimental results: Tree structures with different $b_{splitting}$

$b_{splitting}$   Number of Clusters   Tree size   Tree depth
10                14                   27          8
15                5                    9           4
20                3                    5           2


Concluding Remarks

• With the proposed approach, articles from ENCON 2008 could be clustered and visualized as a tree structure.

• In short, the proposed approach constitutes a new decision support tool for conference organizers.

• The proposed procedure could also be applied to a larger number of articles, with an expected increase in computational complexity.


Future Work

• An ETree with dynamic settings will be developed.

• Other potential applications (e.g., image and signal processing) of ETree will be further investigated.


Thank you for your attention