
Text clustering using frequent itemsets


Page 1: Text clustering using frequent itemsets


國立雲林科技大學 National Yunlin University of Science and Technology

Text clustering using frequent itemsets

Wen Zhang, Taketoshi Yoshida, Xijin Tang, Qing Wang. Knowledge-Based Systems (KBS), 2010

Presented by Wen-Chung Liao, 2010/08/25

Page 2: Text clustering using frequent itemsets

Outlines

Motivation
Objectives
Methodology
Experiments
Conclusions
Comments

Page 3: Text clustering using frequent itemsets

Motivation

The motivation for adopting frequent itemsets in document clustering:
─ the demand for dimensionality reduction in document representation;
─ the comprehensibility of clustering results.

Our motivation for proposing Maximum Capturing (MC) for document clustering is to produce natural and comprehensible document clusters.

Page 4: Text clustering using frequent itemsets

Objectives

Propose Maximum Capturing (MC) for text clustering using frequent itemsets.

MC can be divided into two components: constructing document clusters and assigning document topics.
─ Minimum spanning tree algorithm: constructs document clusters with three types of similarity measures.
─ Topic assigning algorithm: the most frequent itemsets in a document cluster are assigned as the topic of the cluster.
─ Frequency sensitive competitive learning: normalizes clusters into a predefined number if necessary.

Page 5: Text clustering using frequent itemsets

Frequent Itemsets

Page 6: Text clustering using frequent itemsets

Methodology

In MC, a document D is denoted by the frequent itemsets it contains: Dj = {Fj1, . . . , Fjn}, where n is the total number of frequent itemsets in Dj.

sim[i][j], the similarity between Di and Dj, is measured in one of three ways:
(1) the number of frequent itemsets the two documents have in common;
(2) the total weight of the frequent itemsets they have in common;
(3) the asymmetrical binary similarity between their frequent itemsets.

A similarity matrix A is built, with a[i][j] = sim[i][j] for i < j.

Example, with Minsup = 3/9:
D2 = {b, d, e} → frequent itemsets { {b}, {d}, {e}, {b, e}, {d, e} }
D4 = {b, e} → frequent itemsets { {b}, {e}, {b, e} }
(1) sim[2][4] = 3 (three common frequent itemsets: {b}, {e}, {b, e})
(2) sim[2][4] = 5 + 6 + 3 = 14 (sum of the weights of the common itemsets)
(3) ?
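
A minimal Python sketch of the three measures, replaying the D2/D4 example. The weights 5, 6 and 3 for {b}, {e} and {b, e} are taken from the example's arithmetic; and because the slide leaves measure (3) open, the sketch reads the asymmetrical binary similarity as a Jaccard-style coefficient that ignores negative matches, which is an assumption rather than the paper's definition.

```python
def sim_count(fi, fj):
    # Measure (1): number of frequent itemsets the two documents share.
    return len(fi & fj)

def sim_weight(fi, fj, weight):
    # Measure (2): total weight of the shared frequent itemsets.
    return sum(weight[s] for s in fi & fj)

def sim_asym_binary(fi, fj):
    # Measure (3): read here as a Jaccard-style coefficient that ignores
    # negative matches -- an assumption; the slide leaves (3) unanswered.
    union = fi | fj
    return len(fi & fj) / len(union) if union else 0.0

# Documents as sets of frequent itemsets (frozensets, so they are hashable).
D2 = {frozenset('b'), frozenset('d'), frozenset('e'),
      frozenset('be'), frozenset('de')}
D4 = {frozenset('b'), frozenset('e'), frozenset('be')}

# Hypothetical weights (supports) matching the slide's 5 + 6 + 3 = 14.
w = {frozenset('b'): 5, frozenset('e'): 6, frozenset('be'): 3}

print(sim_count(D2, D4))        # 3
print(sim_weight(D2, D4, w))    # 14
print(sim_asym_binary(D2, D4))  # 0.6
```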

Page 7: Text clustering using frequent itemsets

The procedure of MC in constructing document clusters

Step 1: Construct the similarity matrix A.

Step 2: Find the minimum non-zero value in A.

Step 3: Find the maximum value in A, and find all document pairs of unclustered documents with the maximum value (i.e. at least one of the two documents in the pair has not been assigned to any cluster).

The basic idea of MC: if a document pair has the largest similarity among all document pairs, then the two documents of that pair should be assigned to the same cluster.

Page 8: Text clustering using frequent itemsets

The procedure of MC in constructing document clusters (cont.)

Step 4: If the maximum value equals the minimum value, all documents in the corresponding document pairs that are not yet attached to any cluster are used to construct a new cluster. If the maximum value does not equal the minimum value, then for all found document pairs with the maximum value the following process is conducted:
First, for each document pair, if its two documents do not belong to any cluster, they are grouped together to construct a document cluster.
Next, for each document pair, if one of its two documents belongs to an existing cluster, the other, unattached document is attached to that cluster.
Finally, the similarities of the found document pairs are set to zero in A. Go to Step 3.

Step 5: If any documents remain unattached to any cluster, each of them is used to construct a new cluster.
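
The steps above can be condensed into a short sketch, assuming `sim` is the precomputed document-similarity matrix from one of the three measures. One liberty is taken so the loop terminates: every entry equal to the current maximum is zeroed each round, including pairs whose two documents are both already clustered, a case the slide does not spell out.

```python
import numpy as np

def maximum_capturing(sim):
    """Sketch of MC cluster construction; `sim` is an n x n symmetric
    document-similarity matrix (an assumed input, see above)."""
    n = len(sim)
    # Step 1: keep only a[i][j] = sim[i][j] for i < j.
    A = np.triu(np.asarray(sim, dtype=float), k=1)
    # Step 2: minimum non-zero value in A.
    positive = A[A > 0]
    min_val = positive.min() if positive.size else 0.0
    label, clusters = [None] * n, []

    while True:
        # Step 3: maximum value and its pairs with an unclustered document.
        max_val = A.max()
        if max_val == 0:
            break
        pairs = [(int(i), int(j)) for i, j in zip(*np.where(A == max_val))
                 if label[i] is None or label[j] is None]
        if max_val == min_val:
            # Step 4, first case: one new cluster from all unattached
            # documents in the found pairs.
            fresh = list(dict.fromkeys(d for p in pairs for d in p
                                       if label[d] is None))
            if fresh:
                clusters.append(fresh)
                for d in fresh:
                    label[d] = len(clusters) - 1
        else:
            # Step 4, second case: pair-by-pair merging.
            for i, j in pairs:
                if label[i] is None and label[j] is None:
                    clusters.append([i, j])
                    label[i] = label[j] = len(clusters) - 1
                elif label[i] is None:
                    clusters[label[j]].append(i); label[i] = label[j]
                elif label[j] is None:
                    clusters[label[i]].append(j); label[j] = label[i]
        # Zero out every entry at max_val, then repeat Step 3.
        A[A == max_val] = 0.0
    # Step 5: each remaining unclustered document forms its own cluster.
    clusters += [[d] for d in range(n) if label[d] is None]
    return clusters
```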

Page 9: Text clustering using frequent itemsets


Page 10: Text clustering using frequent itemsets

The detailed procedure of MC in assigning document topics

Step 1: Select the most frequent itemsets with maximum length for each cluster as its candidate topics.

Step 2: Conduct reduction of the candidate topics for each cluster using following two processes.

Step 2.1: Assign the frequent itemset(s) with maximum length as the topic(s) for each cluster.

Step 2.2: If a frequent itemset in a cluster's candidate topics from Step 1 is contained in the topics assigned to other clusters in Step 2.1, it is eliminated from that cluster's candidate topics.
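
A sketch of this assignment, assuming each cluster is summarized as a map from frequent itemset to its frequency within the cluster (a hypothetical input format). The slide does not spell out how frequency and length trade off, so this version takes the most frequent itemsets as candidates and the longest of those as the assigned topics; it implements Step 2.2 literally, leaving edge cases (a topic shared across clusters) as on the slide.

```python
def assign_topics(cluster_itemsets):
    """cluster_itemsets: {cluster_id: {itemset (frozenset): frequency}},
    a hypothetical summary of each cluster's frequent itemsets."""
    # Step 1: candidate topics = the most frequent itemsets of each cluster.
    candidates = {}
    for cid, freq in cluster_itemsets.items():
        best = max(freq.values())
        candidates[cid] = [s for s, f in freq.items() if f == best]
    # Step 2.1: assign the maximum-length candidates as the cluster's topics.
    assigned = {}
    for cid, cands in candidates.items():
        m = max(len(s) for s in cands)
        assigned[cid] = {s for s in cands if len(s) == m}
    # Step 2.2: eliminate a candidate that appears among the topics
    # assigned to other clusters in Step 2.1.
    topics = {}
    for cid, cands in candidates.items():
        others = set().union(*(t for o, t in assigned.items() if o != cid))
        topics[cid] = [s for s in cands if s not in others]
    return topics

out = assign_topics({
    0: {frozenset('be'): 4, frozenset('b'): 2},
    1: {frozenset('de'): 3},
})
# -> {0: [frozenset({'b', 'e'})], 1: [frozenset({'d', 'e'})]}
```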

Page 11: Text clustering using frequent itemsets

The working process of MC-1 for constructing topics of clusters in Fig. 3

Page 12: Text clustering using frequent itemsets

The normalization process of MC for merging clusters

Step 1: The learning rates of all original clusters produced by Maximum Capturing are initialized to zero, and all original clusters are added into a cluster set.

Step 2: For the clusters with the minimum learning rate in the cluster set, the maximum similarity over cluster pairs is calculated. If the maximum similarity equals zero, the normalization process terminates. Otherwise, a cluster (cluster 1) with the minimum learning rate is randomly selected from the cluster set.

Step 3: Another cluster (cluster 2) that has the maximum similarity with cluster 1 is found in the cluster set. If this maximum similarity equals 0, go to Step 2.

Step 4: Cluster 1 and cluster 2 are merged into a new cluster (cluster 3). The learning rate of cluster 3 is set to the sum of the learning rates of clusters 1 and 2, plus 1.

Step 5: Clusters 1 and 2 are removed from the cluster set and cluster 3 is appended. If the size of the cluster set equals the predefined number, the process terminates; otherwise, go to Step 2.

(Figure: initial clusters C1, C2, C3 with learning rates α1 = α2 = α3 = 0.)
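
A sketch of this merging loop, assuming `cluster_sim(a, b)` is some similarity between two clusters (for instance, one of the measures above applied to the clusters' frequent itemsets) and `k` is the predefined number of clusters; both are stand-ins, not the paper's exact interfaces.

```python
import random

def normalize_clusters(clusters, k, cluster_sim):
    """Sketch of the normalization process (Steps 1-5). `clusters` is a
    list of document-id lists; `cluster_sim` and `k` are assumed inputs."""
    # Step 1: every original cluster starts with learning rate zero.
    rates = [0] * len(clusters)
    while len(clusters) > k:
        # Step 2: among clusters with the minimum learning rate, check the
        # best available merge; terminate if no pair has positive similarity.
        r_min = min(rates)
        min_idx = [i for i, r in enumerate(rates) if r == r_min]
        best = max(cluster_sim(clusters[i], clusters[j])
                   for i in min_idx for j in range(len(clusters)) if j != i)
        if best == 0:
            break
        c1 = random.choice(min_idx)
        # Step 3: the cluster most similar to cluster 1.
        sims = [cluster_sim(clusters[c1], clusters[j]) if j != c1 else -1.0
                for j in range(len(clusters))]
        c2 = max(range(len(clusters)), key=sims.__getitem__)
        if sims[c2] == 0:
            continue  # back to Step 2 with a fresh random choice
        # Step 4: merge into cluster 3; its learning rate is the sum plus 1.
        merged = clusters[c1] + clusters[c2]
        merged_rate = rates[c1] + rates[c2] + 1
        # Step 5: replace clusters 1 and 2 with cluster 3.
        keep = [i for i in range(len(clusters)) if i not in (c1, c2)]
        clusters = [clusters[i] for i in keep] + [merged]
        rates = [rates[i] for i in keep] + [merged_rate]
    return clusters
```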

Page 13: Text clustering using frequent itemsets

Experimental evaluations

Benchmark data sets
─ Two text representations: individual words vs. multiword expressions.
─ English corpus: the Reuters-21578 document collection (135 categories); six categories are selected, shown in Table 7. After preprocessing, 2485 unique individual words and 7927 multiword expressions are obtained.
─ Chinese corpus: TanCorpV1.0 (14,150 documents in 20 categories); four categories are assigned, shown in Table 8. After preprocessing, 281,111 individual words and 9074 multiword expressions are obtained.

Page 14: Text clustering using frequent itemsets


Evaluation methods

Frequent pattern extraction
• frequent itemsets: FP-tree algorithm [15]
• frequent sequences: BIDE algorithm [18]
• maximal frequent sequences: the algorithm in [19]
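
As an illustration of the frequent-itemset step, the FP-growth algorithm (built on the FP-tree structure cited as [15]) is available in the mlxtend package; the toy one-hot document-term table below is an assumption for demonstration, not the paper's data or code.

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth  # pip install mlxtend

# Toy corpus: each document is reduced to the set of terms it contains.
docs = [{'a', 'b', 'e'}, {'b', 'd', 'e'}, {'a', 'b', 'e'}, {'b', 'e'}]
terms = sorted(set().union(*docs))
onehot = pd.DataFrame([[t in d for t in terms] for d in docs], columns=terms)

# Mine frequent itemsets; min_support mirrors the slide's Minsup = 3/9.
print(fpgrowth(onehot, min_support=3/9, use_colnames=True))
```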

Page 15: Text clustering using frequent itemsets


Page 16: Text clustering using frequent itemsets


Page 17: Text clustering using frequent itemsets

Conclusions

MC is proposed to produce document clusters using frequent itemsets.
─ It includes two procedures: constructing document clusters and assigning cluster topics.

A normalization process for Maximum Capturing is also proposed to merge clusters into a predefined number.

The experimental results show that in document clustering, MC has better performance than other methods.

Page 18: Text clustering using frequent itemsets

Comments

Advantages
─ Documents denoted by frequent itemsets
─ Similarity measures

Shortcomings
─ Some mistakes

Applications
─ Transactional data clustering