
E.G.M. Petrakis Text Clustering 1

Clustering

“Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters)” [ACM CS’99]

Instances within a cluster are very similar

Instances in different clusters are very different


E.G.M. Petrakis Text Clustering 2

Example

[Figure: scatter plot of data points along the axes term1 and term2, forming groups of nearby points]


E.G.M. Petrakis Text Clustering 3

Applications

Faster retrieval
Faster and better browsing
Structuring of search results
Revealing classes and other data regularities
Directory construction
Better data organization in general


E.G.M. Petrakis Text Clustering 4

Cluster Searching

Similar instances tend to be relevant to the same requests

The query is mapped to the closest cluster by comparison with the cluster-centroids


E.G.M. Petrakis Text Clustering 5

Notation

N: number of elements
Class: real-world grouping (ground truth)
Cluster: grouping produced by the algorithm
The ideal clustering algorithm will produce clusters equivalent to the real-world classes, with exactly the same members


E.G.M. Petrakis Text Clustering 6

Problems

How many clusters?
Complexity? N is usually large
Quality of clustering: when is one method better than another?
Overlapping clusters
Sensitivity to outliers


E.G.M. Petrakis Text Clustering 7

Example

[Figure: scatter plot of data points]


E.G.M. Petrakis Text Clustering 8

Clustering Approaches

Divisive: build clusters "top-down", starting from the entire data set
  K-means, Bisecting K-means
  Hierarchical or flat clustering
Agglomerative: build clusters "bottom-up", starting with individual instances and iteratively combining them to form larger clusters at higher levels
  Hierarchical clustering
Combinations of the above
  Buckshot algorithm


E.G.M. Petrakis Text Clustering 9

Hierarchical – Flat Clustering

Flat: all clusters at the same level
  K-means, Buckshot
Hierarchical: a nested sequence of clusters
  A single cluster with all the data at the top and singleton clusters at the bottom
  Intermediate levels are more useful
  Every intermediate level combines two clusters from the next lower level
  Agglomerative, Bisecting K-means


E.G.M. Petrakis Text Clustering 10

Flat Clustering

[Figure: scatter plot of data points partitioned into flat clusters]


E.G.M. Petrakis Text Clustering 11

Hierarchical Clustering

[Figure: scatter plot of data points labeled 1–7 and the corresponding dendrogram of nested clusters]


E.G.M. Petrakis Text Clustering 12

Text Clustering

Finds overall similarities among documents or groups of documents
  Faster searching, browsing, etc.
Needs to know how to compute the similarity (or, equivalently, the distance) between documents


E.G.M. Petrakis Text Clustering 13

Query – Document Similarity

Similarity is defined as the cosine of the angle between the document and query vectors:

$$\mathrm{Sim}(d_1, d_2) = \frac{\vec{d_1}\cdot\vec{d_2}}{|\vec{d_1}|\,|\vec{d_2}|} = \frac{\sum_{i=1}^{M} w_{i1}\, w_{i2}}{\sqrt{\sum_{i=1}^{M} w_{i1}^2}\;\sqrt{\sum_{i=1}^{M} w_{i2}^2}}$$

[Figure: the angle θ between document vectors d1 and d2]
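As a concrete illustration, a minimal Python sketch of the cosine similarity on term-weight vectors; the vectors and names are illustrative, not taken from the slides.

```python
import numpy as np

def cosine_sim(d1, d2):
    """Cosine of the angle between two term-weight vectors."""
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

# toy term-weight vectors over an M-term vocabulary
d1 = np.array([0.5, 0.8, 0.0, 0.3])
d2 = np.array([0.4, 0.7, 0.1, 0.0])
print(cosine_sim(d1, d2))
```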


E.G.M. Petrakis Text Clustering 14

Document Distance

Consider documents d1, d2 with vectors u1, u2

Their distance is defined as the length AB

$$\mathrm{distance}(d_1, d_2) = 2\sin(\theta/2) = \sqrt{2\,(1 - \cos\theta)} = \sqrt{2\,(1 - \mathrm{Sim}(d_1, d_2))}$$
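A minimal sketch of this distance on two term-weight vectors (values are illustrative):

```python
import numpy as np

def distance(d1, d2):
    """Chord length between two document vectors: sqrt(2 * (1 - Sim(d1, d2)))."""
    sim = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return np.sqrt(2.0 * (1.0 - sim))

print(distance(np.array([0.5, 0.8, 0.0]), np.array([0.4, 0.7, 0.1])))
```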


E.G.M. Petrakis Text Clustering 15

Normalization by Document Length

The longer the document is, the more likely it is for a given term to appear in it

Normalize the term weights by document length, so that terms in long documents are not given more weight:

$$w'_{ij} = \frac{w_{ij}}{\sqrt{\sum_{k=1}^{M} w_{kj}^2}}$$
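A minimal sketch of the normalization, assuming the term weights are stored with one document per column (matching the w_ij notation); the array contents are illustrative:

```python
import numpy as np

def length_normalize(W):
    """Divide each column (one document's term weights) by its Euclidean length."""
    lengths = np.sqrt((W ** 2).sum(axis=0))   # one length per document
    return W / lengths

W = np.array([[2.0, 0.0],
              [1.0, 3.0],
              [2.0, 4.0]])                    # 3 terms x 2 documents
print(length_normalize(W))
```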


E.G.M. Petrakis Text Clustering 16

Evaluation of Cluster Quality

Clusters can be evaluated using internal or external knowledge

Internal measures: intra-cluster cohesion and cluster separability
  Intra-cluster similarity
  Inter-cluster similarity

External measures: quality of the clusters compared to the real classes
  Entropy (E), Harmonic Mean (F)


E.G.M. Petrakis Text Clustering 17

Intra Cluster Similarity

A measure of cluster cohesion
Defined as the average pairwise similarity of the documents in a cluster:

$$\frac{1}{|S|^2}\sum_{d \in S}\sum_{d' \in S} \mathrm{sim}(d, d') = \frac{1}{|S|^2}\sum_{d \in S}\sum_{d' \in S} \vec{d}\cdot\vec{d'} = \vec{c}\cdot\vec{c}$$

where $\vec{c} = \frac{1}{|S|}\sum_{d \in S}\vec{d}$ is the cluster centroid

Documents (not centroids) have unit length
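A minimal sketch of this identity, assuming unit-length document vectors stored one per row of a NumPy array; both computations below return the same value:

```python
import numpy as np

def intra_cluster_similarity(S):
    """Average pairwise similarity of unit-length rows of S, computed as c . c."""
    c = S.mean(axis=0)                              # cluster centroid
    return float(np.dot(c, c))

def intra_cluster_similarity_naive(S):
    """Same quantity computed directly: (1/|S|^2) * sum over all pairs of d . d'."""
    return float((S @ S.T).sum() / (len(S) ** 2))
```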


E.G.M. Petrakis Text Clustering 18

Inter Cluster Similarity

a) Single link: similarity of the two most similar members

$$\mathrm{sim}(S, S') = \max_{c_i \in S,\; c_j \in S'} \mathrm{sim}(c_i, c_j)$$

b) Complete link: similarity of the two least similar members

$$\mathrm{sim}(S, S') = \min_{c_i \in S,\; c_j \in S'} \mathrm{sim}(c_i, c_j)$$

c) Group average: average similarity between members

$$\mathrm{sim}(S, S') = \frac{1}{|S|\,|S'|}\sum_{d \in S}\sum_{d' \in S'} \mathrm{sim}(d, d')$$
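A minimal sketch of the three measures over unit-length document vectors (one per row); function names are illustrative:

```python
import numpy as np

def pairwise_sims(S, S_prime):
    """Matrix of similarities between unit-length rows of S and S_prime."""
    return S @ S_prime.T

def single_link(S, S_prime):
    return pairwise_sims(S, S_prime).max()      # two most similar members

def complete_link(S, S_prime):
    return pairwise_sims(S, S_prime).min()      # two least similar members

def group_average(S, S_prime):
    return pairwise_sims(S, S_prime).mean()     # average over all member pairs
```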


E.G.M. Petrakis Text Clustering 19

Example

[Figure: two clusters S and S' with centroids c and c', indicating the member pairs used by single link, complete link, and group average]


E.G.M. Petrakis Text Clustering 20

Entropy

Measures the quality of flat clusters using external knowledge
  Pre-existing classification
  Assessment by experts

P_ij: the probability that a member of cluster j belongs to class i

The entropy of cluster j is defined as $E_j = -\sum_i P_{ij} \log P_{ij}$


E.G.M. Petrakis Text Clustering 21

Entropy (con’t)

Total entropy over all clusters:

$$E = \sum_{j=1}^{m} \frac{n_j}{N} E_j$$

where n_j is the size of cluster j, m is the number of clusters, and N is the number of instances
The smaller the value of E, the better the quality of the clustering
The best entropy is obtained when each cluster contains exactly one instance
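A minimal sketch of both entropy formulas, assuming each cluster is represented by the list of ground-truth class labels of its members; the labels are illustrative:

```python
import math
from collections import Counter

def cluster_entropy(labels_in_cluster):
    """E_j = -sum_i P_ij * log(P_ij), with P_ij estimated from the class counts."""
    counts = Counter(labels_in_cluster)
    n_j = len(labels_in_cluster)
    return -sum((c / n_j) * math.log(c / n_j) for c in counts.values())

def total_entropy(clusters):
    """E = sum_j (n_j / N) * E_j over a list of clusters (each a list of class labels)."""
    N = sum(len(c) for c in clusters)
    return sum(len(c) / N * cluster_entropy(c) for c in clusters)

print(total_entropy([["sports", "sports", "politics"], ["politics"]]))
```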


E.G.M. Petrakis Text Clustering 22

Harmonic Mean (F)

Treats each cluster as a query result
F combines precision (P) and recall (R)
F_ij for class i and cluster j is defined as

$$F_{ij} = \frac{2}{\frac{1}{P_{ij}} + \frac{1}{R_{ij}}}, \quad \text{where } P_{ij} = \frac{n_{ij}}{n_j}, \; R_{ij} = \frac{n_{ij}}{n_i}$$

n_ij: number of instances of class i in cluster j
n_i: number of instances of class i
n_j: number of instances in cluster j


E.G.M. Petrakis Text Clustering 23

Harmonic Mean (con’t)

The F value of any class i is the maximum value it achieves over all j

Fi = maxj Fij

The F value of a clustering solution is computed as the weighted average over all classes:

$$F = \sum_{i} \frac{n_i}{N} F_i$$

where N is the number of data instances
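A minimal sketch of the full F computation under the same assumption as before (each cluster given as a list of ground-truth class labels):

```python
from collections import Counter

def clustering_f_measure(clusters):
    """clusters: list of clusters, each a list of ground-truth class labels."""
    N = sum(len(c) for c in clusters)
    class_sizes = Counter(label for c in clusters for label in c)    # n_i
    F = 0.0
    for label, n_i in class_sizes.items():
        best = 0.0
        for cluster in clusters:                                     # cluster j
            n_ij = cluster.count(label)
            if n_ij == 0:
                continue
            P = n_ij / len(cluster)
            R = n_ij / n_i
            best = max(best, 2 / (1 / P + 1 / R))                    # F_ij; keep the max over j
        F += (n_i / N) * best                                        # weighted by class size
    return F

print(clustering_f_measure([["sports", "sports", "politics"], ["politics"]]))
```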


E.G.M. Petrakis Text Clustering 24

Quality of Clustering

A good clustering method
  Maximizes intra-cluster similarity
  Minimizes inter-cluster similarity
  Minimizes Entropy
  Maximizes the Harmonic Mean
Difficult to achieve all of these simultaneously
  Maximize some objective function of the above
An algorithm is better than another if it has better values on most of these measures


E.G.M. Petrakis Text Clustering 25

K-means Algorithm

Select K centroids
Repeat I times or until the centroids do not change:
  Assign each instance to the cluster represented by its nearest centroid
  Compute new centroids
  Reassign instances, compute new centroids, ...


21/04/23 Nikos Hourdakis, MSc Thesis 26

K-Means demo (1/7): http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html


21/04/23 Nikos Hourdakis, MSc Thesis 27

K-Means demo (2/7)


21/04/23 Nikos Hourdakis, MSc Thesis 28

K-Means demo (3/7)


21/04/23 Nikos Hourdakis, MSc Thesis 29

K-Means demo (4/7)


21/04/23 Nikos Hourdakis, MSc Thesis 30

K-Means demo (5/7)


21/04/23 Nikos Hourdakis, MSc Thesis 31

K-Means demo (6/7)


21/04/23 Nikos Hourdakis, MSc Thesis 32

K-Means demo (7/7)


E.G.M. Petrakis Text Clustering 33

Comments on K-Means (1)

Generates a flat partition of K clusters
K is the desired number of clusters and must be known in advance
Starts with K random cluster centroids
A centroid is the mean or the median of a group of instances
  The mean rarely corresponds to a real instance


E.G.M. Petrakis Text Clustering 34

Comments on K-Means (2)

Up to I = 10 iterations
Keep the clustering with the best inter/intra-cluster similarity, or the final clusters after I iterations
Complexity: O(IKN)
A repeated application of K-Means for K = 2, 4, ... can produce a hierarchical clustering


E.G.M. Petrakis Text Clustering 35

Choosing Centroids for K-means

Quality of clustering depends on the selection of initial centroids

Random selection may result in poor convergence rate, or convergence to sub-optimal clusterings.

Select good initial centroids using a heuristic or the results of another method
  Buckshot algorithm


E.G.M. Petrakis Text Clustering 36

Incremental K-Means

Update each centroid immediately after each point is assigned to a cluster, rather than at the end of each iteration

Reassign instances to clusters at the end of each iteration

Converges faster than simple K-means
  Usually 2-5 iterations
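A minimal sketch of the incremental centroid update (a running mean adjusted as each point is assigned); treating each initial centroid as if it already had one member is a simplifying assumption:

```python
import numpy as np

def incremental_pass(X, centroids):
    """One pass over the data, moving each point's centroid immediately."""
    counts = np.ones(len(centroids))          # assumed effective cluster sizes
    assign = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        k = (centroids @ x).argmax()          # most similar centroid
        assign[i] = k
        counts[k] += 1
        centroids[k] += (x - centroids[k]) / counts[k]   # running-mean update
    return assign, centroids
```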


E.G.M. Petrakis Text Clustering 37

Bisecting K-Means

Starts with a single cluster with all instances

Select a cluster to split: the larger cluster, or the cluster with the lower intra-cluster similarity

The selected cluster is split into 2 partitions using K-means (K=2)

Repeat up to the desired depth h
  Hierarchical clustering
  Complexity: O(2hN)
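A minimal sketch of Bisecting K-means, reusing the kmeans function sketched earlier and splitting the largest cluster at each step (one of the two selection criteria above):

```python
import numpy as np

def bisecting_kmeans(X, depth):
    """Repeatedly split the largest cluster with 2-means; returns a list of index arrays."""
    clusters = [np.arange(len(X))]            # start with one cluster holding everything
    for _ in range(depth):
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        if len(idx) < 2:                      # nothing left to split
            clusters.append(idx)
            break
        assign, _ = kmeans(X[idx], K=2)       # kmeans as sketched above
        clusters.append(idx[assign == 0])
        clusters.append(idx[assign == 1])
    return clusters
```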


E.G.M. Petrakis Text Clustering 38

Agglomerative Clustering

Compute the similarity matrix between all pairs of instances

Start from singleton clusters
Repeat until a single cluster remains:
  Merge the two most similar clusters
  Replace them with a single cluster
  Replace the merged clusters in the matrix and update the similarity matrix

Complexity: O(N²)
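A minimal, deliberately simple sketch of the agglomerative loop, using group-average similarity between clusters of unit-length document vectors:

```python
import numpy as np

def agglomerative(X, target_clusters=1):
    """Merge the two most similar clusters until target_clusters remain."""
    clusters = [[i] for i in range(len(X))]            # singleton clusters
    sims = X @ X.T                                     # document similarity matrix
    while len(clusters) > target_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sims[np.ix_(clusters[a], clusters[b])].mean()   # group average
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]        # merge the most similar pair
        del clusters[b]
    return clusters
```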


E.G.M. Petrakis Text Clustering 39

Similarity Matrix

        C1=d1   C2=d2   …   CN=dN
C1=d1   1       0.8     …   0.3
C2=d2   0.8     1       …   0.6
…       …       …       1   …
CN=dN   0.3     0.6     …   1


E.G.M. Petrakis Text Clustering 40

Update Similarity Matrix

        C1=d1   C2=d2   …   CN=dN
C1=d1   1       0.8     …   0.3     (merged)
C2=d2   0.8     1       …   0.6     (merged)
…       …       …       1   …
CN=dN   0.3     0.6     …   1


E.G.M. Petrakis Text Clustering 41

New Similarity Matrix

              C12={d1,d2}   …   CN=dN
C12={d1,d2}   1             …   0.4
…             …             1   …
CN=dN         0.4           …   1


E.G.M. Petrakis Text Clustering 42

Single Link

Selecting the most similar clusters for merging using single link

Can result in long and thin clusters due to the "chaining effect"
  Appropriate in some domains, such as clustering islands

$$\mathrm{sim}(S, S') = \max_{c_i \in S,\; c_j \in S'} \mathrm{sim}(c_i, c_j)$$


E.G.M. Petrakis Text Clustering 43

Complete Link

Selecting the most similar clusters for merging using complete link

Results in compact, spherical clusters that are preferable

$$\mathrm{sim}(S, S') = \min_{c_i \in S,\; c_j \in S'} \mathrm{sim}(c_i, c_j)$$


E.G.M. Petrakis Text Clustering 44

Group Average

Selecting the most similar clusters for merging using group average

Fast compromise between single and complete link

$$\mathrm{sim}(S, S') = \frac{1}{|S|\,|S'|}\sum_{d \in S}\sum_{d' \in S'} \mathrm{sim}(d, d') = \vec{c}\cdot\vec{c'}$$

where $\vec{c}$, $\vec{c'}$ are the centroids of S and S'


E.G.M. Petrakis Text Clustering 45

Example

[Figure: two clusters A and B with centroids c1 and c2, indicating the single link, complete link, and group average distances]


E.G.M. Petrakis Text Clustering 46

Inter Cluster Similarity

A new cluster is represented by its centroid:

$$\vec{c} = \frac{1}{|S|}\sum_{d \in S}\vec{d}$$

The document-to-cluster similarity is computed as

$$\mathrm{sim}(d, c) = \vec{d}\cdot\vec{c}$$

The cluster-to-cluster similarity can be computed as single, complete, or group average similarity


E.G.M. Petrakis Text Clustering 47

Buckshot K-Means

Combines Agglomerative and K-Means
Agglomerative results in a good clustering solution but has O(N²) complexity
Randomly select a sample of √N instances
Apply Agglomerative on the sample, which takes O(N) time
Take the centroids of the resulting clusters as input to K-Means
Overall complexity is O(N)
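A minimal Buckshot sketch, reusing the agglomerative function sketched earlier; the √N sample size and the centroid seeding follow the slide, everything else (names, iteration count) is illustrative:

```python
import numpy as np

def buckshot(X, K, I=10):
    """Seed K-means with centroids from agglomerative clustering of a sqrt(N) sample."""
    rng = np.random.default_rng(0)
    sample = X[rng.choice(len(X), int(np.sqrt(len(X))), replace=False)]
    seed_clusters = agglomerative(sample, target_clusters=K)        # as sketched above
    centroids = np.array([sample[idx].mean(axis=0) for idx in seed_clusters])
    for _ in range(I):                                              # standard K-means iterations
        assign = (X @ centroids.T).argmax(axis=1)
        for k in range(len(centroids)):
            members = X[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return assign, centroids
```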


E.G.M. Petrakis Text Clustering 48

Example

[Figure: dendrogram over instances 1–15; the centroids of the clusters at an intermediate level become the initial centroids for K-Means]


E.G.M. Petrakis Text Clustering 49

More on Clustering

Sound methods based on the document-to-document similarity matrix
  Graph-theoretic methods
  O(N²) time
Iterative methods operating directly on the document vectors
  O(N log N), O(N²/log N), O(mN) time


E.G.M. Petrakis Text Clustering 50

Soft Clustering

Hard clustering: each instance belongs to exactly one cluster
  Does not allow for uncertainty
  An instance may belong to two or more clusters
Soft clustering is based on the probabilities that an instance belongs to each of a set of clusters
  The probabilities over all categories must sum to 1
  Expectation Maximization (EM) is the most popular approach


E.G.M. Petrakis Text Clustering 51

More Methods

Two documents with similarity > T (threshold) are connected with an edge [Duda&Hart73]

clusters: the connected components (maximal cliques) of the resulting graph

problem: selection of appropriate threshold T

Zahn’s method [Zahn71]


E.G.M. Petrakis Text Clustering 52

Zahn’s method [Zahn71]

1. Find the minimum spanning tree of the documents
2. For each document, delete incident edges with length l > l_avg
   l_avg: the average length of its incident edges
3. Clusters: the connected components of the resulting graph

[Figure: the dashed edge is inconsistent and is deleted]
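A minimal sketch of Zahn's procedure using SciPy's minimum spanning tree over pairwise document distances; the per-document average-length rule follows the steps above, and the threshold handling is simplified:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def zahn_clusters(X):
    """MST-based clustering: drop MST edges longer than the average length at an endpoint."""
    dist = squareform(pdist(X))                        # pairwise document distances
    mst = minimum_spanning_tree(dist).toarray()
    mst = mst + mst.T                                  # make the tree symmetric
    avg = np.array([row[row > 0].mean() if (row > 0).any() else 0.0 for row in mst])
    keep = mst.copy()
    for i in range(len(X)):
        for j in range(len(X)):
            if mst[i, j] > 0 and mst[i, j] > avg[i]:   # edge longer than doc i's average
                keep[i, j] = keep[j, i] = 0.0
    n_clusters, labels = connected_components(keep > 0, directed=False)
    return labels                                       # cluster id per document
```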


E.G.M. Petrakis Text Clustering 53

References

"Searching Multimedia Databases by Content", Christos Faloutsos, Kluwer Academic Publishers, 1996

"A Comparison of Document Clustering Techniques", M. Steinbach, G. Karypis, V. Kumar, KDD Workshop on Text Mining, 2000

"Data Clustering: A Review", A.K. Jain, M.N. Murty, P.J. Flynn, ACM Computing Surveys, Vol. 31, No. 3, Sept. 1999

"Algorithms for Clustering Data", A.K. Jain, R.C. Dubes, Prentice-Hall, 1988, ISBN 0-13-022278-X

"Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer", G. Salton, Addison-Wesley, 1989