73
Chapter DM:II (continued) II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-198 Cluster Analysis © STEIN 2006-2019

Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Chapter DM:II (continued)

II. Cluster Analysisq Cluster Analysis Basicsq Hierarchical Cluster Analysisq Iterative Cluster Analysisq Density-Based Cluster Analysisq Cluster Evaluationq Constrained Cluster Analysis

DM:II-198 Cluster Analysis © STEIN 2006-2019

Page 2: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationOverview

“The validation of clustering structures is the most difficult andfrustrating part of cluster analysis. Without a strong effort in thisdirection, cluster analysis will remain a black art accessible only to thosetrue believers who have experience and great courage.”

[Jain/Dubes 1990]

DM:II-199 Cluster Analysis © STEIN 2006-2019

Page 3: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation [Tan/Steinbach/Kumar 2005]

Random points

DM:II-200 Cluster Analysis © STEIN 2006-2019

Page 4: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation [Tan/Steinbach/Kumar 2005]

Random points DBSCAN

DM:II-201 Cluster Analysis © STEIN 2006-2019

Page 5: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation [Tan/Steinbach/Kumar 2005]

Random points DBSCAN

k-means

DM:II-202 Cluster Analysis © STEIN 2006-2019

Page 6: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation [Tan/Steinbach/Kumar 2005]

Random points DBSCAN

k-means Complete link

DM:II-203 Cluster Analysis © STEIN 2006-2019

Page 7: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationOverview

Cluster evaluation can address different issues:

q Provide evidence whether data contains non-random structures.

q Relate found structures in the data to externally provided class information.

q Rank alternative clusterings with regard to their quality.

q Determine the ideal number of clusters.

q Provide information to choose a suited clustering approach.

DM:II-204 Cluster Analysis © STEIN 2006-2019

Page 8: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationOverview

Cluster evaluation can address different issues:

q Provide evidence whether data contains non-random structures.

q Relate found structures in the data to externally provided class information.

q Rank alternative clusterings with regard to their quality.

q Determine the ideal number of clusters.

q Provide information to choose a suited clustering approach.

(1) External validity measures:Analyze how close is a clustering to an (external) reference.

(2) Internal validity measures:Analyze intrinsic characteristics of a clustering.

(3) Relative validity measures:Analyze the sensitivity (of internal measures) during clustering generation.

DM:II-205 Cluster Analysis © STEIN 2006-2019

Page 9: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationOverview

covering analysis

information-theoretic

external

internal

relative

absolute

clustervalidity static: structure analysis

dynamic: re-cluster stability

elbow criterion,GAP statistics

F-Measure, Purity,RAND statistics

dilution analysis

Kullback-Leibler,Entropy

Davies-Bouldin,Dunn, ρ, λ

(1) External validity measures:Analyze how close is a clustering to an (external) reference.

(2) Internal validity measures:Analyze intrinsic characteristics of a clustering.

(3) Relative validity measures:Analyze the sensitivity (of internal measures) during clustering generation.

DM:II-206 Cluster Analysis © STEIN 2006-2019

Page 10: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

DM:II-207 Cluster Analysis © STEIN 2006-2019

Page 11: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Class i

DM:II-208 Cluster Analysis © STEIN 2006-2019

Page 12: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Cluster j

DM:II-209 Cluster Analysis © STEIN 2006-2019

Page 13: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Cluster j

(node-based analysis)

TruthP N

Hypothesis P TP (a) FP (b)N FN (c) TN (d)

DM:II-210 Cluster Analysis © STEIN 2006-2019

Page 14: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Cluster j

(node-based analysis)

TruthP N

Hypothesis P TP (a) FP (b)N FN (c) TN (d)

DM:II-211 Cluster Analysis © STEIN 2006-2019

Page 15: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Cluster j

(node-based analysis)

TruthP N

Hypothesis P TP (a) FP (b)N FN (c) TN (d)

DM:II-212 Cluster Analysis © STEIN 2006-2019

Page 16: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Cluster j

(node-based analysis)

TruthP N

Hypothesis P TP (a) FP (b)N FN (c) TN (d)

DM:II-213 Cluster Analysis © STEIN 2006-2019

Page 17: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Cluster j

(node-based analysis)

TruthP N

Hypothesis P TP (a) FP (b)N FN (c) TN (d)

DM:II-214 Cluster Analysis © STEIN 2006-2019

Page 18: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Cluster j

(node-based analysis)

TruthP N

Hypothesis P TP (a) FP (b)N FN (c) TN (d)

Precision:

a

a + b

Recall:

a

a + c

DM:II-215 Cluster Analysis © STEIN 2006-2019

Page 19: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Cluster j

(node-based analysis)

TruthP N

Hypothesis P TP (a) FP (b)N FN (c) TN (d)

Precision:

a

a + b

Recall:

a

a + c

F -measure:

Fα =1 + α

1precision + α

recall

α = 1 harmonic meanα ∈ (0; 1) favor precision over recallα > 1 favor recall over precision

DM:II-216 Cluster Analysis © STEIN 2006-2019

Page 20: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Classes:

DM:II-217 Cluster Analysis © STEIN 2006-2019

Page 21: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Recall / = 0.26 Precision /( ∪ ) = 0.94 F-Measure = 0.40

In cluster:Target:Classes:

High precision, low recall ⇒ low F -measure.

DM:II-218 Cluster Analysis © STEIN 2006-2019

Page 22: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Recall / = 0.59 Precision /( ∪ ) = 0.53 F-Measure = 0.56

In cluster:Target:Classes:

Low precision, low recall ⇒ low F -measure.

DM:II-219 Cluster Analysis © STEIN 2006-2019

Page 23: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Target Class)

Recall / = 0.92 Precision /( ∪ ) = 0.99 F-Measure = 0.95

In cluster:Target:Classes:

High precision, high recall ⇒ high F -measure.

DM:II-220 Cluster Analysis © STEIN 2006-2019

Page 24: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Clustering)

Cluster j

(node-based analysis)

q Clustering C = {C1, . . . , Ck} and classification C∗ = {C∗1 , . . . , C∗l } of D.

q Fi,j is the F -measure of a cluster j computed with respect to a class i.

Precision of cluster j with respect to class i is |Cj ∩ C∗i |/|Cj| (here: Preci,j = 0.71)

Recall of cluster j with respect to class i is |Cj ∩ C∗i |/|C∗i | (here: Reci,j = 1.0)

DM:II-221 Cluster Analysis © STEIN 2006-2019

Page 25: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Clustering)

Cluster j

(node-based analysis)

q Clustering C = {C1, . . . , Ck} and classification C∗ = {C∗1 , . . . , C∗l } of D.

q Fi,j is the F -measure of a cluster j computed with respect to a class i.

Precision of cluster j with respect to class i is |Cj ∩ C∗i |/|Cj| (here: Preci,j = 0.71)

Recall of cluster j with respect to class i is |Cj ∩ C∗i |/|C∗i | (here: Reci,j = 1.0)

Ü Micro-averaged F -measure for 〈D, C, C∗〉 :

F =

l∑i=1

|C∗i ||D| · max

j=1,...,k{Fi,j}

DM:II-222 Cluster Analysis © STEIN 2006-2019

Page 26: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: F -Measure (for a Clustering)

Cluster j

(node-based analysis)

q Clustering C = {C1, . . . , Ck} and classification C∗ = {C∗1 , . . . , C∗l } of D.

q Fi,j is the F -measure of a cluster j computed with respect to a class i.

Precision of cluster j with respect to class i is |Cj ∩ C∗i |/|Cj| (here: Preci,j = 0.71)

Recall of cluster j with respect to class i is |Cj ∩ C∗i |/|C∗i | (here: Reci,j = 1.0)

Ü Macro-averaged F -measure for 〈D, C, C∗〉 :

F =1

l

l∑i=1

maxj=1,...,k

{Fi,j}

DM:II-223 Cluster Analysis © STEIN 2006-2019

Page 27: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Remarks:

q Micro averaging treats objects (documents) equally, whereas macro averaging treats classesequally.

DM:II-224 Cluster Analysis © STEIN 2006-2019

Page 28: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: Entropy

Cluster j

(node-based analysis)

DM:II-225 Cluster Analysis © STEIN 2006-2019

Page 29: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: Entropy

Cluster j

(node-based analysis)

q A cluster C acts as information source L.L emits cluster labels L1, . . . , Ll with probabilities P (L1), . . . , P (Ll).

DM:II-226 Cluster Analysis © STEIN 2006-2019

Page 30: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: Entropy

Cluster j

(node-based analysis)

q A cluster C acts as information source L.L emits cluster labels L1, . . . , Ll with probabilities P (L1), . . . , P (Ll).

L1 = , L2 = , P̂ ( ) = 10/14, P̂ ( ) = 4/14

DM:II-227 Cluster Analysis © STEIN 2006-2019

Page 31: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: Entropy

Cluster j

(node-based analysis)

q A cluster C acts as information source L.L emits cluster labels L1, . . . , Ll with probabilities P (L1), . . . , P (Ll).

L1 = , L2 = , P̂ ( ) = 10/14, P̂ ( ) = 4/14

q Entropy of L : H(L) = −∑li=1 P (Li) · log 2(P (Li))

Entropy of Cj wrt. C∗ : H(Cj) = −∑

Cj∩C∗i 6=∅|Cj ∩ C∗i |/|Cj| · log 2(|Cj ∩ C∗i |/|Cj|)

DM:II-228 Cluster Analysis © STEIN 2006-2019

Page 32: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: Entropy

Cluster j

(node-based analysis)

q A cluster C acts as information source L.L emits cluster labels L1, . . . , Ll with probabilities P (L1), . . . , P (Ll).

L1 = , L2 = , P̂ ( ) = 10/14, P̂ ( ) = 4/14

q Entropy of L : H(L) = −∑li=1 P (Li) · log 2(P (Li))

Entropy of Cj wrt. C∗ : H(Cj) = −∑

Cj∩C∗i 6=∅|Cj ∩ C∗i |/|Cj| · log 2(|Cj ∩ C∗i |/|Cj|)

Ü Entropy of C wrt. C∗ : H(C) =∑Cj∈C|Cj|/|D| ·H(Cj)

DM:II-229 Cluster Analysis © STEIN 2006-2019

Page 33: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: Rand, Jaccard

Cluster j

true positive

false positive

true negative

false negative

(edge-based analysis)

DM:II-230 Cluster Analysis © STEIN 2006-2019

Page 34: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(1) External Validity Measures: Rand, Jaccard

Cluster j

true positive

false positive

true negative

false negative

(edge-based analysis)

q R(C) =|TP | + |TN |

|TP | + |TN | + |FP | + |FN | =|TP | + |TN |n(n− 1)/2

, with n = |D|

q J(C) =|TP |

|TP | + |FP | + |FN |

DM:II-231 Cluster Analysis © STEIN 2006-2019

Page 35: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

1.0 0.2 0.1 0.3 . . . 0.1 0.0− 1.0 0.1 0.0 . . . 0.0 0.2

...− − − − − 1.0 0.6− − − − − − 1.0

1 0 0 1 . . . 0 0− 1 0 0 . . . 0 1

...− − − − − 1 1− − − − − − 1

q Construct occurrence matrix based on cluster analysis.

q Compare similarity matrix to occurrence matrix: correlation τDM:II-232 Cluster Analysis © STEIN 2006-2019

Page 36: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

1.0 0.2 0.1 0.3 . . . 0.1 0.0− 1.0 0.1 0.0 . . . 0.0 0.2

...− − − − − 1.0 0.6− − − − − − 1.0

1 0 0 1 . . . 0 0− 1 0 0 . . . 0 1

...− − − − − 1 1− − − − − − 1

q Construct occurrence matrix based on cluster analysis.

q Compare similarity matrix to occurrence matrix: correlation τDM:II-233 Cluster Analysis © STEIN 2006-2019

Page 37: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

1.0 0.2 0.1 0.3 . . . 0.1 0.0− 1.0 0.1 0.0 . . . 0.0 0.2

...− − − − − 1.0 0.6− − − − − − 1.0

1 0 0 1 . . . 0 0− 1 0 0 . . . 0 1

...− − − − − 1 1− − − − − − 1

q Construct occurrence matrix based on cluster analysis.

q Compare similarity matrix to occurrence matrix: correlation τDM:II-234 Cluster Analysis © STEIN 2006-2019

Page 38: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

1.0 0.2 0.1 0.3 . . . 0.1 0.0− 1.0 0.1 0.0 . . . 0.0 0.2

...− − − − − 1.0 0.6− − − − − − 1.0

1 0 0 1 . . . 0 0− 1 0 0 . . . 0 1

...− − − − − 1 1− − − − − − 1

q Construct occurrence matrix based on cluster analysis.

q Compare similarity matrix to occurrence matrix: correlation τDM:II-235 Cluster Analysis © STEIN 2006-2019

Page 39: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

k-meansτ = 0.58

k-meansτ = 0.92

1.0 0.2 0.1 0.3 . . . 0.1 0.0− 1.0 0.1 0.0 . . . 0.0 0.2

...− − − − − 1.0 0.6− − − − − − 1.0

1 0 0 1 . . . 0 0− 1 0 0 . . . 0 1

...− − − − − 1 1− − − − − − 1

q Construct occurrence matrix based on cluster analysis.

q Compare similarity matrix to occurrence matrix: correlation τDM:II-236 Cluster Analysis © STEIN 2006-2019

Page 40: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

k-means at structured data. Similarity matrix sorted by cluster label.

DM:II-237 Cluster Analysis © STEIN 2006-2019

Page 41: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

DBSCAN at random data. Similarity matrix sorted by cluster label.

DM:II-238 Cluster Analysis © STEIN 2006-2019

Page 42: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

k-means at random data. Similarity matrix sorted by cluster label.

DM:II-239 Cluster Analysis © STEIN 2006-2019

Page 43: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

Complete link at random data. Similarity matrix sorted by cluster label.

DM:II-240 Cluster Analysis © STEIN 2006-2019

Page 44: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

DBSCAN at structured data. Similarity matrix sorted by cluster label.

DM:II-241 Cluster Analysis © STEIN 2006-2019

Page 45: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Structural Analysis

q Distance for two clusters, dC(C1, C2).

q Diameter of a cluster, ∆(C).

q Scatter within a cluster, σ2(C), SSE.DM:II-242 Cluster Analysis © STEIN 2006-2019

Page 46: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Structural Analysis

dC(C1, C2)

q Distance for two clusters, dC(C1, C2).

q Diameter of a cluster, ∆(C).

q Scatter within a cluster, σ2(C), SSE.DM:II-243 Cluster Analysis © STEIN 2006-2019

Page 47: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Structural Analysis

dC(C1, C2)

∆(C)

q Distance for two clusters, dC(C1, C2).

q Diameter of a cluster, ∆(C).

q Scatter within a cluster, σ2(C), SSE.DM:II-244 Cluster Analysis © STEIN 2006-2019

Page 48: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Structural Analysis

dC(C1, C2)

∆(C)

σ2(C)

q Distance for two clusters, dC(C1, C2).

q Diameter of a cluster, ∆(C).

q Scatter within a cluster, σ2(C), SSE.DM:II-245 Cluster Analysis © STEIN 2006-2019

Page 49: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Dunn Index

I(C) =mini6=j{dC(Ci, Cj)}max1≤l≤k{∆(Cl)}

,

I(C)→ max

DM:II-246 Cluster Analysis © STEIN 2006-2019

Page 50: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Dunn Index

I(C) =mini6=j{dC(Ci, Cj)}max1≤l≤k{∆(Cl)}

,

I(C)→ max

Cluster distance

DM:II-247 Cluster Analysis © STEIN 2006-2019

Page 51: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Dunn Index

I(C) =mini6=j{dC(Ci, Cj)}max1≤l≤k{∆(Cl)}

,

I(C)→ max

Cluster diameter

DM:II-248 Cluster Analysis © STEIN 2006-2019

Page 52: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Dunn Index

I(C) =mini6=j{dC(Ci, Cj)}max1≤l≤k{∆(Cl)}

,

I(C)→ max

q Dunn is susceptible to noise.

q Dunn is biased towards the worst substructure in a clustering (cf. the min).

q Dunn value too low since distances and diameters are not put into relation.DM:II-249 Cluster Analysis © STEIN 2006-2019

Page 53: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Dunn Index

∆(C1)

C1C3

C2

dC(C1, C2)

I(C) =mini6=j{dC(Ci, Cj)}max1≤l≤k{∆(Cl)}

,

I(C)→ max

q Dunn is susceptible to noise.

q Dunn is biased towards the worst substructure in a clustering (cf. the min).

q Dunn value too low since distances and diameters are not put into relation.DM:II-250 Cluster Analysis © STEIN 2006-2019

Page 54: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]

DM:II-251 Cluster Analysis © STEIN 2006-2019

Page 55: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]

Different models (feature sets) yield different similarity graphs.

DM:II-252 Cluster Analysis © STEIN 2006-2019

Page 56: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]

Different models (feature sets) yield different similarity graphs.

DM:II-253 Cluster Analysis © STEIN 2006-2019

Page 57: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]

Compare (for alternative clusterings) the similarity density within the clusters to theaverage similarity of the entire graph.

DM:II-254 Cluster Analysis © STEIN 2006-2019

Page 58: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]

Compare (for alternative clusterings) the similarity density within the clusters to theaverage similarity of the entire graph.

DM:II-255 Cluster Analysis © STEIN 2006-2019

Page 59: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]

Compare (for alternative clusterings) the similarity density within the clusters to theaverage similarity of the entire graph.

DM:II-256 Cluster Analysis © STEIN 2006-2019

Page 60: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]

Graph G = 〈V,E〉 :

q G is called sparse if |E| = O(|V |), G is called dense if |E| = O(|V |2)

Ü the density θ computes from the equation |E| = |V |θ

DM:II-257 Cluster Analysis © STEIN 2006-2019

Page 61: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]

Graph G = 〈V,E〉 :

q G is called sparse if |E| = O(|V |), G is called dense if |E| = O(|V |2)

Ü the density θ computes from the equation |E| = |V |θ

Similarity graph G = 〈V,E,w〉 :

q |E| ∼ w(G) =∑e∈E

w(e)

Ü the density θ computes from the equation w(G) = |V |θ

DM:II-258 Cluster Analysis © STEIN 2006-2019

Page 62: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]

Graph G = 〈V,E〉 :

q G is called sparse if |E| = O(|V |), G is called dense if |E| = O(|V |2)

Ü the density θ computes from the equation |E| = |V |θ

Similarity graph G = 〈V,E,w〉 :

q |E| ∼ w(G) =∑e∈E

w(e)

Ü the density θ computes from the equation w(G) = |V |θ

Cluster Ci induces subgraph Gi :

Ü the expected density ρ relates the density of Gi to the density average in G

ρ(Gi) =w(Gi)

|Vi|θDM:II-259 Cluster Analysis © STEIN 2006-2019

Page 63: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(3) Relative Validity Measures: Elbow Criterion

1. Hyperparameter alternatives of a clustering algorithm: π1, . . . , πmq number of centroids for k-meansq stopping level for hierarchical algorithmsq neighborhood size for DBSCAN

2. Set of clusterings C = {Cπ1, . . . , Cπm} associated with π1, . . . , πm.

3. Points of an error curve {(πi, e(Cπi)

)| i = 1, . . . ,m}.

DM:II-260 Cluster Analysis © STEIN 2006-2019

Page 64: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(3) Relative Validity Measures: Elbow Criterion

1. Hyperparameter alternatives of a clustering algorithm: π1, . . . , πmq number of centroids for k-meansq stopping level for hierarchical algorithmsq neighborhood size for DBSCAN

2. Set of clusterings C = {Cπ1, . . . , Cπm} associated with π1, . . . , πm.

3. Points of an error curve {(πi, e(Cπi)

)| i = 1, . . . ,m}.

DM:II-261 Cluster Analysis © STEIN 2006-2019

Page 65: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(3) Relative Validity Measures: Elbow Criterion

1. Hyperparameter alternatives of a clustering algorithm: π1, . . . , πmq number of centroids for k-meansq stopping level for hierarchical algorithmsq neighborhood size for DBSCAN

2. Set of clusterings C = {Cπ1, . . . , Cπm} associated with π1, . . . , πm.

3. Points of an error curve {(πi, e(Cπi)

)| i = 1, . . . ,m}.

π = cluster number

e = SSE

|V|1 k

4. Find the point that maximizes error reduction with regard to its successor.

DM:II-262 Cluster Analysis © STEIN 2006-2019

Page 66: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(3) Relative Validity Measures: Elbow Criterion

dC: Hamming distanceMerging: complete link

http://cs.jhu.edu/~razvanm/fs-expedition/2.6.x.html

Relations between 1377 file systems for Linux Kernel 2.6.0. [Razvan Musaloiu 2009]

DM:II-263 Cluster Analysis © STEIN 2006-2019

Page 67: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster Evaluation(3) Relative Validity Measures: Elbow Criterion

dC: Hamming distanceMerging: group average link

http://cs.jhu.edu/~razvanm/fs-expedition/2.6.x.html

Relations between 1377 file systems for Linux Kernel 2.6.0. [Razvan Musaloiu 2009]

DM:II-264 Cluster Analysis © STEIN 2006-2019

Page 68: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationCorrelation between External and Internal Measures

In the wild, we are not given a reference classification.

Ü An external evaluation is not possible.(though many papers report on such experiments)

Ü Resort to an internal evaluation.(connectivity, squared error sums, distance-diameter heuristics, etc.)

DM:II-265 Cluster Analysis © STEIN 2006-2019

Page 69: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationCorrelation between External and Internal Measures

In the wild, we are not given a reference classification.

Ü An external evaluation is not possible.(though many papers report on such experiments)

Ü Resort to an internal evaluation.(connectivity, squared error sums, distance-diameter heuristics, etc.)

“To which extent can an internal evaluation φ be used to predict for aclustering its distance from the best reference classification—say, topredict the F -measure?”

argmaxφ

{τ〈X, Y 〉 | x = F (C), y = φ(C), C ∈ C}[Stein/Meyer zu Eissen 2007]

DM:II-266 Cluster Analysis © STEIN 2006-2019

Page 70: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationCorrelation between External and Internal Measures

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

F-M

easu

re

Cluster Validity

Perfect correlation (desired).DM:II-267 Cluster Analysis © STEIN 2006-2019

Page 71: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationCorrelation between External and Internal Measures

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

-35 -30 -25 -20 -15 -10 -5 0

F-M

easu

re

Davies-Bouldin

5 classes, 800 documents

Davies-Bouldin:1

k∑i=1

maxj

s(Ci) + s(Cj)

dC(Ci, Cj)

Prefers spherical clusters.DM:II-268 Cluster Analysis © STEIN 2006-2019

Page 72: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationCorrelation between External and Internal Measures

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6

F-M

easu

re

Dunn Index

5 classes, 800 documents

Dunn Index:mini 6=j{dC(Ci, Cj)}max1≤l≤k{∆(Cl)}

Maximizes dilatation = inter/intra-cluster-diameter.DM:II-269 Cluster Analysis © STEIN 2006-2019

Page 73: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019

Cluster EvaluationCorrelation between External and Internal Measures

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

600 650 700 750 800 850 900 950 1000

F-M

easu

re

Expected Density

5 classes, 800 documents

Expected Density: ρ̄ =

k∑i=1

|Vi||V | ·

w(Gi)

|Vi|θ

Independent of cluster forms and sizes.DM:II-270 Cluster Analysis © STEIN 2006-2019