CSE4334/5334 Data Mining: Clustering
Chengkai Li, University of Texas at Arlington, Fall 2016


What is Cluster Analysis? Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups:

Inter-cluster distances are maximized.

Intra-cluster distances are minimized.

Applications of Cluster Analysis

Understanding: group related objects, as in the stock clusters below.

Summarization: reduce the size of large data sets.

Discovered Clusters / Industry Group

1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN / Technology1-DOWN

2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN / Technology2-DOWN

3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN / Financial-DOWN

4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP / Oil-UP

[Figure: clustering precipitation in Australia]

What is not Cluster Analysis?

Supervised classification

Simple segmentation

Results of a query

Graph partitioning

Notion of a Cluster can be Ambiguous

How many clusters? The same points could be seen as two clusters, four clusters, or six clusters.

Types of Clusterings A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering

Hierarchical clustering

Partitional Clustering

[Figure: original points and a partitional clustering of them]

Hierarchical Clustering

[Figure: points p1–p4 grouped by traditional and non-traditional hierarchical clustering, with the corresponding traditional and non-traditional dendrograms]

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive

Fuzzy versus non-fuzzy

Partial versus complete

Heterogeneous versus homogeneous

Types of Clusters

Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or conceptual clusters
Clusters described by an objective function

Types of Clusters: Well-Separated

Well-Separated Clusters

3 well-separated clusters

Types of Clusters: Center-Based

Center-based

centroid / medoid

4 center-based clusters

Types of Clusters: Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

8 contiguous clusters

Types of Clusters: Density-Based

Density-based

6 density-based clusters

Types of Clusters: Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters: Objective Function

Clusters Defined by an Objective Function

Types of Clusters: Objective Function …

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality
Noise and Outliers
Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

Partitional clustering approach.
Each cluster is associated with a centroid (center point).
Each point is assigned to the cluster with the closest centroid.
Number of clusters, K, must be specified.
The basic algorithm is very simple.

K-means Clustering – Details

Initial centroids are often chosen randomly

The centroid m_i is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above, and most of the convergence happens in the first few iterations.

Complexity is O(n · K · I · d), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.

$m_i = \dfrac{1}{|C_i|} \sum_{x \in C_i} x$ (the centroid of cluster $C_i$ is the mean of its points).
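To make the procedure concrete, here is a minimal NumPy sketch of the basic algorithm; the function name, the random initialization, and the convergence test are illustrative choices rather than anything prescribed by the slides:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic K-means: random initial centroids, Euclidean 'closeness'."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random init
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its old centroid; see 'Handling Empty Clusters').
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  if np.any(labels == k) else centroids[k]
                                  for k in range(K)])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels
```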

Two different K-means Clusterings

[Figure: the same original points clustered two ways by K-means, once as an optimal clustering and once as a sub-optimal clustering]

Importance of Choosing Initial Centroids

[Figure: K-means iterations 1–6 for one choice of initial centroids]

Importance of Choosing Initial Centroids

[Figure: K-means iterations 1–6 for a different choice of initial centroids]

Evaluating K-means Clusters

The most common measure is the Sum of Squared Error (SSE): the error of each point is its distance to the nearest centroid; these errors are squared and summed.

$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2$, where $x$ is a data point in cluster $C_i$ and $m_i$ is the centroid of $C_i$.
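A direct translation of the formula, as a sketch (with `centroids` and `labels` as produced by a K-means run such as the one sketched earlier):

```python
import numpy as np

def sse(X, centroids, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centroids))
```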

Importance of Choosing Initial Centroids …

[Figure: K-means iterations 1–5 for one set of initial centroids]

Importance of Choosing Initial Centroids …

[Figure: K-means iterations 1–5 for another set of initial centroids]

Problems with Selecting Initial Points

If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small. (For equal-sized clusters the probability is K!/K^K; for K = 10 that is 10!/10^10 ≈ 0.00036.)

10 Clusters Example

[Figure: K-means iterations 1–4 on the 10-cluster data set]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: K-means iterations 1–4 on the 10-cluster data set]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: K-means iterations 1–4 on the 10-cluster data set]

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: K-means iterations 1–4 on the 10-cluster data set]

Solutions to Initial Centroids Problem

Multiple runs (keep the run with the lowest SSE; see the sketch below)
Sample the data and use hierarchical clustering to determine initial centroids
Select more than k initial centroids, then select among these initial centroids
Postprocessing
Bisecting K-means
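One common realization of "multiple runs" is to repeat K-means with different random initializations and keep the run with the lowest SSE; scikit-learn's KMeans does exactly this through its n_init parameter. A sketch on a hypothetical toy data set:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)   # toy data
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)   # 10 runs, best kept
print(km.inertia_)   # SSE of the best of the 10 runs
```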

Handling Empty Clusters

The basic K-means algorithm can yield empty clusters.

Several strategies exist: e.g., replace the empty centroid with the point that contributes most to SSE, or with a point from the cluster with the highest SSE.

Updating Centers Incrementally

In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.

An alternative is to update the centroids after each assignment (incremental approach)
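A sketch of one such incremental rule, a running mean (the helper is illustrative, not from the slides): after assigning a new point x to a cluster of current size n and centroid m, the centroid moves to m + (x − m)/(n + 1).

```python
def add_point(centroid, size, x):
    """Fold one newly assigned point into a cluster's centroid (running mean)."""
    size += 1
    centroid = centroid + (x - centroid) / size  # works for NumPy vectors too
    return centroid, size
```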

Pre-processing and Post-processing

Pre-processing: normalize the data; eliminate outliers.

Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters (with relatively high SSE); merge clusters that are 'close' (with relatively low SSE).

Bisecting K-means

Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering (see the sketch below).

Bisecting K-means Example
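The usual formulation starts with a single all-inclusive cluster and repeatedly bisects one cluster with 2-means until K clusters remain. A sketch reusing the kmeans helper from earlier; the split-selection rule (largest SSE) and the number of trial bisections are illustrative choices:

```python
import numpy as np

def cluster_sse(C):
    """SSE of a single cluster about its own centroid."""
    return np.sum((C - C.mean(axis=0)) ** 2)

def bisecting_kmeans(X, K, trials=5):
    clusters = [X]
    while len(clusters) < K:
        # Choose the cluster with the largest SSE to bisect next.
        worst = max(range(len(clusters)), key=lambda i: cluster_sse(clusters[i]))
        C = clusters.pop(worst)
        best_split, best_total = None, np.inf
        for t in range(trials):   # several trial bisections, keep the best
            _, labels = kmeans(C, 2, seed=t)
            halves = [C[labels == 0], C[labels == 1]]
            total = sum(cluster_sse(h) for h in halves if len(h) > 0)
            if all(len(h) > 0 for h in halves) and total < best_total:
                best_split, best_total = halves, total
        clusters.extend(best_split)   # (a robust version would guard degenerate splits)
    return clusters
```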

Limitations of K-means

K-means has problems when clusters are of differing sizes, differing densities, or non-globular shapes (illustrated on the next three slides).

K-means also has problems when the data contains outliers.

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters: find parts of clusters, which then need to be put together.

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.

[Figure: nested clusters over points 1–6 and the corresponding dendrogram]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number can be obtained by 'cutting' the dendrogram at the proper level.

The clusters may correspond to meaningful taxonomies (e.g., in the biological sciences).

Hierarchical Clustering

Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and, at each step, merge the closest pair of clusters) and divisive (start with one all-inclusive cluster and, at each step, split a cluster).

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm

The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.

Key operation is the computation of the proximity of two clusters
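In practice the full loop (proximity matrix, merge the closest pair, update, repeat) is available off the shelf; a sketch using SciPy on a hypothetical point set:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.default_rng(0).random((6, 2))   # six hypothetical 2-D points
Z = linkage(X, method='single')   # agglomerative merges, single link (MIN)
dendrogram(Z)                     # visualize the sequence of merges
plt.show()
```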

Starting Situation

Start with clusters of individual points and a proximity matrix

[Figure: points p1–p5 as singleton clusters, with their proximity matrix]

Intermediate Situation

After some merging steps, we have some clusters.

[Figure: clusters C1–C5 and their proximity matrix]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

[Figure: clusters C1–C5 and their proximity matrix, with C2 and C5 about to be merged]

After Merging

The question is: how do we update the proximity matrix?

[Figure: clusters C1, C3, C4, and the merged cluster C2 ∪ C5, with the corresponding rows and columns of the proximity matrix to be updated]

How to Define Inter-Cluster Similarity

[Figure: two clusters drawn over points p1–p5, with their proximity matrix; the slides repeat this figure once for each definition below]

MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)

Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters.

Similarity matrix over items I1–I5:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
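To reproduce the merge order implied by this matrix, one can convert similarities to distances (here d = 1 − s, an illustrative choice) and hand the condensed matrix to SciPy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])
D = 1.0 - S                                   # similarity -> distance
Z = linkage(squareform(D), method='single')   # MIN / single link
print(Z)   # each row: the two clusters merged, their distance, new cluster size
```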


Hierarchical Clustering: MIN

[Figure: nested clusters and dendrogram for single-link (MIN) clustering of six points]

Strength of MIN

[Figure: original points and the two clusters found by single link]

• Can handle non-elliptical shapes

Limitations of MIN

[Figure: original points and the two clusters found by single link]

• Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

Similarity matrix over items I1–I5:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00


Hierarchical Clustering: MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

[Figure: original points and the two clusters found by complete link]

• Less susceptible to noise and outliers

Limitations of MAX

[Figure: original points and the two clusters found by complete link]

• Tends to break large clusters
• Biased towards globular clusters

Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Need to use average connectivity for scalability since total proximity favors large clusters

$\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \dfrac{\sum_{p_i \in \mathrm{Cluster}_i,\; p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \times |\mathrm{Cluster}_j|}$

Similarity matrix over items I1–I5:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00


Hierarchical Clustering: Group Average

[Figure: nested clusters and dendrogram for group-average clustering of six points]

Hierarchical Clustering: Group Average

A compromise between single and complete link.
Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means
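The schemes compared on the following slide differ only in the linkage rule; a sketch (the point set is hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).random((20, 2))   # hypothetical points
for method in ['single', 'complete', 'average', 'ward']:   # MIN, MAX, group avg, Ward
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=4, criterion='maxclust')   # cut the tree into 4 clusters
    print(method, labels)
```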

Hierarchical Clustering: Comparison

[Figure: clusterings of the same six points by MIN, MAX, group average, and Ward's method, shown side by side]

Hierarchical Clustering: Time and Space Requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N steps, and at each step the proximity matrix (of size N²) must be searched and updated. This can be reduced to O(N² log N) time for some approaches.

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different-sized clusters and non-convex shapes; breaking large clusters.

MST Divisive Hierarchical Clustering: build an MST (minimum spanning tree) over the points, then use the MST to construct the hierarchy of clusters, splitting by breaking the largest remaining MST edge at each step.

Cluster Validity

For supervised classification, we have a variety of measures to evaluate how good our model is (e.g., accuracy, precision, recall).

For cluster analysis, the analogous question is how to evaluate the 'goodness' of the resulting clusters. But 'clusters are in the eye of the beholder'! Then why do we want to evaluate them?

Clusters found in Random Data

[Figure: 2-D random points, and the clusters found in them by K-means, DBSCAN, and complete link]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
External index: measures the extent to which cluster labels match externally supplied class labels.
Internal index: measures the goodness of a clustering structure without respect to external information.
Relative index: compares two different clusterings or clusters.

Sometimes these are referred to as criteria instead of indices.

Measuring Cluster Validity via Correlation

Two matrices: the proximity matrix and the 'incidence' matrix (one row and one column per point; an entry is 1 if the two points belong to the same cluster, 0 otherwise).

Compute the correlation between the two matrices; since both are symmetric, only the n(n−1)/2 off-diagonal entries need to be correlated.

High correlation (in magnitude) indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.
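A sketch of the computation; the helper name is illustrative. Note that for distance-based proximity, a good clustering yields a strongly negative correlation (same cluster implies small distance), which matches the Corr values on the next slide:

```python
import numpy as np
from scipy.spatial.distance import pdist

def incidence_proximity_corr(X, labels):
    """Correlate the incidence matrix with the distance matrix (upper triangles)."""
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(X), k=1)   # the n(n-1)/2 off-diagonal entries
    # pdist enumerates pairs (i, j), i < j, in the same order as triu_indices.
    return np.corrcoef(incidence[iu], pdist(X))[0, 1]
```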

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

[Figure: K-means clusterings of a data set with well-separated clusters and of random data]
Corr = −0.9235 (well-separated clusters) and Corr = −0.5810 (random data)

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: a data set with well-separated clusters and its similarity matrix sorted by cluster label, showing crisp diagonal blocks]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: sorted similarity matrix and scatter plot for DBSCAN clusters found in random data]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: sorted similarity matrix and scatter plot for K-means clusters found in random data]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: sorted similarity matrix and scatter plot for complete-link clusters found in random data]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clusters 1–7 in a more complicated figure, and the corresponding sorted similarity matrix]
Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
SSE can also be used to estimate the number of clusters.
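A sketch of the estimate-K-by-SSE idea: run K-means across a range of K and look for a knee in the curve (the data set is hypothetical):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=10, random_state=0)   # toy 10-cluster data
for K in range(2, 16):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    print(K, round(km.inertia_, 1))   # SSE typically shows a knee near the true K
```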

[Figure: SSE as a function of the number of clusters K (K = 2 to 30) for the 10-cluster data set; knees in the curve suggest natural cluster counts]

Internal Measures: SSE

SSE curve for a more complicated data set.

[Figure: a complicated figure with clusters 1–7, and the SSE curve of the clusters found using K-means]

Framework for Cluster Validity

We need a framework to interpret any measure: if a measure evaluates to, say, 10, is that good or bad? Statistics provide such a framework for cluster validity: compare the value of an index on the actual data with its distribution over random data. For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Statistical Framework for SSE

Example: compare the SSE of a clustering of the actual data against the histogram of SSE values obtained by clustering many random data sets.

[Figure: histogram (count vs. SSE, with SSE roughly 0.016 to 0.034) over clusterings of random data, next to the example data set]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the two data sets shown earlier.

[Figure: the two data sets]
Corr = −0.9235 (well-separated clusters) and Corr = −0.5810 (random data)

Internal Measures: Cohesion and Separation

Cluster cohesion measures how closely related the objects in a cluster are (e.g., within-cluster sum of squares, WSS).

Cluster separation measures how distinct or well-separated a cluster is from other clusters (e.g., between-cluster sum of squares, BSS).

Example: squared error.

$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$

$\mathrm{BSS} = \sum_i |C_i|\,(m - m_i)^2$

where $m_i$ is the centroid of cluster $C_i$ and $m$ is the overall mean.

Internal Measures: Cohesion and Separation (Example, SSE)

Data: the points 1, 2, 4, 5 on a line; overall mean m = 3.

K = 1 cluster:
WSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
BSS = 4 × (3−3)² = 0
Total = WSS + BSS = 10 + 0 = 10

K = 2 clusters, {1, 2} with mean m₁ = 1.5 and {4, 5} with mean m₂ = 4.5:
WSS = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1
BSS = 2 × (3−1.5)² + 2 × (4.5−3)² = 9
Total = WSS + BSS = 1 + 9 = 10

Note that WSS + BSS is constant (the total sum of squares).
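The arithmetic can be checked mechanically; a small sketch over the same four points:

```python
import numpy as np

X = np.array([1.0, 2.0, 4.0, 5.0])
m = X.mean()   # overall mean = 3

def wss_bss(groups):
    wss = sum(np.sum((g - g.mean()) ** 2) for g in groups)
    bss = sum(len(g) * (g.mean() - m) ** 2 for g in groups)
    return wss, bss

print(wss_bss([X]))             # K=1: (10.0, 0.0)
print(wss_bss([X[:2], X[2:]]))  # K=2: (1.0, 9.0)
```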

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of links within a cluster; separation is the sum of the weights of links between nodes in the cluster and nodes outside it.

Internal Measures: Silhouette Coefficient

The silhouette coefficient combines ideas of both cohesion and separation, for individual points as well as for clusters and clusterings. For an individual point i, in the common formulation:
a = average distance of i to the points in its own cluster,
b = minimum, over the other clusters, of the average distance of i to the points in that cluster,
s = (b − a) / max(a, b), typically between 0 and 1 (the closer to 1, the better).

The average silhouette width can then be calculated for a cluster or for an entire clustering.
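scikit-learn provides both the per-point coefficient and the average width; a sketch on a hypothetical data set:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_samples(X, labels)[:5])   # silhouette coefficients of 5 points
print(silhouette_score(X, labels))         # average silhouette width
```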

External Measures of Cluster Validity: Entropy and Purity
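As one illustration of an external measure, purity credits each cluster with its majority class; a sketch assuming integer class labels (the function is illustrative, not from the slides):

```python
import numpy as np

def purity(true_labels, cluster_labels):
    """Fraction of points whose cluster's majority class equals their true class."""
    hits = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        hits += np.bincount(members).max()   # count of the majority class
    return hits / len(true_labels)
```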

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." (Jain and Dubes, Algorithms for Clustering Data)

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100

Discovered Clusters

Industry Group

1

Applied-Matl-DOWNBay-Network-Down3-COM-DOWN

Cabletron-Sys-DOWNCISCO-DOWNHP-DOWN

DSC-Comm-DOWNINTEL-DOWNLSI-Logic-DOWN

Micron-Tech-DOWNTexas-Inst-DownTellabs-Inc-Down

Natl-Semiconduct-DOWNOracl-DOWNSGI-DOWN

Sun-DOWN

Technology1-DOWN

2

Apple-Comp-DOWNAutodesk-DOWNDEC-DOWN

ADV-Micro-Device-DOWNAndrew-Corp-DOWN

Computer-Assoc-DOWNCircuit-City-DOWN

Compaq-DOWN EMC-Corp-DOWN Gen-Inst-DOWN

Motorola-DOWNMicrosoft-DOWNScientific-Atl-DOWN

Technology2-DOWN

3

Fannie-Mae-DOWNFed-Home-Loan-DOWN

MBNA-Corp-DOWNMorgan-Stanley-DOWN

Financial-DOWN

4

Baker-Hughes-UPDresser-Inds-UPHalliburton-HLD-UP

Louisiana-Land-UPPhillips-Petro-UPUnocal-UP

Schlumberger-UP

Oil-UP

Page 2: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

What is Cluster Analysis Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are

minimized

Applications of Cluster Analysis Understanding

Summarization

Discovered Clusters Industry Group

1 Applied-Matl-DOWNBay-Network-Down3-COM-DOWN Cabletron-Sys-DOWNCISCO-DOWNHP-DOWN

DSC-Comm-DOWNINTEL-DOWNLSI-Logic-DOWN Micron-Tech-DOWNTexas-Inst-DownTellabs-Inc-Down

Natl-Semiconduct-DOWNOracl-DOWNSGI-DOWN Sun-DOWN

Technology1-DOWN

2 Apple-Comp-DOWNAutodesk-DOWNDEC-DOWN ADV-Micro-Device-DOWNAndrew-Corp-DOWN

Computer-Assoc-DOWNCircuit-City-DOWN Compaq-DOWN EMC-Corp-DOWN Gen-Inst-DOWN

Motorola-DOWNMicrosoft-DOWNScientific-Atl-DOWN

Technology2-DOWN

3 Fannie-Mae-DOWNFed-Home-Loan-DOWN MBNA-Corp-DOWNMorgan-Stanley-DOWN

Financial-DOWN

4 Baker-Hughes-UPDresser-Inds-UPHalliburton-HLD-UP Louisiana-Land-UPPhillips-Petro-UPUnocal-UP

Schlumberger-UP

Oil-UP

Clustering precipitation in Australia

What is not Cluster Analysis Supervised classification

Simple segmentation

Results of a query

Graph partitioning

Notion of a Cluster can be Ambiguous

How many clusters

Four Clusters Two Clusters

Six Clusters

Types of Clusterings A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering

Hierarchical clustering

Partitional Clustering

Original Points A Partitional Clustering

Hierarchical Clustering

p4p1

p3

p2

p4 p1

p3

p2 p4p1 p2 p3

p4p1 p2 p3

Traditional Hierarchical Clustering

Non-traditional Hierarchical Clustering Non-traditional Dendrogram

Traditional Dendrogram

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive

Fuzzy versus non-fuzzy

Partial versus complete

Heterogeneous versus homogeneous

Types of Clusters

Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function

Types of Clusters Well-Separated

Well-Separated Clusters

3 well-separated clusters

Types of Clusters Center-Based

Center-based

centroidmedoid

4 center-based clusters

Types of Clusters Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

8 contiguous clusters

Types of Clusters Density-Based

Density-based

6 density-based clusters

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100

Discovered Clusters

Industry Group

1

Applied-Matl-DOWNBay-Network-Down3-COM-DOWN

Cabletron-Sys-DOWNCISCO-DOWNHP-DOWN

DSC-Comm-DOWNINTEL-DOWNLSI-Logic-DOWN

Micron-Tech-DOWNTexas-Inst-DownTellabs-Inc-Down

Natl-Semiconduct-DOWNOracl-DOWNSGI-DOWN

Sun-DOWN

Technology1-DOWN

2

Apple-Comp-DOWNAutodesk-DOWNDEC-DOWN

ADV-Micro-Device-DOWNAndrew-Corp-DOWN

Computer-Assoc-DOWNCircuit-City-DOWN

Compaq-DOWN EMC-Corp-DOWN Gen-Inst-DOWN

Motorola-DOWNMicrosoft-DOWNScientific-Atl-DOWN

Technology2-DOWN

3

Fannie-Mae-DOWNFed-Home-Loan-DOWN

MBNA-Corp-DOWNMorgan-Stanley-DOWN

Financial-DOWN

4

Baker-Hughes-UPDresser-Inds-UPHalliburton-HLD-UP

Louisiana-Land-UPPhillips-Petro-UPUnocal-UP

Schlumberger-UP

Oil-UP

Page 3: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Applications of Cluster Analysis Understanding

Summarization

Discovered Clusters Industry Group

1 Applied-Matl-DOWNBay-Network-Down3-COM-DOWN Cabletron-Sys-DOWNCISCO-DOWNHP-DOWN

DSC-Comm-DOWNINTEL-DOWNLSI-Logic-DOWN Micron-Tech-DOWNTexas-Inst-DownTellabs-Inc-Down

Natl-Semiconduct-DOWNOracl-DOWNSGI-DOWN Sun-DOWN

Technology1-DOWN

2 Apple-Comp-DOWNAutodesk-DOWNDEC-DOWN ADV-Micro-Device-DOWNAndrew-Corp-DOWN

Computer-Assoc-DOWNCircuit-City-DOWN Compaq-DOWN EMC-Corp-DOWN Gen-Inst-DOWN

Motorola-DOWNMicrosoft-DOWNScientific-Atl-DOWN

Technology2-DOWN

3 Fannie-Mae-DOWNFed-Home-Loan-DOWN MBNA-Corp-DOWNMorgan-Stanley-DOWN

Financial-DOWN

4 Baker-Hughes-UPDresser-Inds-UPHalliburton-HLD-UP Louisiana-Land-UPPhillips-Petro-UPUnocal-UP

Schlumberger-UP

Oil-UP

Clustering precipitation in Australia

What is not Cluster Analysis Supervised classification

Simple segmentation

Results of a query

Graph partitioning

Notion of a Cluster can be Ambiguous

How many clusters

Four Clusters Two Clusters

Six Clusters

Types of Clusterings A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering

Hierarchical clustering

Partitional Clustering

Original Points A Partitional Clustering

Hierarchical Clustering

p4p1

p3

p2

p4 p1

p3

p2 p4p1 p2 p3

p4p1 p2 p3

Traditional Hierarchical Clustering

Non-traditional Hierarchical Clustering Non-traditional Dendrogram

Traditional Dendrogram

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive

Fuzzy versus non-fuzzy

Partial versus complete

Heterogeneous versus homogeneous

Types of Clusters

Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function

Types of Clusters Well-Separated

Well-Separated Clusters

3 well-separated clusters

Types of Clusters Center-Based

Center-based

centroidmedoid

4 center-based clusters

Types of Clusters Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

8 contiguous clusters

Types of Clusters Density-Based

Density-based

6 density-based clusters

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

[Figure: the same poor run, one panel per iteration. Panels: Iteration 1 through Iteration 5.]

Problems with Selecting Initial Points

If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small, especially when K is large. If the clusters are the same size, n, the probability is $P = \frac{K!\,n^K}{(Kn)^K} = \frac{K!}{K^K}$; for example, for K = 10 the probability is 10!/10^10 ≈ 0.00036.
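A quick sanity check of this probability, assuming K equal-size clusters and initial centroids drawn uniformly at random:

```python
from math import factorial

K = 10
p = factorial(K) / K**K   # K! / K^K: the chance of one centroid per cluster
print(p)                  # ~0.00036
```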

10 Clusters Example

[Figure: ten clusters arranged in five pairs. Panels: Iteration 1 through Iteration 4.]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: the same data, one panel per iteration. Panels: Iteration 1 through Iteration 4.]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one

[Figure: Iteration 1 through Iteration 4 with the unbalanced initialization.]

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one

[Figure: the same run, one panel per iteration. Panels: Iteration 1 through Iteration 4.]

Solutions to Initial Centroids Problem

o Multiple runs: helps, but probability is not on your side (see the sketch after this list)

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing
o Bisecting K-means
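A minimal sketch of the "multiple runs" strategy, reusing the kmeans() sketch from earlier in this section: run several random initializations and keep the clustering with the lowest SSE. The function name kmeans_best_of is ours.

```python
import numpy as np

def kmeans_best_of(X, K, runs=10):
    """Run K-means `runs` times and keep the result with the smallest SSE."""
    best = None
    for seed in range(runs):
        labels, centroids = kmeans(X, K, seed=seed)
        sse = ((X - centroids[labels]) ** 2).sum()   # sum of squared errors
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best   # (sse, labels, centroids)
```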

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies: choose the point that contributes most to SSE as the replacement centroid, or choose a point from the cluster with the highest SSE; if there are several empty clusters, this can be repeated.

Updating Centers Incrementally

In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.

An alternative is to update the centroids after each assignment (incremental approach): each assignment updates zero or two centroids, which is more expensive and introduces an order dependency, but it never produces an empty cluster, and weights can be used to change the impact of each point, as sketched below.
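As a sketch, the incremental update of a centroid m that currently summarizes n points, after assigning a new point x to it, is the running-mean formula m + (x − m)/(n + 1); the helper name incremental_update is ours.

```python
import numpy as np

def incremental_update(m, n, x):
    """Update centroid m (the mean of n points) after adding point x."""
    m_new = m + (x - m) / (n + 1)   # weights other than 1/(n+1) change the impact
    return m_new, n + 1
```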

Pre-processing and Post-processing

Pre-processing: normalize the data and eliminate outliers.

Post-processing: eliminate small clusters that may represent outliers, split 'loose' clusters (clusters with relatively high SSE), and merge clusters that are 'close' and have relatively low SSE.

Bisecting K-means

Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering.
  1. Initialize the list of clusters to contain one cluster consisting of all points.
  2. Repeat:
  3.   Select a cluster from the list of clusters (e.g., the largest one, or the one with the highest SSE).
  4.   Bisect the selected cluster using basic K-means (K = 2).
  5.   Add the two clusters from the bisection to the list of clusters.
  6. Until the list of clusters contains K clusters.

Bisecting K-means Example
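A minimal sketch of the algorithm, reusing the kmeans() sketch from earlier; bisecting the cluster with the largest SSE is one of several reasonable selection criteria (cluster size is another), and the function name bisecting_kmeans is ours.

```python
import numpy as np

def bisecting_kmeans(X, K):
    clusters = [np.arange(len(X))]   # start with one cluster holding all points
    while len(clusters) < K:
        # Select the cluster with the largest SSE.
        sses = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Bisect it with basic 2-means and add both halves back to the list.
        labels, _ = kmeans(X[target], 2)
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters   # list of index arrays, one per cluster
```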

Limitations of K-means

K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters: find parts of clusters, which then need to be put together.

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

[Figure: six points (1 through 6), their nested clusters, and the corresponding dendrogram.]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number can be obtained by 'cutting' the dendrogram at the proper level.

The clusters may correspond to meaningful taxonomies (e.g., the animal kingdom, or phylogeny reconstruction in the biological sciences).

Hierarchical Clustering

Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and, at each step, merge the closest pair of clusters) and divisive (start with one all-inclusive cluster and, at each step, split a cluster).

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm

The more popular hierarchical clustering technique.

Basic algorithm is straightforward:
  1. Compute the proximity matrix.
  2. Let each data point be a cluster.
  3. Repeat:
  4.   Merge the two closest clusters.
  5.   Update the proximity matrix.
  6. Until only a single cluster remains.

Key operation is the computation of the proximity of two clusters
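A naive O(N^3) sketch of the algorithm, with the single-link (MIN) update shown; the linkage schemes discussed below differ only in how the merged row of the proximity matrix is recomputed. The function name agglomerative_single_link is ours.

```python
import numpy as np

def agglomerative_single_link(D):
    """D is an (n, n) distance matrix; returns the sequence of merges."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)      # a cluster is never merged with itself
    active = list(range(len(D)))     # each point starts as its own cluster
    merges = []
    while len(active) > 1:
        # Find the two closest active clusters.
        sub = D[np.ix_(active, active)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = active[i], active[j]
        merges.append((a, b))
        # Single link: distance to the merged cluster is the minimum distance.
        D[a, :] = np.minimum(D[a, :], D[b, :])
        D[:, a] = D[a, :]
        D[a, a] = np.inf
        active.remove(b)             # cluster b is absorbed into a
    return merges
```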

Starting Situation

Start with clusters of individual points and a proximity matrix

[Figure: five points p1 through p5 and their initial proximity matrix.]

Intermediate Situation

After some merging steps, we have some clusters.

[Figure: clusters C1 through C5 and the current proximity matrix.]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

[Figure: the two closest clusters, C2 and C5, highlighted in the proximity matrix.]

After Merging

The question is: "How do we update the proximity matrix?"

[Figure: the merged cluster C2 U C5; its row and column of the proximity matrix must be recomputed.]

How to Define Inter-Cluster Similarity

[Figure: two clusters of points p1 through p5; which entries of the proximity matrix define their similarity?]

o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function (Ward's Method uses squared error)


Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
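As a sketch, the same table can be fed to SciPy by converting similarities to distances (dist = 1 − sim is one simple assumed conversion) and running single-link clustering:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

sim = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                [0.90, 1.00, 0.70, 0.60, 0.50],
                [0.10, 0.70, 1.00, 0.40, 0.30],
                [0.65, 0.60, 0.40, 1.00, 0.80],
                [0.20, 0.50, 0.30, 0.80, 1.00]])
dist = 1.0 - sim                           # convert similarity to distance
Z = linkage(squareform(dist), method='single')
print(Z)   # first merge: I1 and I2 (similarity 0.90), then I4 and I5 (0.80), ...
```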


Hierarchical Clustering MIN

[Figure: MIN (single link) nested clusters over points 1 through 6, with the corresponding dendrogram.]

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00


Hierarchical Clustering MAX

[Figure: MAX (complete link) nested clusters over points 1 through 6, with the corresponding dendrogram.]

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters
• Biased towards globular clusters

Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Need to use average connectivity for scalability, since total proximity favors large clusters.

$\mathrm{proximity}(Cluster_i, Cluster_j) = \dfrac{\sum_{p_i \in Cluster_i,\, p_j \in Cluster_j} \mathrm{proximity}(p_i, p_j)}{|Cluster_i| \times |Cluster_j|}$

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
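A quick check of the formula on the table above, for the clusters {I1, I2} and {I4, I5} (a pairing chosen here just for illustration):

```python
import numpy as np

sim = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                [0.90, 1.00, 0.70, 0.60, 0.50],
                [0.10, 0.70, 1.00, 0.40, 0.30],
                [0.65, 0.60, 0.40, 1.00, 0.80],
                [0.20, 0.50, 0.30, 0.80, 1.00]])
a, b = [0, 1], [3, 4]                 # clusters {I1, I2} and {I4, I5}
group_avg = sim[np.ix_(a, b)].mean()  # (0.65 + 0.20 + 0.60 + 0.50) / (2 * 2)
print(group_avg)                      # 0.4875
```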


Hierarchical Clustering Group Average

[Figure: group-average nested clusters over points 1 through 6, with the corresponding dendrogram.]

Hierarchical Clustering: Group Average is a compromise between single and complete link.

Strengths: less susceptible to noise and outliers.

Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means; can be used to initialize K-means.

Hierarchical Clustering Comparison

[Figure: the same six points clustered by MIN, MAX, group average, and Ward's method.]

Hierarchical Clustering Time and Space requirements

O(N^2) space, since it uses the proximity matrix (N is the number of points).

O(N^3) time in many cases: there are N steps, and at each step the proximity matrix of size N^2 must be searched and updated; this can be reduced to O(N^2 log N) for some approaches.

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized.

Different schemes have problems with one or more of the following: sensitivity to noise and outliers, difficulty handling clusters of different sizes and non-convex shapes, and a tendency to break large clusters.

MST Divisive Hierarchical Clustering: build an MST (Minimum Spanning Tree). Start with a tree consisting of any point; in successive steps, look for the closest pair of points (p, q) such that p is in the current tree but q is not, then add q to the tree with an edge between p and q.

MST Divisive Hierarchical Clustering: use the MST for constructing the hierarchy of clusters; each split removes the largest remaining edge of the tree, as sketched below.
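A hedged sketch of the approach, assuming Euclidean distances: build the MST, then cut its K − 1 heaviest edges so that the resulting connected components are the clusters. The function name mst_divisive is ours.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(X, K):
    D = cdist(X, X)                            # pairwise Euclidean distances
    mst = minimum_spanning_tree(D).toarray()   # the n-1 edges of the MST
    edges = np.argwhere(mst > 0)
    weights = mst[mst > 0]
    if K > 1:
        # Remove the K-1 heaviest edges; each removal splits one cluster.
        for i in np.argsort(weights)[-(K - 1):]:
            mst[edges[i, 0], edges[i, 1]] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels
```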

Cluster Validity

For supervised classification, we have a variety of measures (e.g., accuracy, precision, recall) to evaluate how good our model is.

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, or to compare two clusters.

Clusters found in Random Data

[Figure: 100 points distributed randomly in the unit square, and the clusters imposed on them by K-means, DBSCAN, and complete link.]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy).
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE).
o Relative Index: compares two different clusterings or clusters; often an external or internal index is used for this purpose.

Sometimes these are referred to as criteria instead of indices.

Measuring Cluster Validity Via Correlation

Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column for each data point, where an entry is 1 if the associated pair of points belongs to the same cluster and 0 otherwise.

Compute the correlation between the two matrices; since both are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated.

High correlation indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.
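A minimal sketch of this measure, assuming Euclidean proximity: build the condensed incidence vector from the cluster labels and correlate it with the condensed distance vector (with distances, a good clustering gives a strongly negative correlation, as in the figure below). The function name incidence_proximity_corr is ours.

```python
import numpy as np
from scipy.spatial.distance import pdist

def incidence_proximity_corr(X, labels):
    prox = pdist(X)                         # condensed distance matrix entries
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)  # same pair ordering as pdist
    incidence = (labels[iu[0]] == labels[iu[1]]).astype(float)
    return np.corrcoef(prox, incidence)[0, 1]
```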

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

[Figure: K-means on a data set with three well-separated clusters and on a random data set. Corr = -0.9235 and Corr = -0.5810, respectively.]

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: three well-separated clusters; sorted by cluster label, the similarity matrix shows three crisp diagonal blocks.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: DBSCAN on random data; the sorted similarity matrix blocks are present but fuzzy.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: K-means on random data; the blocks are likewise not crisp.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: complete link on random data; again the blocks are not crisp.]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN finds seven clusters in a more complicated data set; in the sorted similarity matrix the blocks overlap.]

Clusters in more complicated figures aren't well separated.

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

[Figure: a ten-cluster data set and its SSE-vs-K curve; SSE drops steeply as K approaches the natural number of clusters and flattens afterwards.]
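A sketch of estimating the number of clusters from the SSE curve with scikit-learn, whose KMeans estimator exposes the SSE of a fit as inertia_; one then looks for the "knee" in the plot of SSE against K. The helper name sse_curve is ours.

```python
from sklearn.cluster import KMeans

def sse_curve(X, k_values):
    """Return {K: SSE} for each candidate K."""
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_values}

# e.g. sse = sse_curve(X, range(2, 31)); plot K vs. SSE and look for the knee.
```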

Internal Measures SSE

SSE curve for a more complicated data set

[Figure: the seven-cluster data set whose SSE curve is shown.]

SSE of clusters found using K-means

Framework for Cluster Validity

Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more 'atypical' a clustering result is, the more likely it represents valid structure in the data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary, but there is still the question of whether the difference between two index values is significant.

Statistical Framework for SSE

Example: compare the SSE of three clusters found in the original data against the SSE of three clusters found in random data.

[Figure: histogram of the SSE of three clusters in random data, with values falling roughly between 0.016 and 0.034, alongside the original three-cluster data set.]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the clustered and the random data sets again. Corr = -0.9235 and Corr = -0.5810, respectively.]

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error

Internal Measures Cohesion and Separation

Cohesion is measured by the within-cluster sum of squares (SSE): $WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$

Separation is measured by the between-cluster sum of squares: $BSS = \sum_{i} |C_i|\,(m - m_i)^2$

where $|C_i|$ is the size of cluster $i$, $m_i$ is its centroid, and $m$ is the overall mean.

Internal Measures Cohesion and Separation

Example: SSE for four points 1, 2, 4, 5 on a line, with overall centroid m = 3 and cluster centroids m1 = 1.5 (for {1, 2}) and m2 = 4.5 (for {4, 5}).

K = 1 cluster:
$WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
$BSS = 4 \times (3-3)^2 = 0$
$Total = 10 + 0 = 10$

K = 2 clusters:
$WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$
$BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$
$Total = 1 + 9 = 10$

Note that WSS + BSS is the same for both values of K; the total is constant, so minimizing WSS is equivalent to maximizing BSS.

Internal Measures Cohesion and Separation

A proximity graph based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.

[Figure: proximity graph; within-cluster edges illustrate cohesion, between-cluster edges illustrate separation.]

Internal Measures Silhouette Coefficient

Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point i: calculate a = the average distance of i to the points in its own cluster, and b = the minimum, over the other clusters, of the average distance of i to the points in that cluster. Then s = 1 - a/b (when a < b); s typically lies between 0 and 1, and the closer to 1 the better.

Can calculate the average silhouette width for a cluster or a clustering.
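A sketch of computing it with scikit-learn; silhouette_score averages the per-point coefficient over a clustering, and the helper name silhouette_for_k is ours.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_for_k(X, k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)   # in [-1, 1]; closer to 1 is better
```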

External Measures of Cluster Validity: Entropy and Purity

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

(Algorithms for Clustering Data, Jain and Dubes)


Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 6: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Types of Clusterings A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering

Hierarchical clustering

Partitional Clustering

Original Points A Partitional Clustering

Hierarchical Clustering

p4p1

p3

p2

p4 p1

p3

p2 p4p1 p2 p3

p4p1 p2 p3

Traditional Hierarchical Clustering

Non-traditional Hierarchical Clustering Non-traditional Dendrogram

Traditional Dendrogram

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive

Fuzzy versus non-fuzzy

Partial versus complete

Heterogeneous versus homogeneous

Types of Clusters

Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function

Types of Clusters Well-Separated

Well-Separated Clusters

3 well-separated clusters

Types of Clusters Center-Based

Center-based

centroidmedoid

4 center-based clusters

Types of Clusters Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

8 contiguous clusters

Types of Clusters Density-Based

Density-based

6 density-based clusters

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach
o Each cluster is associated with a centroid (center point)
o Each point is assigned to the cluster with the closest centroid
o Number of clusters, K, must be specified
o The basic algorithm is very simple

K-means Clustering – Details

Initial centroids are often chosen randomly

The centroid m_i is (typically) the mean of the points in the cluster.

'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.

K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations.

Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.

$m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
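The basic algorithm: (1) select K points as initial centroids; (2) form K clusters by assigning each point to its closest centroid; (3) recompute the centroid of each cluster; (4) repeat 2–3 until the centroids no longer change. For concreteness, here is a minimal NumPy sketch of that loop; this is an illustration (function and variable names are ours), not the course's reference code.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means: X is an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Select K points as the initial centroids (chosen randomly).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Form K clusters: assign each point to its closest centroid
        # (closeness measured here by Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Calling kmeans(X, 5) returns the five centroids and one cluster label per point.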

Two different K-means Clusterings

[Figure: the same data set ("Original Points") clustered two ways by K-means with different initial centroids: an "Optimal Clustering" and a "Sub-optimal Clustering".]

Importance of Choosing Initial Centroids

[Figure: K-means iterations 1–6 with one choice of initial centroids, converging to the optimal clustering.]

Importance of Choosing Initial Centroids

[Figure: the same six iterations, shown as separate panels.]

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$

x is a data point in cluster $C_i$ and $m_i$ is the representative point (centroid) of cluster $C_i$.
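A direct translation of this formula for the output of the kmeans sketch above (names ours, reusing numpy as np):

```python
def sse(X, centroids, labels):
    # Sum, over all points, of the squared Euclidean distance from each
    # point to the centroid of its assigned cluster.
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())
```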

Importance of Choosing Initial Centroids …

[Figure: with a poor choice of initial centroids, K-means iterations 1–5 converge to the sub-optimal clustering.]

Importance of Choosing Initial Centroids …

[Figure: the same five iterations of the poor run, shown as separate panels.]

Problems with Selecting Initial Points

If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
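A quick back-of-the-envelope count shows just how small, assuming for simplicity that all K clusters have equal size n/K and that centroids are drawn uniformly at random:

$P(\text{each centroid from a different cluster}) = \frac{K! \,(n/K)^K}{n^K} = \frac{K!}{K^K}$

For example, for K = 10 this is $10!/10^{10} \approx 0.00036$, so random initialization almost never seeds every cluster.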

10 Clusters Example

[Figure: K-means iterations 1–4 on a data set with ten clusters arranged in five pairs.]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: the same ten-cluster run, iterations 1–4 shown as separate panels.]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: K-means iterations 1–4 for this initialization.]

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: the same run, iterations 1–4 shown as separate panels.]

Solutions to Initial Centroids Problem

o Multiple runs: helps, but probability is not on your side (a sketch follows this list)

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids (e.g., the most widely separated ones)

o Postprocessing

o Bisecting K-means
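A minimal sketch of the first strategy (multiple runs, keeping the run with the lowest SSE), reusing the kmeans and sse helpers sketched earlier; names are ours:

```python
def kmeans_best_of(X, k, n_runs=10):
    # Run K-means several times from different random initializations
    # and keep the clustering with the smallest SSE.
    best = None
    for seed in range(n_runs):
        centroids, labels = kmeans(X, k, seed=seed)
        err = sse(X, centroids, labels)
        if best is None or err < best[0]:
            best = (err, centroids, labels)
    return best
```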

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies:
o choose the point that contributes most to SSE as the replacement centroid
o choose a point from the cluster with the highest SSE
o if there are several empty clusters, repeat the above several times

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)
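The incremental update can be done in O(1) per assignment, since the mean of a cluster of size n becomes m + (x - m)/(n + 1) after adding point x. An illustrative sketch (assuming numpy as np and the array conventions of the earlier kmeans sketch; counts tracks cluster sizes):

```python
def assign_incremental(x, centroids, counts):
    # Assign one point and immediately update the winning centroid:
    # new mean of a cluster of size n after adding x is m + (x - m) / (n + 1).
    j = int(np.linalg.norm(centroids - x, axis=1).argmin())
    counts[j] += 1
    centroids[j] += (x - centroids[j]) / counts[j]
    return j
```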

Pre-processing and Post-processing

Pre-processing:
o normalize the data
o eliminate outliers

Post-processing:
o eliminate small clusters that may represent outliers
o split 'loose' clusters, i.e., clusters with relatively high SSE
o merge clusters that are 'close' and that have relatively low SSE

Bisecting K-means

Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering (a sketch follows).
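A sketch of the usual formulation, under the assumption that at each step we split the cluster with the largest SSE using 2-means and keep the best of several trial bisections; it reuses kmeans and sse from earlier, and the names are ours:

```python
def bisecting_kmeans(X, k, trials=5):
    clusters = [np.arange(len(X))]            # start: one cluster of all points
    while len(clusters) < k:
        # Pick the cluster with the largest SSE to split.
        def cluster_sse(idx):
            pts = X[idx]
            return float(((pts - pts.mean(axis=0)) ** 2).sum())
        i = max(range(len(clusters)), key=lambda t: cluster_sse(clusters[t]))
        idx = clusters.pop(i)
        # Try several 2-means bisections and keep the one with lowest SSE.
        best = None
        for seed in range(trials):
            c, lab = kmeans(X[idx], 2, seed=seed)
            err = sse(X[idx], c, lab)
            if best is None or err < best[0]:
                best = (err, lab)
        _, lab = best
        clusters += [idx[lab == 0], idx[lab == 1]]
    return clusters                            # list of index arrays, one per cluster
```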

Bisecting K-means Example: [Figure: successive bisecting steps on an example data set.]

Limitations of K-means

K-means has problems when clusters are of differing sizes, densities, or non-globular shapes (illustrated in the following slides).

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters: this finds parts of clusters, which then need to be put together.

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

[Figure: a clustering of points 1–6 and the corresponding dendrogram (leaf order 1, 3, 2, 5, 4, 6; merge heights up to 0.2).]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.

They may correspond to meaningful taxonomies (e.g., in the biological sciences).

Hierarchical Clustering

Two main types of hierarchical clustering:
o Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters, until only one cluster (or k clusters) is left.
o Divisive: start with one, all-inclusive cluster; at each step, split a cluster, until each cluster contains a single point (or there are k clusters).

Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

Agglomerative Clustering Algorithm

More popular hierarchical clustering technique.

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters
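The basic algorithm: (1) compute the proximity matrix; (2) let each data point be a cluster; (3) repeat: merge the two closest clusters and update the proximity matrix; (4) until only a single cluster remains. A naive O(N^3) sketch of this loop (names ours; linkage=min gives single link and linkage=max complete link, applied to a distance matrix):

```python
def agglomerative(D, linkage=min):
    # D: (n, n) distance matrix (nested lists or a NumPy array).
    # Returns the merge sequence: (cluster_a, cluster_b, distance) triples.
    clusters = {i: [i] for i in range(len(D))}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        # Find the two closest clusters under the chosen linkage.
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                ka, kb = keys[a], keys[b]
                d = linkage(D[i][j] for i in clusters[ka] for j in clusters[kb])
                if best is None or d < best[0]:
                    best = (d, ka, kb)
        d, ka, kb = best
        merges.append((ka, kb, d))
        clusters[ka] = clusters[ka] + clusters.pop(kb)   # merge kb into ka
    return merges
```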

Starting Situation

Start with clusters of individual points and a proximity matrix

[Figure: points p1–p5, each its own cluster, with the initial 5×5 proximity matrix.]

Intermediate Situation

After some merging steps we have some clusters

[Figure: clusters C1–C5 and the current proximity matrix.]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

[Figure: C2 and C5 highlighted as the closest pair in the proximity matrix.]

After Merging

The question is: "How do we update the proximity matrix?"

[Figure: clusters C1, C3, C4, and the merged C2 U C5; the row and column of the proximity matrix for C2 U C5 need new entries.]

How to Define Inter-Cluster Similarity

[Figure: two clusters of points p1–p5 and the proximity matrix; how should their similarity be defined?]

o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function – Ward's Method uses squared error


Cluster Similarity MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
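For illustration, the three linkage rules applied to this similarity matrix; a small sketch with 0-based indices and our own names (note that with similarities, the "closest" pair is the most similar one):

```python
import numpy as np

S = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
])

def cluster_sim(S, A, B):
    block = S[np.ix_(A, B)]          # all pairwise similarities between A and B
    return {
        "single link (most similar pair)":    block.max(),
        "complete link (least similar pair)": block.min(),
        "group average":                      block.mean(),
    }

# e.g. clusters {I1, I2} and {I4, I5}:
print(cluster_sim(S, [0, 1], [3, 4]))
```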


Hierarchical Clustering MIN

Nested Clusters Dendrogram

[Figure: single-link nested clusters of points 1–6 and the dendrogram (leaf order 3, 6, 2, 5, 4, 1; merge heights up to 0.2).]

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

(Using the same similarity matrix as in the MIN example above.)

Hierarchical Clustering MAX

Nested Clusters Dendrogram

[Figure: complete-link nested clusters of points 1–6 and the dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights up to 0.4).]

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters

• Biased towards globular clusters

Cluster Similarity Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Need to use average connectivity for scalability since total proximity favors large clusters

$\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i,\, p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \times |\text{Cluster}_j|}$

(Using the same similarity matrix as in the MIN example above.)

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

[Figure: group-average nested clusters of points 1–6 and the dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights up to 0.25).]

Hierarchical Clustering Group Average

Compromise between single and complete link.

Strengths: less susceptible to noise and outliers.

Limitations: biased towards globular clusters.

Cluster Similarity Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

[Figure: MIN, MAX, Group Average, and Ward's Method clusterings of the same six points, shown side by side.]

Hierarchical Clustering Time and Space requirements

O(N^2) space, since it uses the proximity matrix (N is the number of points).

O(N^3) time in many cases: there are N steps, and at each step the N×N proximity matrix must be searched and updated. Complexity can be reduced to O(N^2 log N) for some approaches.

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized.

Different schemes have problems with one or more of the following:
o sensitivity to noise and outliers
o difficulty handling different-sized clusters and non-convex shapes
o a tendency to break large clusters

MST Divisive Hierarchical Clustering

Build an MST (minimum spanning tree) of the proximity graph.

Use the MST for constructing the hierarchy of clusters: repeatedly create a new cluster by breaking the remaining MST edge with the largest distance (smallest similarity).
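A sketch using SciPy (an assumption: SciPy is not part of the course materials). D is a dense distance matrix, and k clusters are formed by breaking the k - 1 largest MST edges:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(D, k):
    # D: dense (n, n) distance matrix; returns one cluster label per point.
    n = len(D)
    mst = minimum_spanning_tree(D).toarray()   # n - 1 edges, stored one-way
    edges = np.argwhere(mst > 0)
    weights = mst[mst > 0]                     # same (row-major) order as edges
    # Break the k - 1 largest MST edges; keep the rest.
    keep = edges[np.argsort(weights)[:-(k - 1)]] if k > 1 else edges
    adj = csr_matrix((np.ones(len(keep)), (keep[:, 0], keep[:, 1])), shape=(n, n))
    # Each connected component of the remaining forest is one cluster.
    return connected_components(adj, directed=False)[1]
```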

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?

Clusters found in Random Data

[Figure: 100 uniformly random points ("Random Points") and the "clusters" found in them by K-means, DBSCAN, and complete link.]

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy)

o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE)

o Relative Index: compares two different clusterings or clusters; often an external or internal index is used for this

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices:
o the proximity matrix
o the "incidence" matrix: one row and one column per point; an entry is 1 if the pair of points belongs to the same cluster, 0 otherwise

Compute the correlation between the two matrices; since both are symmetric, only n(n-1)/2 entries need to be compared.

High correlation indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

[Figure: K-means clusterings of two data sets on x–y axes: three well-separated clusters, and uniformly random points.]

Corr = -0.9235 (well-separated clusters); Corr = -0.5810 (random points)
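A sketch of the computation for a distance matrix D and cluster labels (names ours; with distances, the correlation comes out negative, since same-cluster pairs should have small distances):

```python
import numpy as np

def incidence_proximity_corr(D, labels):
    n = len(D)
    iu = np.triu_indices(n, k=1)     # matrices are symmetric: upper triangle only
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    return np.corrcoef(D[iu], incidence[iu])[0, 1]
```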

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

[Figure: three well-separated clusters on x–y axes and their point-to-point similarity matrix sorted by cluster label; the block-diagonal structure is crisp.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: sorted similarity matrix for DBSCAN clusters found in random data; the block structure is much less crisp.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: sorted similarity matrix for K-means clusters found in random data.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: sorted similarity matrix for complete-link clusters found in random data.]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clusters 1–7 in a more complicated data set and the corresponding sorted similarity matrix.]

Clusters in more complicated figures aren't well separated.

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE).

Can also be used to estimate the number of clusters (a sketch follows below).

Internal Measures SSE

[Figure: a ten-cluster data set and the K-means SSE-versus-K curve for K = 2 to 30; SSE drops sharply up to the natural numbers of clusters.]

Internal Measures SSE

SSE curve for a more complicated data set

[Figure: the seven-cluster data set for which the SSE curve is computed.]

SSE of clusters found using K-means
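A sketch of using SSE to estimate the number of clusters, reusing kmeans_best_of from earlier (names ours): compute the curve over K and look for a knee.

```python
def sse_curve(X, k_max=10):
    # SSE of the best of several K-means runs for each K; a pronounced
    # "knee" in this curve suggests a natural number of clusters.
    curve = {}
    for k in range(1, k_max + 1):
        err, _, _ = kmeans_best_of(X, k)
        curve[k] = err
    return curve
```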

Need a framework to interpret any measure: if our measure of evaluation has the value, say, 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Framework for Cluster Validity

Example

Statistical Framework for SSE

[Figure: histogram of SSE values (roughly 0.016–0.034) for clusterings of random data, used as a reference distribution, alongside the three well-separated clusters on x–y axes.]

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

[Figure: the same two data sets: three well-separated clusters and uniformly random points.]

Corr = -0.9235; Corr = -0.5810

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error.

Internal Measures Cohesion and Separation

Cohesion is measured by the within-cluster sum of squares (SSE):

$WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$

Separation is measured by the between-cluster sum of squares:

$BSS = \sum_{i} |C_i| \, (m - m_i)^2$

where m is the overall mean and $m_i$ is the centroid of cluster $C_i$.

Internal Measures Cohesion and Separation

Example SSE

Four points 1, 2, 4, 5 on a number line, with overall mean m = 3; for K = 2, the clusters are {1, 2} with centroid m1 = 1.5 and {4, 5} with centroid m2 = 4.5.

K = 1 cluster:
$WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
$BSS = 4 \times (3-3)^2 = 0$
$Total = 10 + 0 = 10$

K = 2 clusters:
$WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$
$BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$
$Total = 1 + 9 = 10$

Note that BSS + WSS is the same in both cases: it equals the total sum of squares about the overall mean.
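A quick check of the arithmetic in NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                    # overall mean: 3.0
for clusters in ([x], [x[:2], x[2:]]):          # K = 1, then K = 2
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    print(len(clusters), "cluster(s): WSS =", wss, "BSS =", bss,
          "Total =", wss + bss)                 # Total is 10.0 in both cases
```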

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

[Figure: proximity-graph view. Cohesion is the sum of the weights of edges within a cluster; separation is the sum of the weights of edges between clusters.]

Silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings.

For an individual point i:
o a = average distance of i to the points in its own cluster
o b = min, over the other clusters, of the average distance of i to the points in that cluster
o silhouette coefficient s = (b - a) / max(a, b), typically between 0 and 1; the closer to 1, the better

Can calculate the average silhouette width for a cluster or a clustering.

Internal Measures Silhouette Coefficient
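In practice this is available off the shelf; for example, with scikit-learn (an assumption: not part of the course materials) and the kmeans sketch from earlier:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
centroids, labels = kmeans(X, 3)                 # the sketch defined earlier
print("average silhouette width:", silhouette_score(X, labels))
print("worst point's coefficient:", silhouette_samples(X, labels).min())
```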

External Measures of Cluster Validity: Entropy and Purity
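The slide's details did not survive extraction; the standard definitions, in our notation, are as follows. For cluster j, let $p_{ij} = n_{ij} / n_j$ be the probability that a member of cluster j belongs to class i (where $n_{ij}$ is the number of points of class i in cluster j and $n_j$ is the size of cluster j):

$e_j = -\sum_{i} p_{ij} \log_2 p_{ij}, \qquad \text{purity}_j = \max_i p_{ij}$

$e = \sum_{j} \frac{n_j}{n} e_j, \qquad \text{purity} = \sum_{j} \frac{n_j}{n} \text{purity}_j$

Low entropy and high purity indicate clusters dominated by a single class.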

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function …
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering – Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids …
  • Importance of Choosing Initial Centroids …
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Ward's Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 7: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Partitional Clustering

Original Points A Partitional Clustering

Hierarchical Clustering

p4p1

p3

p2

p4 p1

p3

p2 p4p1 p2 p3

p4p1 p2 p3

Traditional Hierarchical Clustering

Non-traditional Hierarchical Clustering Non-traditional Dendrogram

Traditional Dendrogram

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive

Fuzzy versus non-fuzzy

Partial versus complete

Heterogeneous versus homogeneous

Types of Clusters

Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function

Types of Clusters Well-Separated

Well-Separated Clusters

3 well-separated clusters

Types of Clusters Center-Based

Center-based

centroidmedoid

4 center-based clusters

Types of Clusters Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

8 contiguous clusters

Types of Clusters Density-Based

Density-based

6 density-based clusters

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach
o Each cluster is associated with a centroid (center point)
o Each point is assigned to the cluster with the closest centroid
o Number of clusters, K, must be specified
o The basic algorithm is very simple

K-means Clustering – Details

Initial centroids are often chosen randomly.

The centroid m_i is (typically) the mean of the points in the cluster:

$$m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations.

Complexity is O(n · K · I · d), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.
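To make the iteration concrete, here is a minimal NumPy sketch of the basic algorithm described above; the function name, seed handling, and convergence test are our own illustrative choices, not part of the slides.

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        """Basic K-means: assign each point to the closest centroid, then
        recompute each centroid as the mean of the points assigned to it."""
        rng = np.random.default_rng(seed)
        # Initial centroids are often chosen randomly (here: k distinct data points).
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: 'closeness' measured by Euclidean distance.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: centroid m_i is the mean of the points in cluster C_i.
            new_centroids = np.array([
                X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
                for i in range(k)])
            if np.allclose(new_centroids, centroids):   # converged
                break
            centroids = new_centroids
        return centroids, labels

Since most of the centroid movement happens in the first few iterations, a stopping condition such as "until the centroids do not change" is usually cheap in practice.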

Two different K-means Clusterings

[Figure: the original points and two different K-means clusterings of them, one optimal and one sub-optimal.]

Importance of Choosing Initial Centroids

[Figure: K-means iterations 1–6 for one choice of initial centroids.]

Importance of Choosing Initial Centroids

[Figure: K-means iterations 1–6, shown step by step.]

Evaluating K-means Clusters

The most common measure is the Sum of Squared Error (SSE):

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

For each point, the error is the distance to the centroid m_i of its cluster C_i; SSE sums the squares of these errors. Given two clusterings of the same data, we prefer the one with the smaller SSE.
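A direct NumPy transcription of the formula, assuming the centroids and labels arrays returned by the illustrative kmeans sketch earlier:

    import numpy as np

    def sse(X, centroids, labels):
        """Sum of Squared Error: squared Euclidean distance of each point
        to the centroid of the cluster it is assigned to, summed over points."""
        return float(((X - centroids[labels]) ** 2).sum())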

Importance of Choosing Initial Centroids …

[Figure: K-means iterations 1–5 for a poorer choice of initial centroids.]

Importance of Choosing Initial Centroids …

[Figure: K-means iterations 1–5 for another poor choice of initial centroids.]

Problems with Selecting Initial Points

If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small; a worked version of this claim follows below.
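Under the illustrative assumptions that the K clusters all contain n points and that the K initial centroids are drawn uniformly from the data:

    \[
    P(\text{one centroid per cluster})
      \approx \frac{K!\, n^{K}}{(Kn)^{K}}
      = \frac{K!}{K^{K}},
    \qquad \text{e.g. } K = 10:\ \frac{10!}{10^{10}} \approx 0.00036 .
    \]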

10 Clusters Example

[Figure: K-means iterations 1–4 on a data set with ten clusters.]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: K-means iterations 1–4 on the ten-cluster data set, shown step by step.]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: K-means iterations 1–4 with the unevenly distributed initial centroids.]

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: K-means iterations 1–4 with the unevenly distributed initial centroids, shown step by step.]

Solutions to Initial Centroids Problem

o Multiple runs (see the sketch below)
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
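A minimal sketch of the 'multiple runs' strategy, reusing the illustrative kmeans and sse functions from earlier: run K-means from several random initializations and keep the run with the lowest SSE.

    def best_of_runs(X, k, runs=10):
        """Run K-means from several random initializations and keep the
        clustering with the lowest SSE."""
        best = None
        for seed in range(runs):
            centroids, labels = kmeans(X, k, seed=seed)
            err = sse(X, centroids, labels)
            if best is None or err < best[0]:
                best = (err, centroids, labels)
        return best  # (sse, centroids, labels) of the best run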

Handling Empty Clusters

The basic K-means algorithm can yield empty clusters.

Several strategies exist for replacing an empty cluster's centroid, e.g., choose the point that contributes most to SSE, or choose a point from the cluster with the highest SSE; if there are several empty clusters, this can be repeated.

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing: normalize the data and eliminate outliers.

Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters (clusters with relatively high SSE); merge clusters that are 'close' and have relatively low SSE.

Bisecting K-means

Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering by repeatedly picking a cluster and splitting it in two with basic K-means; a sketch follows below.
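A hedged sketch of the bisecting idea, again reusing the illustrative kmeans sketch; choosing the cluster with the largest SSE to split is one common choice among several.

    import numpy as np

    def bisecting_kmeans(X, k):
        """Start with all points in one cluster; repeatedly bisect the
        cluster with the largest SSE using 2-means until k clusters remain."""
        clusters = [np.arange(len(X))]        # each cluster = array of point indices
        while len(clusters) < k:
            def cluster_sse(idx):
                pts = X[idx]
                return ((pts - pts.mean(axis=0)) ** 2).sum()
            worst = max(range(len(clusters)), key=lambda i: cluster_sse(clusters[i]))
            idx = clusters.pop(worst)
            _, labels = kmeans(X[idx], 2)     # bisect with basic 2-means
            clusters += [idx[labels == 0], idx[labels == 1]]
        return clusters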

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters: this finds parts of the natural clusters, which then need to be put back together.

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequence of merges or splits.

[Figure: nested clusters over six points and the corresponding dendrogram.]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number can be obtained by 'cutting' the dendrogram at the proper level.

The resulting clusters may correspond to meaningful taxonomies (e.g., in the biological sciences).

Hierarchical Clustering

Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and, at each step, merge the closest pair of clusters) and divisive (start with one all-inclusive cluster and, at each step, split a cluster).

Traditional hierarchical algorithms use a similarity or distance matrix, merging or splitting one cluster at a time.

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat: merge the two closest clusters and update the proximity matrix
4. Until only a single cluster remains

Key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms. A naive sketch follows below.
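The basic algorithm as a deliberately naive, unoptimized Python sketch over a precomputed distance matrix D; the function and parameter names are ours, and MIN and MAX linkage are shown (other proximity definitions slot into the same place).

    import numpy as np

    def agglomerative(D, linkage="single"):
        """Merge the two closest clusters until one remains; return the
        merge history as (cluster_a, cluster_b, distance) triples."""
        clusters = {i: [i] for i in range(len(D))}
        agg = min if linkage == "single" else max   # MIN vs MAX linkage
        merges = []
        while len(clusters) > 1:
            best, pair = np.inf, None
            keys = list(clusters)
            for i, p in enumerate(keys):            # search all cluster pairs
                for q in keys[i + 1:]:
                    d = agg(D[x, y] for x in clusters[p] for y in clusters[q])
                    if d < best:
                        best, pair = d, (p, q)
            a, b = pair
            merges.append((a, b, best))
            clusters[a] += clusters.pop(b)          # merge cluster b into a
        return merges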

Starting Situation

Start with clusters of individual points and a proximity matrix

[Figure: points p1–p5, each initially its own cluster, and the corresponding proximity matrix.]

Intermediate Situation

After some merging steps we have some clusters

[Figure: clusters C1–C5 after some merging steps, with the corresponding proximity matrix.]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

[Figure: clusters C1–C5, with the two closest clusters C2 and C5 about to be merged, and the proximity matrix.]

After Merging

The question is "How do we update the proximity matrix?"

[Figure: clusters C1, C3, C4, and the merged cluster C2 ∪ C5; the proximity-matrix rows and columns involving C2 ∪ C5 must be recomputed.]

How to Define Inter-Cluster Similarity

[Figure: points p1–p5 and the corresponding proximity matrix.]

Ways to define inter-cluster similarity:
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function – Ward's Method uses squared error


Cluster Similarity MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by a single link in the proximity graph.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering MIN

[Figure: nested clusters and the corresponding dendrogram for single-link (MIN) clustering of six points.]

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering MAX

[Figure: nested clusters and the corresponding dendrogram for complete-link (MAX) clustering of the same six points.]

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters

• Biased towards globular clusters

Cluster Similarity Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

$$\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i,\; p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \times |\text{Cluster}_j|}$$

Need to use average connectivity for scalability, since total proximity favors large clusters.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering Group Average

[Figure: nested clusters and the corresponding dendrogram for group-average clustering of the same six points.]

Hierarchical Clustering Group Average

Compromise between single and complete link.
Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.

Cluster Similarity Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means
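For reference, SciPy implements all four of the linkage definitions discussed above; a short illustration (SciPy is our choice of tooling here, not something the slides use):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    X = np.random.rand(20, 2)                       # 20 random 2-D points
    for method in ["single", "complete", "average", "ward"]:  # MIN, MAX, group average, Ward
        Z = linkage(X, method=method)               # (n-1) x 4 merge history
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
        print(method, labels)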

Hierarchical Clustering Comparison

[Figure: nested-cluster diagrams comparing MIN, MAX, group average, and Ward's method on the same six points.]

Hierarchical Clustering Time and Space requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N steps, and at each step a proximity matrix of size O(N²) must be searched and updated. The complexity can be reduced to O(N² log N) for some approaches.

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized.

Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different-sized clusters and convex shapes; breaking large clusters.

MST Divisive Hierarchical Clustering

Build the MST (Minimum Spanning Tree): start with a tree consisting of any point; in successive steps, look for the closest pair of points (p, q) such that p is in the current tree but q is not, and add q and the edge (p, q) to the tree.

Use the MST for constructing a hierarchy of clusters: repeatedly create a new cluster by breaking the link corresponding to the largest distance (smallest similarity); a sketch follows below.
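A hedged sketch of the second step using SciPy's MST and connected-components routines (our choice of tooling, assuming k ≥ 1): cutting the k−1 longest tree edges leaves k clusters.

    import numpy as np
    from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def mst_divisive(X, k):
        """Build the MST of the points, cut the k-1 longest edges, and
        return the resulting connected components as cluster labels."""
        mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
        edges = np.argwhere(mst > 0)                # tree edges as (i, j) index pairs
        order = np.argsort(mst[mst > 0])            # edge indices sorted by length
        for i, j in edges[order[len(order) - (k - 1):]]:  # break the longest links
            mst[i, j] = 0
        _, labels = connected_components(mst, directed=False)
        return labels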

Cluster Validity

For supervised classification, we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, and to compare two clusters.

Clusters found in Random Data

[Figure: a set of random points, and the 'clusters' found in them by K-means, DBSCAN, and complete-link clustering.]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: compares two different clusterings or clusters, often using an external or internal index

Sometimes these are referred to as criteria instead of indices.

Measuring Cluster Validity Via Correlation

Two matrices:
o Proximity matrix
o 'Incidence' matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise

Compute the correlation between the two matrices; since the matrices are symmetric, only the n(n − 1)/2 entries above the diagonal need to be compared. High correlation indicates that points that belong to the same cluster are close to each other. Not a good measure for some density- or contiguity-based clusters. A sketch follows below.
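A small NumPy sketch of this measure (ours, not from the slides); with distances as the proximity matrix, a good clustering gives a strongly negative correlation, matching the signs quoted on the next slide.

    import numpy as np

    def incidence_proximity_corr(X, labels):
        """Correlation between the pairwise-distance matrix and the cluster
        incidence matrix, computed over the upper-triangle entries only."""
        prox = np.linalg.norm(X[:, None] - X[None, :], axis=2)
        incid = (labels[:, None] == labels[None, :]).astype(float)
        iu = np.triu_indices(len(X), k=1)       # matrices are symmetric
        return np.corrcoef(prox[iu], incid[iu])[0, 1]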

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

[Figure: the two data sets, clustered with K-means.]

Corr = −0.9235    Corr = −0.5810

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: three well-separated clusters and their sorted similarity matrix, which shows a clear block structure along the diagonal.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: sorted similarity matrix for the clusters DBSCAN finds in random points; the block structure is much less distinct.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: sorted similarity matrix for the clusters K-means finds in random points.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: sorted similarity matrix for the clusters complete link finds in random points.]

Using Similarity Matrix for Cluster Validation

[Figure: a more complicated data set in which DBSCAN finds seven clusters, and its sorted similarity matrix.]

Clusters in more complicated figures aren't well separated.

Internal Measures SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information. SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters, as the sketch after the next figure shows.

[Figure: a ten-cluster data set and the SSE of K-means clusterings as a function of K; a knee in the curve suggests the natural number of clusters.]
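A sketch of this use, assuming X is the data matrix and reusing the illustrative kmeans and sse functions from earlier: compute SSE for a range of K and look for the knee.

    for k in range(2, 11):
        centroids, labels = kmeans(X, k)      # X: the data matrix, as before
        print(k, sse(X, centroids, labels))   # plot k vs. SSE and look for a knee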

Internal Measures SSE

[Figure: SSE curve for a more complicated data set; the SSE is that of the clusters found using K-means.]

Framework for Cluster Validity

Need a framework to interpret any measure: if a measure of evaluation takes the value 10, is that good, fair, or poor? Statistics provide such a framework for cluster validity: the more 'atypical' a clustering result is with respect to random data, the more likely it represents valid structure. For comparing the results of two different sets of cluster analyses, a framework is less necessary, although there is still the question of whether the difference between two index values is significant.

Statistical Framework for SSE

Example: compare the SSE of a clustering of the given data against the distribution of SSEs obtained by clustering random data drawn from the same range.

[Figure: histogram of the SSE values of K-means clusterings of random data (roughly 0.016 to 0.034), next to the well-separated three-cluster data set, whose SSE falls far below that range.]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the two data sets, one with three well-separated clusters and one of random points.]

Corr = −0.9235    Corr = −0.5810

Internal Measures Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are. Example (squared error): cohesion is measured by the within-cluster sum of squares (WSS = SSE):

$$\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$$

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. With squared error, separation is measured by the between-cluster sum of squares, where |C_i| is the size of cluster i and m is the overall mean:

$$\mathrm{BSS} = \sum_{i} |C_i|\,(m - m_i)^2$$

Internal Measures Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster means m1 = 1.5 (for {1, 2}) and m2 = 4.5 (for {4, 5}).

K = 1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters:
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10

Note that WSS + BSS is constant here, so minimizing WSS (cohesion) and maximizing BSS (separation) are equivalent.
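The same computation as a NumPy sketch (ours, not from the slides), checked against the worked example above:

    import numpy as np

    def wss_bss(X, labels):
        """Within-cluster (cohesion) and between-cluster (separation)
        sums of squares for a given clustering."""
        m = X.mean(axis=0)                        # overall mean
        wss = bss = 0.0
        for c in np.unique(labels):
            pts = X[labels == c]
            mi = pts.mean(axis=0)                 # cluster mean
            wss += ((pts - mi) ** 2).sum()
            bss += len(pts) * ((m - mi) ** 2).sum()
        return wss, bss

    X = np.array([[1.0], [2.0], [4.0], [5.0]])
    print(wss_bss(X, np.array([0, 0, 1, 1])))     # (1.0, 9.0)  -> K = 2
    print(wss_bss(X, np.array([0, 0, 0, 0])))     # (10.0, 0.0) -> K = 1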

Internal Measures Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside it.

Internal Measures Silhouette Coefficient

The silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings.

For an individual point i:
o a = average distance of i to the points in its own cluster
o b = minimum, over clusters other than i's, of the average distance of i to the points in that cluster
o s = (b − a) / max(a, b)   (typically between 0 and 1; the closer to 1, the better)

Can calculate the average silhouette width for a cluster or a clustering.
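An illustration with scikit-learn's implementation (our choice of library): average silhouette width for K-means clusterings at several values of K.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(200, 2)
    for k in range(2, 6):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))     # higher average width is better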

External Measures of Cluster Validity Entropy and Purity

[Table of per-cluster entropy and purity values omitted from this extraction.]

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." – Algorithms for Clustering Data, Jain and Dubes

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 9: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive

Fuzzy versus non-fuzzy

Partial versus complete

Heterogeneous versus homogeneous

Types of Clusters

Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function

Types of Clusters Well-Separated

Well-Separated Clusters

3 well-separated clusters

Types of Clusters Center-Based

Center-based

centroidmedoid

4 center-based clusters

Types of Clusters Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

8 contiguous clusters

Types of Clusters Density-Based

Density-based

6 density-based clusters

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 10: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Types of Clusters

Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function

Types of Clusters Well-Separated

Well-Separated Clusters

3 well-separated clusters

Types of Clusters Center-Based

Center-based

centroidmedoid

4 center-based clusters

Types of Clusters Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

8 contiguous clusters

Types of Clusters Density-Based

Density-based

6 density-based clusters

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means
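In practice Ward's method is usually obtained from a library. A minimal sketch, assuming SciPy is available, using toy random data (the cut at three clusters is arbitrary):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy 2-D points
Z = linkage(X, method="ward")                    # each merge minimizes the increase in squared error
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
print(labels)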

Hierarchical Clustering: Comparison

[Figure: the same six points clustered with MIN, MAX, Group Average, and Ward's Method, shown side by side]

Hierarchical Clustering: Time and Space Requirements

O(N^2) space, since it uses the proximity matrix (N is the number of points).

O(N^3) time in many cases: there are N merge steps, and at each step the proximity matrix must be searched and updated.

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized.

Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling clusters of different sizes and non-globular shapes; breaking large clusters.

MST: Divisive Hierarchical Clustering

Build MST (Minimum Spanning Tree).

Use MST for constructing a hierarchy of clusters.
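A minimal sketch of the MST idea, assuming SciPy: build the minimum spanning tree of the distance graph, then repeatedly cut the longest remaining edge; the connected components are the clusters.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(30, 2)                    # toy 2-D points
D = squareform(pdist(X))                     # dense pairwise distances
mst = minimum_spanning_tree(D).toarray()     # weighted adjacency of the MST

k = 3                                        # desired number of clusters
for _ in range(k - 1):
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0                          # cut the longest remaining edge
n_components, labels = connected_components(mst, directed=False)
print(labels)                                # cluster id per point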

Cluster Validity

For supervised classification, we have a variety of measures to evaluate how good our model is.

For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?

Clusters found in Random Data

[Figure: four panels over the same random points in the unit square — Random Points, K-means, DBSCAN, and Complete Link — each algorithm "finds" clusters even though the data has no real structure]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index

Sometimes these are referred to as criteria instead of indices.

Measuring Cluster Validity Via Correlation

Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column per point, whose entry is 1 if the pair of points belongs to the same cluster and 0 otherwise.

Compute the correlation between the two matrices. Since both matrices are symmetric, only n(n-1)/2 entries need to be compared.

High correlation indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.
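A minimal sketch of this computation, assuming Euclidean data X and an integer NumPy array of cluster labels (the function name is ours). With a distance matrix as the proximity matrix, tight clusters give a strongly *negative* correlation, which is why the Corr values below are negative:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def incidence_proximity_corr(X, labels):
    D = squareform(pdist(X))                        # proximity = distance matrix
    inc = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices_from(D, k=1)               # each unordered pair once
    return np.corrcoef(D[iu], inc[iu])[0, 1]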

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two scatter plots of points in the unit square]

Corr = -0.9235        Corr = -0.5810

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: a scatter plot of well-separated clusters in the unit square, and the corresponding cluster-ordered 100×100 similarity matrix showing crisp diagonal blocks (similarity scale 0 to 1)]
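A minimal sketch of this visual check, assuming NumPy, SciPy, and matplotlib (the rescaling 1 - d/d_max into a [0, 1] similarity is one arbitrary choice among many):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

def plot_sorted_similarity(X, labels):
    order = np.argsort(labels)             # group points by cluster label
    D = squareform(pdist(X))
    S = 1.0 - D / D.max()                  # distances rescaled to similarities
    plt.imshow(S[np.ix_(order, order)], cmap="jet")
    plt.colorbar(label="Similarity")
    plt.show()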

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figures: for the random data set, the cluster-ordered similarity matrices produced by DBSCAN, K-means, and Complete Link all show fuzzy, poorly separated diagonal blocks]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN on a more complicated data set with clusters labeled 1-7, and its cluster-ordered similarity matrix over roughly 3000 points]

Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE).

Can also be used to estimate the number of clusters.

[Figure: SSE as a function of K (K = 2 to 30) for the ten-cluster example data set; the curve drops sharply and then flattens, suggesting the natural number of clusters]
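A minimal sketch of estimating K from the SSE curve, assuming scikit-learn is available (KMeans exposes the within-cluster SSE as inertia_); plot these values against K and look for the knee:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)  # placeholder data
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # inertia_ is the within-cluster SSE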

Internal Measures: SSE

SSE curve for a more complicated data set:

[Figure: the complicated data set with clusters labeled 1-7, and the SSE of clusters found using K-means]

Framework for Cluster Validity

Need a framework to interpret any measure.

Statistics provide a framework for cluster validity.

For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Statistical Framework for SSE

Example:

[Figure: histogram of SSE values (roughly 0.016 to 0.034, counts up to 50) obtained by clustering many sets of random points in the unit square; the SSE of a proposed clustering can be compared against this reference distribution]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the same two scatter plots as before]

Corr = -0.9235        Corr = -0.5810

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error

Cohesion is measured by the within-cluster sum of squares:

WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares:

BSS = \sum_i |C_i| \, (m - m_i)^2

where m_i is the centroid of cluster C_i and m is the overall mean.

Internal Measures: Cohesion and Separation

Example: SSE for the four points 1, 2, 4, 5 on a line, with overall centroid m = 3.

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 \times (3-3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters, {1, 2} with centroid m_1 = 1.5 and {4, 5} with centroid m_2 = 4.5:
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9
Total = 1 + 9 = 10

Note that WSS + BSS is constant: it equals the total sum of squares about the overall mean.
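A quick numeric check of the example above (plain NumPy; the helper wss_bss is ours):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                     # overall centroid, 3.0

def wss_bss(clusters):
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))             # (10.0, 0.0) for K = 1
print(wss_bss([x[:2], x[2:]]))  # (1.0, 9.0)  for K = 2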

Internal Measures: Cohesion and Separation

A proximity graph based approach can also be used for cohesion and separation.

[Figure: cohesion as the sum of edge weights within a cluster, separation as the sum of edge weights between clusters]

Internal Measures: Silhouette Coefficient

Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings.

For an individual point i:
a = average distance of i to the points in its own cluster
b = minimum, over the other clusters, of the average distance of i to the points in that cluster
s = (b - a) / max(a, b)

Can calculate the average silhouette width for a cluster or a clustering.
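A minimal sketch, assuming scikit-learn, which provides both the per-point coefficients and the average silhouette width:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.rand(200, 2)                       # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))    # average silhouette width of the clustering
print(silhouette_samples(X, labels))  # per-point silhouette coefficients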

External Measures of Cluster Validity: Entropy and Purity

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

— Algorithms for Clustering Data, Jain and Dubes

Page 11: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Types of Clusters Well-Separated

Well-Separated Clusters

3 well-separated clusters

Types of Clusters Center-Based

Center-based

centroidmedoid

4 center-based clusters

Types of Clusters Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

8 contiguous clusters

Types of Clusters Density-Based

Density-based

6 density-based clusters

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 12: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Types of Clusters Center-Based

Center-based

centroidmedoid

4 center-based clusters

Types of Clusters Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

8 contiguous clusters

Types of Clusters Density-Based

Density-based

6 density-based clusters

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 13: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,


Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach
o Each cluster is associated with a centroid (center point)
o Each point is assigned to the cluster with the closest centroid
o Number of clusters, K, must be specified
o The basic algorithm is very simple

K-means Clustering – Details

Initial centroids are often chosen randomly.

The centroid m_i is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.

K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations.

Complexity is O(n × K × I × d), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.

$m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
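To make the basic algorithm concrete, here is a minimal K-means sketch in Python with NumPy. It is an illustration, not the course's reference code; the sample data, K = 3, and the convergence test are arbitrary assumptions.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means: random initial centroids, Euclidean 'closeness'."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its points: m_i = (1/|C_i|) * sum of x in C_i.
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k]            # keep the old centroid if a cluster is empty
                                  for k in range(K)])
        if np.allclose(new_centroids, centroids):              # centroids stopped moving: converged
            break
        centroids = new_centroids
    return centroids, labels

# Three blobs; most of the convergence happens in the first few iterations.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [4, 4], [0, 4])])
centroids, labels = kmeans(X, K=3)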

Two different K-means Clusterings

[Figure: the same set of points (Original Points) clustered two ways, an Optimal Clustering and a Sub-optimal Clustering; x-y scatter plots]

Importance of Choosing Initial Centroids

[Figure: K-means iterations 1-6 with a good initialization, converging to the optimal clustering; x-y scatter plots]

Importance of Choosing Initial Centroids

[Figure: the six iterations above shown one panel per iteration]

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$
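A direct transcription of this measure, assuming dist is the Euclidean distance (a sketch that reuses the kmeans output above):

import numpy as np

def sse(X, labels, centroids):
    """SSE: for each cluster C_i, sum the squared distances of its points to m_i."""
    total = 0.0
    for k, m in enumerate(centroids):
        pts = X[labels == k]
        total += np.sum((pts - m) ** 2)   # dist^2(m_k, x) summed over x in C_k
    return total

Given two clusterings produced by different initializations, the one with the smaller SSE is usually preferred.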

Importance of Choosing Initial Centroids …

[Figure: K-means iterations 1-5 with a poor initialization, converging to a sub-optimal clustering; x-y scatter plots]

Importance of Choosing Initial Centroids …

[Figure: the five iterations above shown one panel per iteration]

Problems with Selecting Initial Points

If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small. For K clusters of equal size n, the chance is K! n^K / (Kn)^K = K!/K^K; for K = 10 this is 10!/10^10 ≈ 0.00036.

10 Clusters Example

[Figure: K-means iterations 1-4 on a data set of ten clusters arranged in five pairs; x from 0 to 20, y from -6 to 8]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: the four iterations above shown one panel per iteration]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while others have only one

[Figure: K-means iterations 1-4 for this initialization; x from 0 to 20, y from -6 to 8]

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while others have only one

[Figure: the four iterations above shown one panel per iteration]

Solutions to Initial Centroids Problem

o Multiple runs: helps, but probability is not on your side; keep the run with the lowest SSE (see the sketch below)

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids the ones that are most widely separated

o Postprocessing

o Bisecting K-means: not as susceptible to initialization issues
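A sketch of the 'multiple runs' strategy, reusing the kmeans and sse helpers from above (the number of runs is an arbitrary assumption):

best = None
for seed in range(10):                       # multiple runs, different random initializations
    centroids, labels = kmeans(X, K=3, seed=seed)
    err = sse(X, labels, centroids)
    if best is None or err < best[0]:        # keep the run with the lowest SSE
        best = (err, centroids, labels)
best_sse, best_centroids, best_labels = best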

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies: choose the point that contributes most to SSE as the replacement centroid, or choose a point from the cluster with the highest SSE; if there are several empty clusters, repeat this several times.

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)
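A sketch of the incremental variant: each assignment immediately updates the chosen centroid as a running mean (the helper name is mine, not the lecture's):

import numpy as np

def assign_and_update(x, centroids, counts):
    """Assign point x to its closest centroid and update that centroid at once."""
    k = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    counts[k] += 1
    centroids[k] += (x - centroids[k]) / counts[k]   # running-mean update of m_k
    return k

This is more expensive and introduces an order dependency (the result can depend on the order in which points are processed), but it never produces an empty cluster.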

Pre-processing and Post-processing

Pre-processing: normalize the data and eliminate outliers.

Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters (clusters with relatively high SSE); merge clusters that are 'close' and have relatively low SSE.

Bisecting K-means

Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering. Start with all points in one cluster; repeatedly pick a cluster (e.g., the one with the highest SSE) and split it with basic 2-means, until K clusters are obtained; a sketch follows.
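A sketch of that splitting loop, reusing the kmeans helper above (choosing the cluster with the largest SSE to split is one common criterion; the code is an assumption, not the lecture's):

import numpy as np

def bisecting_kmeans(X, K):
    """Start with one cluster; repeatedly bisect the worst cluster with 2-means."""
    clusters = [np.arange(len(X))]                 # index sets, one per cluster
    while len(clusters) < K:
        # Pick the cluster with the largest SSE around its own mean.
        j = max(range(len(clusters)),
                key=lambda i: np.sum((X[clusters[i]] - X[clusters[i]].mean(axis=0)) ** 2))
        worst = clusters.pop(j)                    # (assumes the chosen cluster has at least two points)
        _, lab = kmeans(X[worst], K=2)             # bisect with basic 2-means
        clusters += [worst[lab == 0], worst[lab == 1]]
    return clusters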

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing sizes, densities, or non-globular shapes

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters: find parts of clusters, but they then need to be put together.

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits.

[Figure: six nested clusters and the corresponding dendrogram, with merge heights from 0 to 0.2]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number can be obtained by 'cutting' the dendrogram at the proper level.

They may correspond to meaningful taxonomies (e.g., in the biological sciences: animal kingdom, phylogeny reconstruction).

Hierarchical Clustering

Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and, at each step, merge the closest pair of clusters) and divisive (start with one all-inclusive cluster and, at each step, split a cluster).

Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

Agglomerative Clustering Algorithm

The more popular hierarchical clustering technique.

Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat: merge the two closest clusters, then update the proximity matrix
4. Until only a single cluster remains

Key operation is the computation of the proximity of two clusters
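In practice the whole loop is available off the shelf; a sketch with SciPy, where the method argument selects the inter-cluster similarity definition discussed on the following slides (the sample data are an assumption):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(6, 2)                          # six 2-D points
Z = linkage(X, method='single')                   # the merge sequence; try 'complete', 'average', 'ward'
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into two clusters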

Starting Situation

Start with clusters of individual points and a proximity matrix

[Figure: five singleton clusters p1-p5 and their proximity matrix]

Intermediate Situation

After some merging steps we have some clusters

[Figure: clusters C1-C5 and the current proximity matrix]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

[Figure: clusters C1-C5, with C2 and C5 about to be merged, and the proximity matrix]

After Merging

The question is "How do we update the proximity matrix?"

[Figure: clusters C1, C3, C4, and the merged C2 ∪ C5, with the proximity matrix entries to be recomputed]

How to Define Inter-Cluster Similarity

[Figure: two candidate clusters of points p1-p5 and the proximity matrix; the deck repeats this figure while highlighting each definition below]

o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function – Ward's Method uses squared error

Cluster Similarity MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters; i.e., it is determined by one pair of points, one from each cluster.

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00
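Converting the similarities above to distances (here simply 1 − similarity, an assumption for illustration), the single-link merge order can be reproduced with SciPy:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])
D = 1.0 - S                                        # similarity -> distance
Z = linkage(squareform(D, checks=False), method='single')
# First merge: I1 and I2 (distance 0.10); second: I4 and I5 (0.20).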

Hierarchical Clustering MIN

Nested Clusters Dendrogram

[Figure: MIN nested clusters of the six points and the corresponding dendrogram, merge heights from 0 to 0.2]

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

(Same similarity matrix as above.)

Hierarchical Clustering MAX

Nested Clusters Dendrogram

[Figure: MAX nested clusters and the corresponding dendrogram, merge heights from 0 to 0.4]

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters

• Biased towards globular clusters

Cluster Similarity Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Need to use average connectivity for scalability since total proximity favors large clusters

$\mathrm{proximity}(\text{Cluster}_i,\ \text{Cluster}_j) = \dfrac{\sum_{p_i \in \text{Cluster}_i,\ p_j \in \text{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\text{Cluster}_i| \times |\text{Cluster}_j|}$

(Same similarity matrix as above.)

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

[Figure: group-average nested clusters and the corresponding dendrogram, merge heights from 0 to 0.25]

Hierarchical Clustering Group Average

Compromise between Single and Complete Link.

Strengths: less susceptible to noise and outliers.

Limitations: biased towards globular clusters.

Cluster Similarity Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

[Figure: nested-cluster views of the same six points under MIN, MAX, Group Average, and Ward's Method]

Hierarchical Clustering Time and Space requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N steps, and at each step the proximity matrix of size N² must be searched and updated.

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized

Different schemes have problems with one or more of the following: sensitivity to noise and outliers, difficulty handling clusters of different sizes or non-globular shapes, and breaking large clusters.

MST Divisive Hierarchical Clustering

Build an MST (Minimum Spanning Tree) from the proximity graph.

Use the MST to construct the hierarchy of clusters: repeatedly break the largest remaining link to split a cluster in two.
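A sketch of the idea with SciPy (the data and the single-edge cut are assumptions): build the MST on the pairwise distances, then remove the longest edge; the connected components that remain are the clusters one level down the hierarchy.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import cdist

X = np.random.rand(10, 2)
D = cdist(X, X)                                  # pairwise Euclidean distances
mst = minimum_spanning_tree(D).toarray()         # MST as a weighted adjacency matrix
mst[mst == mst.max()] = 0                        # divisive step: break the largest link
n_clusters, labels = connected_components(mst != 0, directed=False)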

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"; then why do we want to evaluate them?

Clusters found in Random Data

[Figure: 100 random points in the unit square, and the clusters 'found' in them by K-means, DBSCAN, and complete-link hierarchical clustering]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information; use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

o External Index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy).
o Internal Index: used to measure the goodness of a clustering structure without respect to external information (e.g., SSE).
o Relative Index: used to compare two different clusterings or clusters; often an external or internal index is used for this function.

Sometimes these are referred to as criteria instead of indices.

Measuring Cluster Validity Via Correlation

Two matrices:
o Proximity matrix.
o "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise.

Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between n(n - 1)/2 entries needs to be calculated.

High correlation indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

[Figure: two data sets in the unit square, one with three well-separated clusters and one of random points]

Corr = -0.9235 (clustered data)    Corr = -0.5810 (random data)
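A sketch of the computation (the incidence matrix is built from the cluster labels; proximity is taken to be Euclidean distance, which is why the correlations above are negative):

import numpy as np
from scipy.spatial.distance import cdist

def incidence_proximity_corr(X, labels):
    """Correlation between the incidence and proximity matrices of a clustering."""
    P = cdist(X, X)                                          # proximity (distance) matrix
    C = (labels[:, None] == labels[None, :]).astype(float)   # 1 if the pair shares a cluster
    iu = np.triu_indices(len(X), k=1)                        # matrices are symmetric: upper triangle only
    return np.corrcoef(P[iu], C[iu])[0, 1]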

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: three well-separated clusters and their sorted similarity matrix, showing a sharp block-diagonal structure]
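The reordering itself is plain indexing; a sketch (with similarity taken as 1 − normalized distance, an assumption):

import numpy as np
from scipy.spatial.distance import cdist

def sorted_similarity(X, labels):
    """Order the similarity matrix by cluster label for visual inspection."""
    order = np.argsort(labels, kind='stable')    # group the points cluster by cluster
    D = cdist(X, X)
    S = 1.0 - D / D.max()                        # similarity in [0, 1]
    return S[np.ix_(order, order)]               # reorder rows and columns together

Plotting the result (e.g., with matplotlib's imshow) shows a sharp block-diagonal structure when the clusters are well separated.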

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: random points and the sorted similarity matrix for the DBSCAN clustering; the block structure is blurred]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: random points and the sorted similarity matrix for the K-means clustering]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: random points and the sorted similarity matrix for the complete-link clustering]

Using Similarity Matrix for Cluster Validation

[Figure: seven DBSCAN clusters in a more complicated figure, and the corresponding sorted similarity matrix]

Clusters in more complicated figures aren't well separated.

Internal Measures SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE).

Can also be used to estimate the number of clusters.

[Figure: a ten-cluster sample data set and its SSE-versus-K curve for K from 2 to 30]
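Estimating the number of clusters from this curve (the 'elbow method') is a loop over K, reusing the kmeans and sse helpers above:

sse_per_k = []
for K in range(2, 11):
    centroids, labels = kmeans(X, K)
    sse_per_k.append(sse(X, labels, centroids))
# Plot K against SSE and look for the knee where the curve stops dropping sharply.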

Internal Measures SSE

SSE curve for a more complicated data set

[Figure: a more complicated data set with seven labeled clusters]

SSE of clusters found using K-means

Framework for Cluster Validity

Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary, but there is still the question of whether the difference between two index values is significant.

Statistical Framework for SSE

Example: compare the SSE of a clustering against the distribution of SSE values obtained by clustering many sets of random data.

[Figure: histogram of SSE values (Count versus SSE, roughly 0.016 to 0.034) for clusterings of random points in the unit square]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the same two data sets as before, one with three well-separated clusters and one of random points]

Corr = -0.9235 (clustered data)    Corr = -0.5810 (random data)

Internal Measures Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error.
o Cohesion is measured by the within-cluster sum of squares: $\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$
o Separation is measured by the between-cluster sum of squares: $\mathrm{BSS} = \sum_i |C_i|\,(m - m_i)^2$, where $m$ is the overall mean and $m_i$ is the mean of cluster $C_i$.

Internal Measures Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a line (overall mean m = 3; cluster means m1 = 1.5, m2 = 4.5).

K = 1 cluster:
WSS = (1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)² = 10
BSS = 4 × (3 - 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} and {4, 5}):
WSS = (1 - 1.5)² + (2 - 1.5)² + (4 - 4.5)² + (5 - 4.5)² = 1
BSS = 2 × (3 - 1.5)² + 2 × (4.5 - 3)² = 9
Total = 1 + 9 = 10
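The same bookkeeping in code; as the worked example shows, WSS + BSS stays constant (it equals the total SSE about the overall mean), so increasing K trades cohesion for separation. A sketch:

import numpy as np

def wss_bss(X, labels):
    """Within-cluster (cohesion) and between-cluster (separation) sums of squares."""
    m = X.mean(axis=0)                           # overall mean
    wss = bss = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        mk = pts.mean(axis=0)                    # cluster mean m_k
        wss += np.sum((pts - mk) ** 2)
        bss += len(pts) * np.sum((m - mk) ** 2)
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])
print(wss_bss(X, np.array([0, 0, 0, 0])))        # (10.0, 0.0)  -- K = 1
print(wss_bss(X, np.array([0, 0, 1, 1])))        # (1.0, 9.0)   -- K = 2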

Internal Measures Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of all links within a cluster, and separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.

[Figure: proximity-graph illustration, cohesion as edges within a cluster and separation as edges between clusters]

Internal Measures Silhouette Coefficient

Silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point i:
o a = average distance of i to the points in its own cluster
o b = min, over the other clusters, of the average distance of i to the points in that cluster
o $s = \dfrac{b - a}{\max(a, b)}$, typically between 0 and 1; the closer to 1, the better

Can calculate the average silhouette width for a cluster or a clustering.
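With scikit-learn (assuming it is available), both the per-point coefficient and the average silhouette width are one call each:

from sklearn.metrics import silhouette_samples, silhouette_score

# X, labels as in the sketches above
s_per_point = silhouette_samples(X, labels)   # s(i) = (b - a) / max(a, b) for every point
s_average = silhouette_score(X, labels)       # average silhouette width of the clustering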

External Measures of Cluster Validity Entropy and Purity

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

From Algorithms for Clustering Data, Jain and Dubes

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 14: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Types of Clusters Density-Based

Density-based

6 density-based clusters

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 15: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Types of Clusters Conceptual Clusters

Shared Property or Conceptual Clusters

2 Overlapping Circles

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data)
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the 'correct' number of clusters

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: compares two different clusterings or clusters; often an external or internal index is used for this purpose

Sometimes these are referred to as criteria instead of indices

Measuring Cluster Validity Via Correlation

Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise

Compute the correlation between the two matrices; since both are symmetric, only n(n-1)/2 entries need to be compared

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density- or contiguity-based clusters

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: a data set with three well-separated clusters, Corr = -0.9235, and a random data set, Corr = -0.5810]
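A minimal sketch of this computation (the helper name is hypothetical; NumPy and SciPy are assumptions): correlate the condensed vector of pairwise distances with a same-cluster indicator over the same pairs. Because distances, not similarities, are used here, a good clustering yields a strongly negative correlation, as in the figure.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.stats import pearsonr

    def incidence_proximity_correlation(X, labels):
        """X: data matrix; labels: NumPy integer array of cluster labels."""
        dist = pdist(X)                               # condensed pairwise distances
        incidence = (labels[:, None] == labels[None, :]).astype(float)
        pairs = squareform(incidence, checks=False)   # same pairs, condensed order
        r, _ = pearsonr(dist, pairs)
        return r                                      # strongly negative is good here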

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually

[Figure: three well-separated clusters and their 100 x 100 similarity matrix sorted by cluster label, showing crisp block-diagonal structure; similarity scale 0-1]
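A sketch of the visual check (assuming NumPy, SciPy, and matplotlib; the function is illustrative, not from the slides): sort the points by cluster label and image the pairwise similarity matrix, where crisp diagonal blocks indicate well-separated clusters.

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.spatial.distance import cdist

    def show_sorted_similarity(X, labels):
        order = np.argsort(labels)         # group points cluster by cluster
        D = cdist(X[order], X[order])      # pairwise distances after reordering
        S = 1.0 - D / D.max()              # rescale to a similarity in [0, 1]
        plt.imshow(S)
        plt.colorbar(label='Similarity')
        plt.xlabel('Points')
        plt.ylabel('Points')
        plt.show()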

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: random data clustered by DBSCAN; the sorted similarity matrix shows only weak block structure]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: random data clustered by K-means; the sorted similarity matrix shows only weak block structure]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: random data clustered by complete link; the sorted similarity matrix shows only weak block structure]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complicated figure into seven clusters, with its sorted similarity matrix; clusters in more complicated figures aren't well separated]

Internal Measures: SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

[Figure: the ten-cluster data set and its SSE-vs-K curve (K from 2 to 30); SSE drops sharply as K approaches the natural number of clusters]
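A sketch of that use (assuming scikit-learn, whose KMeans exposes the within-cluster SSE as inertia_): compute the SSE for a range of K and look for the knee in the curve.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(2).random((200, 2))    # toy data
    for k in range(1, 11):
        sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        print(k, sse)                                # plot K vs. SSE; look for the knee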

Internal Measures: SSE

[Figure: SSE curve for a more complicated data set (the seven-cluster figure); SSE of clusters found using K-means]

Framework for Cluster Validity

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses, a framework is less necessary

Statistical Framework for SSE

Example:

[Figure: histogram (Count vs. SSE, roughly 0.016-0.034) of the SSE of clusterings of random data, shown next to the three-cluster data set]
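A sketch of the framework (assuming scikit-learn; the number of trials is arbitrary): cluster many random data sets of the same size and compare the observed SSE against that null distribution.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    null_sse = [
        KMeans(n_clusters=3, n_init=10).fit(rng.random((100, 2))).inertia_
        for _ in range(500)                 # trial count chosen arbitrarily
    ]
    # An observed SSE far below min(null_sse) is unlikely to arise from random data.
    print(min(null_sse), float(np.mean(null_sse)))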

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the well-separated data set gives Corr = -0.9235; the random data set gives Corr = -0.5810]

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters

Example: Squared Error

Cohesion is measured by the within-cluster sum of squares:

    WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares:

    BSS = \sum_i |C_i| (m - m_i)^2

where m is the overall mean, m_i is the centroid of cluster C_i, and |C_i| is its size

Internal Measures: Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a number line, with overall mean m = 3 and cluster centroids m1 = 1.5 (for {1, 2}) and m2 = 4.5 (for {4, 5})

K = 1 cluster:

    WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
    BSS = 4 \times (3-3)^2 = 0
    Total = 10 + 0 = 10

K = 2 clusters:

    WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
    BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9
    Total = 1 + 9 = 10

Note that WSS + BSS is constant, so minimizing cohesion (WSS) is equivalent to maximizing separation (BSS)
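The arithmetic above can be checked mechanically; a small sketch, assuming NumPy:

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 5.0])
    m = x.mean()                           # overall mean: 3.0

    def wss_bss(clusters):
        wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
        bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
        return wss, bss

    print(wss_bss([x]))                    # K=1: (10.0, 0.0)
    print(wss_bss([x[:2], x[2:]]))         # K=2: (1.0, 9.0)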

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of all links within a cluster, and separation is the sum of the weights of links between nodes in the cluster and nodes outside it

Internal Measures: Silhouette Coefficient

Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings

For an individual point i:
• a = average distance of i to the points in its own cluster
• b = minimum, over the other clusters, of the average distance of i to the points in that cluster
• s = 1 - a/b if a < b (typically between 0 and 1; the closer to 1, the better)

Can calculate the average silhouette width for a cluster or a clustering
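A sketch using scikit-learn's implementation (the library choice is an assumption; the slides don't name one):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    X = np.random.default_rng(4).random((100, 2))   # toy data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels))              # average width of the clustering
    print(silhouette_samples(X, labels)[:5])        # per-point coefficients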

External Measures of Cluster Validity: Entropy and Purity

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

-- Algorithms for Clustering Data, Jain and Dubes

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 16: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Types of Clusters Objective Function

Clusters Defined by an Objective Function

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 17: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Types of Clusters Objective Function hellip

Map the clustering problem to a different domain and solve a related problem in that domain

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 18: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Characteristics of the Input Data Are Important

Type of proximity or density measure

Sparseness

Attribute type

Type of Data

Dimensionality Noise and Outliers Type of Distribution

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

[Figure: iterations 1-4 of K-means on the ten-cluster data set.]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: the four iterations shown as individual panels.]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one

[Figure: iterations 1-4 of K-means on the ten-cluster data set with this initialization.]

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one

[Figure: the four iterations shown as individual panels.]

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing
o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies exist for choosing a replacement centroid, e.g., the point that contributes most to SSE, or a point from the cluster with the highest SSE.

Updating Centers Incrementally

In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm: a variant of K-means that obtains K clusters by repeatedly splitting one cluster into two.
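The algorithm listing on this slide did not survive extraction, so the following is a sketch of the standard bisecting idea rather than the slide's exact pseudocode: keep splitting the cluster with the largest SSE using 2-means, keeping the best of several trial bisections, until K clusters are obtained. It reuses the kmeans and sse sketches above.

    import numpy as np

    def bisecting_kmeans(X, k, trials=5):
        """Repeatedly bisect the cluster with the largest SSE using 2-means,
        keeping the best of several trial bisections, until k clusters remain."""
        clusters = [np.arange(len(X))]          # index arrays, one per cluster
        def cluster_sse(idx):
            c = X[idx].mean(axis=0, keepdims=True)
            return sse(X[idx], np.zeros(len(idx), dtype=int), c)
        while len(clusters) < k:
            # pick the cluster that contributes most to the total SSE
            worst = max(range(len(clusters)), key=lambda i: cluster_sse(clusters[i]))
            idx = clusters.pop(worst)
            best = None
            for t in range(trials):             # try several bisections
                cents, lab = kmeans(X[idx], 2, seed=t)
                if lab.min() == lab.max():      # degenerate split; skip it
                    continue
                total = sse(X[idx], lab, cents)
                if best is None or total < best[0]:
                    best = (total, lab)
            _, lab = best
            clusters += [idx[lab == 0], idx[lab == 1]]
        return clusters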

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters: find parts of clusters, which then need to be put together.

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits.

[Figure: a nested clustering of six points and the corresponding dendrogram.]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number can be obtained by cutting the dendrogram at the proper level.

The nested clusters may correspond to meaningful taxonomies (e.g., in the biological sciences).

Hierarchical Clustering

Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and, at each step, merge the closest pair) and divisive (start with one all-inclusive cluster and, at each step, split a cluster).

Traditional hierarchical algorithms use a similarity or distance matrix, merging or splitting one cluster at a time.

Agglomerative Clustering Algorithm

More popular hierarchical clustering technique. Basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.

Key operation is the computation of the proximity of two clusters.
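A naive sketch of the basic algorithm in Python/NumPy, O(N³) as discussed later; the linkage argument selects how the proximity of two clusters is computed, anticipating the MIN/MAX/group-average definitions below:

    import numpy as np

    def agglomerative(X, linkage="min"):
        """Start with every point as its own cluster, then repeatedly merge
        the two closest clusters, recording each merge (the dendrogram)."""
        D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # proximity matrix
        clusters = [[i] for i in range(len(X))]
        merges = []
        while len(clusters) > 1:
            best = (np.inf, 0, 1)
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    block = D[np.ix_(clusters[a], clusters[b])]
                    if linkage == "min":
                        d = block.min()       # single link
                    elif linkage == "max":
                        d = block.max()       # complete link
                    else:
                        d = block.mean()      # group average
                    if d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            merges.append((clusters[a], clusters[b], d))
            clusters[a] = clusters[a] + clusters[b]  # merge b into a
            del clusters[b]
        return merges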

Starting Situation

Start with clusters of individual points and a proximity matrix

[Figure: points p1-p5 and their pairwise proximity matrix.]

Intermediate Situation

After some merging steps we have some clusters

[Figure: clusters C1-C5 and their proximity matrix.]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

[Figure: clusters C1-C5 and their proximity matrix, with C2 and C5 about to be merged.]

After Merging

The question is "How do we update the proximity matrix?"

[Figure: the clusters after merging C2 and C5; the proximity-matrix row and column for C2 U C5 must be recomputed.]

How to Define Inter-Cluster Similarity

[Figure: two candidate clusters among points p1-p5, their proximity matrix, and the inter-cluster similarity being illustrated.]

o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function – Ward's Method uses squared error


Cluster Similarity MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters.

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00
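For instance, using the similarity matrix above, the inter-cluster similarity between the clusters {I1, I2} and {I3, I4, I5} (an illustrative split, not from the slides) under the different definitions; note that with similarities, single link takes the most similar pair, i.e., the maximum:

    import numpy as np

    # The slide's similarity matrix for items I1..I5.
    S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                  [0.90, 1.00, 0.70, 0.60, 0.50],
                  [0.10, 0.70, 1.00, 0.40, 0.30],
                  [0.65, 0.60, 0.40, 1.00, 0.80],
                  [0.20, 0.50, 0.30, 0.80, 1.00]])

    A, B = [0, 1], [2, 3, 4]        # clusters {I1, I2} and {I3, I4, I5}
    block = S[np.ix_(A, B)]         # cross-cluster similarities
    print(block.max())              # MIN / single link:   0.70 (I2, I3)
    print(block.min())              # MAX / complete link: 0.10 (I1, I3)
    print(block.mean())             # group average:       about 0.458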

Hierarchical Clustering MIN

[Figure: nested clusters of six points under MIN linkage and the corresponding dendrogram.]

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters.

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering MAX

[Figure: nested clusters of six points under MAX linkage and the corresponding dendrogram.]

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters

• Biased towards globular clusters

Cluster Similarity Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

$$\mathrm{proximity}(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i,\ p_j \in Cluster_j} \mathrm{proximity}(p_i, p_j)}{|Cluster_i| \times |Cluster_j|}$$

Need to use average connectivity for scalability, since total proximity favors large clusters.

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering Group Average

[Figure: nested clusters of six points under group-average linkage and the corresponding dendrogram.]

Hierarchical Clustering Group Average

Compromise between Single and Complete Link.

Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.

Cluster Similarity Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

[Figure: the same six points clustered by MIN, MAX, Group Average, and Ward's Method, side by side.]

Hierarchical Clustering Time and Space requirements

O(N²) space, since it stores the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N steps, and at each step a proximity matrix of size N² must be searched and updated.

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized.

Different schemes have problems with one or more of the following: sensitivity to noise and outliers, difficulty handling clusters of different sizes and non-globular shapes, and a tendency to break large clusters.

MST Divisive Hierarchical Clustering – Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering – Use MST for constructing a hierarchy of clusters
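The construction details on these two slides were lost in extraction; below is a sketch of the standard MST-based divisive approach, assuming Euclidean distances: build the MST with Prim's algorithm, then cut the longest edges so the remaining connected components form the clusters.

    import numpy as np

    def mst_divisive(X, k):
        """Build the minimum spanning tree of the points (Prim's algorithm),
        then cut the k-1 longest MST edges; the connected components that
        remain are the k clusters."""
        n = len(X)
        D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
        visited = np.zeros(n, dtype=bool)
        visited[0] = True
        best = D[0].copy(); best[0] = np.inf   # distance from the tree to each point
        parent = np.zeros(n, dtype=int)
        edges = []                             # (length, u, v) MST edges
        for _ in range(n - 1):
            v = int(best.argmin())             # closest point not yet in the tree
            edges.append((best[v], int(parent[v]), v))
            visited[v] = True
            best[v] = np.inf
            closer = (D[v] < best) & ~visited
            parent[closer] = v
            best[closer] = D[v][closer]
        edges.sort(reverse=True)               # longest edges first
        comp = list(range(n))                  # union-find over the kept edges
        def find(a):
            while comp[a] != a:
                comp[a] = comp[comp[a]]
                a = comp[a]
            return a
        for _, u, v in edges[k - 1:]:          # drop the k-1 longest edges
            comp[find(u)] = find(v)
        return np.array([find(i) for i in range(n)])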

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?

Clusters found in Random Data

[Figure: uniformly random points, and the "clusters" found in them by K-means, DBSCAN, and complete-link hierarchical clustering.]

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

o External Index: compare the clustering to externally supplied information, such as class labels (e.g., entropy, purity).

o Internal Index: measure the goodness of the clustering using only the data (e.g., SSE).

o Relative Index: compare two different clusterings or clusters.

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column per point, whose entry is 1 if the two points belong to the same cluster and 0 otherwise.

Compute the correlation between the two matrices.

High correlation indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

[Figure: the two x-y data sets being compared: one with well-separated clusters, one of random points.]

Corr = -0.9235   Corr = -0.5810
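A sketch of the computation, assuming distances are used as the proximity; with a distance matrix, a good clustering yields a strongly negative correlation, as in the values above:

    import numpy as np

    def incidence_proximity_corr(X, labels):
        """Correlation between the incidence matrix (entry 1 if two points
        share a cluster) and the distance matrix, over distinct pairs.
        labels: NumPy integer array of cluster ids."""
        prox = np.linalg.norm(X[:, None] - X[None, :], axis=2)
        incid = (labels[:, None] == labels[None, :]).astype(float)
        iu = np.triu_indices(len(X), k=1)      # each pair counted once
        return float(np.corrcoef(prox[iu], incid[iu])[0, 1])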

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

[Figure: a clustered x-y data set and its point-by-point similarity matrix sorted by cluster label; the clusters appear as well-defined blocks on the diagonal.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: random points clustered by DBSCAN, with the corresponding label-ordered similarity matrix.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: random points clustered by K-means, with the corresponding label-ordered similarity matrix.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: random points clustered by complete link, with the corresponding label-ordered similarity matrix.]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clusters 1-7 in a more complicated data set, and the corresponding similarity matrix.]

Clusters in more complicated figures aren't well separated.

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE).

SSE can also be used to estimate the number of clusters.

Internal Measures SSE

[Figure: the ten-cluster data set and the SSE obtained by K-means as K varies from 2 to 30.]
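Reusing the kmeans and sse sketches from earlier, the curve can be reproduced by scanning K and watching where the SSE stops dropping sharply (X here is a hypothetical data array):

    # Estimate the number of clusters: run K-means for a range of K
    # and look for the knee in the SSE curve.
    for k in range(2, 31):
        centroids, labels = kmeans(X, k)
        print(k, sse(X, labels, centroids))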

Internal Measures SSE

SSE curve for a more complicated data set

[Figure: the more complicated data set with clusters 1-7; SSE of clusters found using K-means.]

Need a framework to interpret any measure: if, say, a measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Framework for Cluster Validity

Example

Statistical Framework for SSE

[Figure: histogram (count vs. SSE, with SSE roughly 0.016-0.034) of the SSE values obtained by clustering random data, shown next to the clustered x-y data set being evaluated.]

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

[Figure: the same two x-y data sets as before.]

Corr = -0.9235   Corr = -0.5810

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error

Internal Measures Cohesion and Separation

Cohesion is measured by the within-cluster sum of squares (WSS):

$$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$$

Separation is measured by the between-cluster sum of squares (BSS):

$$\mathrm{BSS} = \sum_i |C_i|\,(m - m_i)^2$$

where $m_i$ is the mean of cluster $C_i$, $|C_i|$ its size, and $m$ the overall mean.

Internal Measures Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster means m1 = 1.5 and m2 = 4.5.

K=2 clusters ({1, 2} and {4, 5}):
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10

K=1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10

WSS + BSS is constant: it equals the total SSE about the overall mean.

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of links within a cluster, and separation is the sum of the weights of links between nodes in the cluster and nodes outside it.

Internal Measures Cohesion and Separation

[Figure: proximity-graph illustration of cohesion (edges within a cluster) and separation (edges between clusters).]

Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point i:
o Calculate a = the average distance of i to the points in its own cluster.
o Calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster.
o The silhouette coefficient of i is s = (b − a) / max(a, b).

Can calculate the average silhouette width for a cluster or a clustering.

Internal Measures Silhouette Coefficient
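A per-point sketch under the definition above (assumes at least two clusters and Euclidean distance):

    import numpy as np

    def silhouette(X, labels):
        """Per-point silhouette: a = mean distance to own cluster,
        b = smallest mean distance to another cluster, s = (b-a)/max(a,b)."""
        D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
        s = np.zeros(len(X))
        for i in range(len(X)):
            own = labels == labels[i]
            own[i] = False                     # exclude the point itself from a
            a = D[i, own].mean() if own.any() else 0.0
            b = min(D[i, labels == c].mean()
                    for c in set(labels) - {labels[i]})
            s[i] = (b - a) / max(a, b)
        return s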

External Measures of Cluster Validity Entropy and Purity
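The body of this slide did not survive extraction; as a sketch of the standard definitions, given externally supplied class labels: purity is the weighted fraction of points in each cluster that belong to the cluster's majority class, and entropy measures how mixed the classes are within each cluster.

    import numpy as np

    def purity_entropy(cluster_labels, class_labels):
        """External measures: weighted purity and weighted entropy of the
        class distribution inside each cluster."""
        n = len(cluster_labels)
        purity = entropy = 0.0
        for c in np.unique(cluster_labels):
            classes = class_labels[cluster_labels == c]
            _, counts = np.unique(classes, return_counts=True)
            p = counts / counts.sum()                 # class distribution in cluster c
            purity += (len(classes) / n) * p.max()
            entropy += (len(classes) / n) * -(p * np.log2(p)).sum()
        return purity, entropy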

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

– Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity

Page 19: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Clustering Algorithms

K-means and its variants

Hierarchical clustering

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 20: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

K-means Clustering

o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple

K-means Clustering ndash Details

Initial centroids are often chosen randomly

The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity

correlation etc K-means will converge for common similarity measures mentioned

above Most of the convergence happens in the first few iterations

Complexity is O( n K I d )

sumisin

=iCxi

i xC

m||

1

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

[Figure: points 1, 2, 4, 5 on a line, with cluster centroids m1 = 1.5 and m2 = 4.5 and overall centroid m = 3.]

K = 2 clusters ({1, 2} and {4, 5}):

$$\mathrm{WSS} = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$$
$$\mathrm{BSS} = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$$
$$\mathrm{Total} = 1 + 9 = 10$$

K = 1 cluster:

$$\mathrm{WSS} = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$$
$$\mathrm{BSS} = 4 \times (3-3)^2 = 0$$
$$\mathrm{Total} = 10 + 0 = 10$$
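A quick numerical check of the example (numpy; the helper name wss_bss is ours):

```python
# Verify WSS/BSS for the four points 1, 2, 4, 5 above.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                   # overall mean = 3

def wss_bss(clusters):
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))                            # K=1: (10.0, 0.0)
print(wss_bss([x[:2], x[2:]]))                 # K=2: (1.0, 9.0)
```

Note that WSS + BSS = 10 in both cases: the total is constant, so minimizing cohesion (WSS) and maximizing separation (BSS) are equivalent here.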

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

[Figure: proximity-graph view of cohesion (within-cluster links) and separation (between-cluster links).]

Silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings

For an individual point i:

o a = average distance of i to the points in its own cluster

o b = minimum, over the other clusters, of the average distance of i to the points in that cluster

o silhouette coefficient s = (b - a) / max(a, b)

Can calculate the average silhouette width for a cluster or a clustering

Internal Measures: Silhouette Coefficient

[Figure: the distances a (within the cluster) and b (to the nearest other cluster) for a point i.]
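A short sketch using scikit-learn, which implements this per-point definition and its averages; the blob data is illustrative.

```python
# Per-point silhouettes, plus averages per cluster and for the whole clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
s = silhouette_samples(X, labels)      # s_i = (b_i - a_i) / max(a_i, b_i)
print(silhouette_score(X, labels))     # average silhouette width of the clustering
print(s[labels == 0].mean())           # average silhouette width of one cluster
```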

External Measures of Cluster Validity: Entropy and Purity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis.

Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

- Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity


K-means Clustering - Details

Initial centroids are often chosen randomly

The centroid $m_i$ is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations.

Complexity is O(n · K · I · d), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.

$$m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
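A minimal numpy sketch of the basic loop just described (random initial centroids, Euclidean closeness, centroids recomputed as means). It is an illustration, not a reference implementation, and it does not handle empty clusters (see 'Handling Empty Clusters' below).

```python
# Basic K-means: assign points to the closest centroid, recompute means, repeat.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = d.argmin(axis=1)                      # closest centroid per point
        new_m = np.array([X[labels == i].mean(axis=0)  # note: empty clusters
                          for i in range(k)])          # are not handled here
        if np.allclose(new_m, m):                      # converged
            break
        m = new_m
    return labels, m
```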

Two different K-means Clusterings

[Figure: the same original points (x from -2 to 2, y from 0 to 3) under two K-means clusterings: a sub-optimal clustering and the optimal clustering.]

Importance of Choosing Initial Centroids

[Figure: snapshots of K-means at iterations 1-6 for one choice of initial centroids.]

Importance of Choosing Initial Centroids

[Figure: snapshots of K-means at iterations 1-6 for a different choice of initial centroids.]

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathit{dist}^2(m_i, x)$$

where $x$ is a data point in cluster $C_i$ and $m_i$ is the centroid of $C_i$.

Importance of Choosing Initial Centroids ...

[Figure: snapshots of K-means at iterations 1-5 for another choice of initial centroids.]

Importance of Choosing Initial Centroids ...

[Figure: snapshots of K-means at iterations 1-5 for yet another choice of initial centroids.]

Problems with Selecting Initial Points

If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
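Under the common simplifying assumption that the K clusters have equal size n (an assumption, not stated on this slide) and initial centroids are drawn uniformly at random, 'small' can be made precise:

$$P = \frac{\text{ways to select one centroid from each cluster}}{\text{ways to select } K \text{ centroids}} = \frac{K!\,n^{K}}{(Kn)^{K}} = \frac{K!}{K^{K}}$$

For K = 10 this is $10!/10^{10} \approx 0.00036$, i.e. well under a 0.1% chance.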

10 Clusters Example

[Figure: K-means iterations 1-4 on the ten-cluster data set (x from 0 to 20, y from -6 to 8).]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: K-means iterations 1-4, again on the ten-cluster data set.]

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one

[Figure: K-means iterations 1-4 on the ten-cluster data set.]

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one

[Figure: K-means iterations 1-4 on the ten-cluster data set.]

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing

o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)
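A sketch of the incremental update; the running-mean identity is standard, and the helper name add_point is ours. Adding point x to a cluster of size n moves its centroid by (x - m)/(n + 1), so each assignment costs only O(d).

```python
# Incremental (online) centroid update: no need to revisit earlier points.
import numpy as np

def add_point(m_i, n_i, x):
    """Update centroid m_i of a cluster of size n_i after adding point x."""
    n_i += 1
    m_i = m_i + (x - m_i) / n_i
    return m_i, n_i

m, n = np.zeros(2), 0
for x in np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 6.0]]):
    m, n = add_point(m, n, x)
print(m)  # [3. 2.], the mean of the three points
```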

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing sizes, densities, or non-globular shapes

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters: find parts of clusters, which then need to be put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram

[Figure: nested clusters of six points and the corresponding dendrogram, with merge heights from 0 to 0.2.]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

The clusters obtained may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and repeatedly merge the closest pair) and divisive (start with one all-inclusive cluster and repeatedly split)

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm

More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters
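A sketch of the whole pipeline with SciPy (an illustration; the five random points stand in for the figures' p1-p5): pdist computes the proximity matrix, linkage performs the successive merges for a chosen definition of inter-cluster proximity, and dendrogram draws the result.

```python
# Agglomerative clustering end to end: proximity matrix -> merges -> dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((5, 2))                     # five points, as in the figures
Z = linkage(pdist(X), method="single")     # or "complete", "average", "ward"
dendrogram(Z, labels=[f"p{i+1}" for i in range(5)])
plt.ylabel("Merge distance")
plt.show()
```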

Starting Situation

Start with clusters of individual points and a proximity matrix

[Figure: five points p1-p5 and their proximity matrix.]

Intermediate Situation

After some merging steps we have some clusters

[Figure: clusters C1-C5 and the current proximity matrix.]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

[Figure: clusters C1-C5 and the proximity matrix before the merge.]

After Merging

The question is "How do we update the proximity matrix?"

[Figure: clusters after the merge, with C2 and C5 replaced by C2 U C5 in the proximity matrix.]

How to Define Inter-Cluster Similarity

[Figure: points p1-p5 and the proximity matrix, illustrating the candidate definitions.]

o MIN

o MAX

o Group Average

o Distance Between Centroids

o Other methods driven by an objective function - Ward's Method uses squared error


Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00

Hierarchical Clustering MIN

Nested Clusters Dendrogram

[Figure: nested single-link clusters and the corresponding dendrogram (leaves 3, 6, 2, 5, 4, 1; heights up to 0.2).]

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters


Hierarchical Clustering MAX

Nested Clusters Dendrogram

[Figure: nested complete-link clusters and the corresponding dendrogram (leaves 3, 6, 4, 1, 2, 5; heights up to 0.4).]

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters

• Biased towards globular clusters

Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

$$\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \frac{\sum_{p_i \in \mathrm{Cluster}_i,\; p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \times |\mathrm{Cluster}_j|}$$
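A small sketch computing this quantity from the similarity table above; the cluster choice ({I1, I2} versus {I4, I5}) and the function name are ours.

```python
# Group-average proximity between two clusters from a precomputed matrix.
import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

def group_average(S, ci, cj):
    # average pairwise proximity between points of cluster ci and cluster cj
    return S[np.ix_(ci, cj)].sum() / (len(ci) * len(cj))

print(group_average(S, [0, 1], [3, 4]))   # {I1, I2} vs {I4, I5} -> 0.4875
```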

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

[Figure: nested group-average clusters and the corresponding dendrogram (leaves 3, 6, 4, 1, 2, 5; heights up to 0.25).]

Hierarchical Clustering: Group Average

Compromise between Single and Complete Link

Strengths

Limitations

Cluster Similarity: Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

[Figure: the same six points clustered by MIN, MAX, Group Average, and Ward's Method.]

Hierarchical Clustering Time and Space requirements

O(N²) space, since it uses the proximity matrix (N is the number of points)

O(N³) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering: Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering: Use MST for constructing the hierarchy of clusters
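A sketch of the idea with SciPy (illustrative data; assumes the k - 1 largest MST edges have distinct weights): build the MST of the complete distance graph, delete the largest edges, and read the clusters off the connected components.

```python
# MST-based divisive clustering: cut the k-1 heaviest edges of the MST.
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((30, 2))
mst = minimum_spanning_tree(squareform(pdist(X))).toarray()

k = 3                                  # desired number of clusters
cut = np.sort(mst[mst > 0])[-(k - 1)]  # weight of the (k-1)-th largest edge
mst[mst >= cut] = 0                    # remove the heaviest edges
n_comp, labels = connected_components(mst, directed=False)
print(n_comp, labels)                  # k components = k clusters
```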

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?

Clusters found in Random Data

[Figure: random points (x, y in [0, 1]) and the clusters found in them by K-means, DBSCAN, and complete link.]

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. - Use only the data.

4. Comparing the results of two different sets of cluster analyses to determine which is better.

5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices: the proximity matrix, and the incidence matrix, which has one row and one column per data point and an entry of 1 if the associated pair of points belongs to the same cluster, 0 otherwise.

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 22: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Two different K-means Clusterings

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Optimal Clustering

Original Points

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 23: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering: Comparison

[Figure: the same six points clustered with MIN, MAX, Group Average, and Ward's Method; each linkage produces a different nesting of clusters 1-6.]

Hierarchical Clustering: Time and Space Requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N steps, and at each step an N × N proximity matrix must be searched and updated.

Hierarchical Clustering: Problems and Limitations

• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  o sensitivity to noise and outliers
  o difficulty handling different-sized clusters and non-globular shapes
  o breaking large clusters

MST: Divisive Hierarchical Clustering

Build an MST (Minimum Spanning Tree) of the proximity graph, then use the MST to construct a hierarchy of clusters: each split removes the largest remaining MST edge, dividing one cluster into two. A sketch follows.
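A small sketch of the MST-divisive idea under my own assumptions (function names and the toy data are made up): build the MST of the complete distance graph, cut the k − 1 longest MST edges, and take the remaining connected components as clusters.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_divisive(X, k):
    """Cluster the rows of X into k groups by cutting the k-1 longest MST edges."""
    D = squareform(pdist(X))                  # full pairwise distance matrix
    mst = minimum_spanning_tree(D).toarray()  # n-1 edges, weights = distances
    edges = np.argwhere(mst > 0)              # edge endpoints, row-major order
    weights = mst[mst > 0]                    # matching edge weights
    # cut (zero out) the k-1 heaviest edges
    for idx in np.argsort(weights)[len(weights) - (k - 1):]:
        i, j = edges[idx]
        mst[i, j] = 0.0
    n_comp, labels = connected_components(mst, directed=False)
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
print(mst_divisive(X, 2))   # two components, e.g. [0 0 0 1 1 1]
```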

Cluster Validity

For supervised classification, we have a variety of measures to evaluate how good our model is.

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?

Clusters found in Random Data

[Figure: the same uniform random points (x, y in [0, 1]) shown four ways: the raw Random Points, and the clusters found by DBSCAN, K-means, and Complete Link. Each algorithm partitions even purely random data into seemingly distinct clusters.]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (uses only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: compares two different clusterings or clusters

Sometimes these are referred to as criteria instead of indices.

Measuring Cluster Validity Via Correlation

Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column per point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise

Compute the correlation between the two matrices; since both are symmetric, only n(n-1)/2 entries need to be compared. High correlation indicates that points that belong to the same cluster are close to each other. Not a good measure for some density- or contiguity-based clusters.

Example: correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: two data sets in the unit square, one with visible clusters and one of random points, each shown with its K-means clustering.]

Corr = -0.9235    Corr = -0.5810
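To make the computation concrete, a sketch (mine) of the correlation measure: build the incidence matrix from cluster labels, the proximity matrix from pairwise distances, and correlate their entries. With distances as the proximity, a good clustering gives a strongly negative correlation, matching the signs above.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

prox = pdist(X)                           # condensed distance matrix: n(n-1)/2 entries
incidence = pdist(labels[:, None],        # 1 if the pair shares a cluster, else 0
                  lambda a, b: float(a[0] == b[0]))

corr = np.corrcoef(prox, incidence)[0, 1]
print(corr)   # strongly negative for well-separated clusters
```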

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: a scatter plot of three well-separated clusters and the corresponding sorted pairwise similarity matrix, which shows three bright diagonal blocks (similarity scale 0 to 1).]
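A hedged sketch of this visual check (data set and scaling are my own choices): sort the points by cluster label, compute a pairwise similarity matrix, and display it; well-separated clusters show up as bright diagonal blocks.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

order = np.argsort(labels)                 # group same-cluster points together
D = squareform(pdist(X))[order][:, order]  # reorder rows and columns
S = 1.0 - D / D.max()                      # rescale distances into [0, 1] similarities

plt.imshow(S, cmap='jet')                  # diagonal blocks = cohesive clusters
plt.colorbar(label='Similarity')
plt.show()
```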

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp: the sorted similarity matrices for DBSCAN, K-means, and Complete Link clusterings of random points show only fuzzy diagonal blocks.

[Figures: the random points labeled by DBSCAN, K-means, and Complete Link, each next to its sorted similarity matrix (similarity scale 0 to 1).]

[Figure: DBSCAN clustering of a more complicated data set with seven clusters, and its sorted similarity matrix.]

Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE), and can also be used to estimate the number of clusters.

[Figure: a data set of ten clusters and the SSE curve for K-means as K varies from 2 to 30; SSE drops sharply up to about K = 10 and flattens afterwards.]

Internal Measures: SSE

SSE curve for a more complicated data set.

[Figure: the seven-cluster data set and the SSE of the clusters found using K-means.]
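A sketch (mine) of using SSE to estimate the number of clusters: scikit-learn's KMeans exposes the SSE as inertia_, so plotting it against K reproduces elbow curves like the one above. The data set here is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

ks = range(2, 31)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker='o')
plt.xlabel('K')
plt.ylabel('SSE')   # look for the 'elbow' near the true number of clusters
plt.show()
```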

Framework for Cluster Validity

Need a framework to interpret any measure: if a measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Statistical Framework for SSE

Example: compare the SSE of a clustering of the given data against the distribution of SSEs obtained by clustering many sets of random data points.

[Figure: histogram of SSE values (roughly 0.016 to 0.034) for clusterings of random data, next to the original three-cluster data set.]
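A minimal Monte Carlo sketch of this framework; the 500 trials, the 100-point sample size, and the 0.2-0.8 range are my own choices, not the slide's exact setup.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def kmeans_sse(points, k=3):
    """SSE (inertia) of a K-means clustering of the given points."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_

# SSE distribution over 500 sets of 100 uniform random points
random_sses = [kmeans_sse(rng.uniform(0.2, 0.8, size=(100, 2)))
               for _ in range(500)]

plt.hist(random_sses, bins=20)
plt.xlabel('SSE')
plt.ylabel('Count')
plt.show()
# If the SSE of the real data falls far below this histogram, the
# clustering structure is unlikely to be due to chance.
```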

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the same two data sets: Corr = -0.9235 and Corr = -0.5810.

[Figure: the two data sets in the unit square.]

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are. Example: within-cluster sum of squares (WSS, i.e., squared error):

WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: between-cluster sum of squares (BSS):

BSS = \sum_i |C_i| \, (m - m_i)^2

where m is the overall mean of the data, m_i is the centroid of cluster C_i, and |C_i| is its size.

Internal Measures: Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster centroids m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 \times (3-3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} and {4, 5}):
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9
Total = 1 + 9 = 10

The total (WSS + BSS) stays constant; a better clustering shifts error from WSS to BSS.
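The arithmetic above can be checked in a few lines (a sketch; the points and partitions are exactly the slide's example, the function name is mine).

```python
import numpy as np

def wss_bss(points, clusters):
    """WSS = sum over clusters of squared distances to the cluster mean;
    BSS = sum over clusters of |Ci| * (overall mean - cluster mean)^2."""
    m = np.mean(points)
    wss = sum(((np.asarray(c) - np.mean(c)) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - np.mean(c)) ** 2 for c in clusters)
    return wss, bss

pts = [1, 2, 4, 5]
print(wss_bss(pts, [[1, 2, 4, 5]]))     # K=1: (10.0, 0.0)
print(wss_bss(pts, [[1, 2], [4, 5]]))   # K=2: (1.0, 9.0)
```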

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of links from nodes inside the cluster to nodes outside it.

[Figure: proximity graph with within-cluster edges illustrating cohesion and between-cluster edges illustrating separation.]

Internal Measures: Silhouette Coefficient

The silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.

For an individual point i:
o a = average distance of i to the points in its own cluster
o b = minimum, over the other clusters, of the average distance of i to the points in that cluster
o s = (b - a) / max(a, b): typically between 0 and 1, and the closer to 1 the better

Can calculate the average silhouette width for a cluster or a clustering.
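A short sketch using scikit-learn's silhouette utilities (the functions are real; the blob data set is an assumption): silhouette_samples gives the per-point coefficient, silhouette_score the average width for the whole clustering.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_per_point = silhouette_samples(X, labels)   # s = (b - a) / max(a, b) per point
print(s_per_point[:5])
print(silhouette_score(X, labels))            # average silhouette width
```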

External Measures of Cluster Validity: Entropy and Purity

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function …
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering – Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids …
  • Importance of Choosing Initial Centroids …
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Ward's Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 24: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Importance of Choosing Initial Centroids

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 6

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 25: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

sumsum= isin

=K

i Cxi

i

xmdistSSE1

2 )(

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 26: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 5

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix



Cluster Similarity: MIN or Single Link. Similarity of two clusters is based on the two most similar (closest) points in the different clusters.

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

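The same matrix can be fed to SciPy's linkage directly; a sketch that converts the similarities above into distances (using 1 − sim is an assumption about how the slide's similarities map to distances):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Similarity matrix from the table above.
S = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
])
Z = linkage(squareform(1.0 - S), method="single")  # condensed distances, single link
print(Z)   # merge order for the five items I1..I5
```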

Hierarchical Clustering MIN

(Figure: single-link nested clusters and the corresponding dendrogram; merge heights between 0 and 0.2.)

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00


Hierarchical Clustering MAX

(Figure: complete-link nested clusters and the corresponding dendrogram; merge heights between 0 and 0.4.)

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters
• Biased towards globular clusters

Cluster Similarity: Group Average. Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Need to use average connectivity for scalability since total proximity favors large clusters

$$\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \frac{\sum_{p_i \in \mathrm{Cluster}_i,\; p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \times |\mathrm{Cluster}_j|}$$
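A direct sketch of this formula, assuming SciPy and Euclidean distance as the proximity:

```python
import numpy as np
from scipy.spatial.distance import cdist

def group_average_proximity(ci, cj):
    """Average pairwise distance between clusters ci and cj (arrays of points),
    matching the formula above."""
    return cdist(ci, cj).sum() / (len(ci) * len(cj))

a = np.array([[0.0, 0.0], [0.0, 1.0]])
b = np.array([[3.0, 0.0], [3.0, 1.0]])
print(group_average_proximity(a, b))
```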

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00


Hierarchical Clustering Group Average

(Figure: group-average nested clusters and the corresponding dendrogram; merge heights between 0 and 0.25.)

Hierarchical Clustering: Group Average. A compromise between single link and complete link.

Strengths: less susceptible to noise and outliers. Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

(Figure: clusterings of the same six points under MIN, MAX, Group Average, and Ward's Method, shown side by side.)

Hierarchical Clustering Time and Space requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N merge steps, and at each step the proximity matrix must be searched and updated.

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling clusters of different sizes or non-globular shapes; breaking large clusters.

MST Divisive Hierarchical Clustering: build an MST (Minimum Spanning Tree) of the points, then use the MST to construct the hierarchy of clusters by repeatedly breaking the link corresponding to the largest remaining distance.
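A sketch of the divisive procedure, assuming SciPy; after breaking the largest edges, the connected components of what remains are the clusters:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))                    # hypothetical points

D = squareform(pdist(X))
mst = minimum_spanning_tree(D).toarray()        # MST over the complete graph

# Divisive step: repeatedly break the largest remaining MST edge.
for _ in range(2):                              # two breaks -> 3 clusters
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0
n, labels = connected_components(mst, directed=False)
print(n, labels)
```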

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder." Then why do we want to evaluate them?

Clusters found in Random Data

(Figure: 100 points distributed randomly in the unit square, together with the "clusters" found in them by K-means, DBSCAN, and complete-link hierarchical clustering.)

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy).

o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE).

o Relative Index: compares two different clusterings or clusters, often using an external or internal index to do so.

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices: the proximity matrix, and the "incidence" matrix, which has one row and one column per data point and an entry of 1 if the two points belong to the same cluster, 0 otherwise.

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

(Figure: K-means clusterings of a well-separated data set and a random data set in the unit square; Corr = −0.9235 and Corr = −0.5810, respectively.)
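A sketch of the computation, assuming scikit-learn and SciPy; with a distance matrix as the proximity, good clusterings give strongly negative correlations like those above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                        # hypothetical data
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

prox = squareform(pdist(X))                          # proximity (distance) matrix
incid = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster

iu = np.triu_indices_from(prox, k=1)                 # count each pair once
r, _ = pearsonr(prox[iu], incid[iu])
print(r)  # negative, since same-cluster pairs should have small distances
```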

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation
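A minimal sketch of this visual check, assuming scikit-learn and matplotlib; the cluster centers are synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(c, 0.1, (30, 2)) for c in ((0, 0), (1, 1), (0, 1))])
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

order = np.argsort(labels)                   # sort points by cluster label
D = squareform(pdist(X))[order][:, order]    # reorder rows and columns
plt.imshow(1 - D / D.max(), cmap="viridis")  # similarity: bright blocks = crisp clusters
plt.colorbar()
plt.show()
```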

(Figure: a data set with three well-separated clusters and its cluster-ordered similarity matrix; the matrix shows three crisp diagonal blocks.)

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

(Figure: cluster-ordered similarity matrix for DBSCAN clusters found in random data, next to the corresponding scatter plot.)

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

(Figure: cluster-ordered similarity matrix for K-means clusters found in random data, next to the corresponding scatter plot.)

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

(Figure: cluster-ordered similarity matrix for complete-link clusters found in random data, next to the corresponding scatter plot.)

Using Similarity Matrix for Cluster Validation

(Figure: DBSCAN clusters 1-7 found in a more complicated data set, with the corresponding cluster-ordered similarity matrix.)

Clusters in more complicated figures aren't well separated.

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

(Figure: the ten-cluster example data set and its SSE as a function of K, for K from 2 to 30.)
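A sketch of using SSE to estimate the number of clusters, assuming scikit-learn; the knee of the printed curve suggests K:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in range(5)])  # 5 synthetic clusters

# SSE (inertia) for a range of K; look for a knee in the curve.
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    print(k, round(km.inertia_, 2))
```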

Internal Measures SSE

SSE curve for a more complicated data set

(Figure: the more complicated data set with clusters 1-7, and the SSE of clusters found using K-means.)

Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

(Figure: histogram of SSE values obtained by clustering many sets of random points, used as a reference distribution, shown next to the unit-square data set.)

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

(Figure: the same two data sets; Corr = −0.9235 for the clustered data and Corr = −0.5810 for the random data.)

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error

Internal Measures Cohesion and Separation

Cohesion is measured by the within-cluster sum of squares (WSS) and separation by the between-cluster sum of squares (BSS):

$$WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2 \qquad BSS = \sum_i |C_i| \, (m - m_i)^2$$

where $m_i$ is the centroid of cluster $C_i$ and $m$ is the overall mean.

Internal Measures Cohesion and Separation

Example SSE

Points 1, 2, 4, 5 on a line; overall mean m = 3; for K = 2, centroids m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
$$WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$$
$$BSS = 4 \times (3-3)^2 = 0$$
$$Total = 10 + 0 = 10$$

K = 2 clusters ({1, 2} and {4, 5}):
$$WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$$
$$BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$$
$$Total = 1 + 9 = 10$$
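A sketch that reproduces the arithmetic above; the data and labels mirror the worked example:

```python
import numpy as np

def wss_bss(X, labels):
    """Total within-cluster (WSS) and between-cluster (BSS) squared error,
    matching the formulas above."""
    m = X.mean(axis=0)
    wss = bss = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        mi = pts.mean(axis=0)
        wss += ((pts - mi) ** 2).sum()
        bss += len(pts) * ((m - mi) ** 2).sum()
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])
print(wss_bss(X, np.array([0, 0, 1, 1])))   # (1.0, 9.0), as in the K=2 example
print(wss_bss(X, np.zeros(4, dtype=int)))   # (10.0, 0.0), as in the K=1 example
```

Note that WSS + BSS is the same (10) in both cases: the total is constant for a given data set.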

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

(Figure: proximity-graph view of cohesion (links within a cluster) and separation (links between clusters).)

Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.

For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to the points in that cluster. The silhouette coefficient is s = (b − a) / max(a, b), which typically lies between 0 and 1, and the closer to 1 the better.

Can calculate the average silhouette width for a cluster or a clustering.

Internal Measures: Silhouette Coefficient
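A sketch assuming scikit-learn, which provides this per-point coefficient and its average:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
s_i = silhouette_samples(X, labels)   # s for each individual point
print(silhouette_score(X, labels))    # average silhouette width of the clustering
```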

External Measures of Cluster Validity: Entropy and Purity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity

I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 27: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Importance of Choosing Initial Centroids hellip

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 1

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 2

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 3

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

x

y

Iteration 4

-2 -15 -1 -05 0 05 1 15 2

0

05

1

15

2

25

3

xy

Iteration 5

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 28: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Problems with Selecting Initial Points

If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 29: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix



Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters, i.e., it is determined by one pair of points

Example similarity matrix:

     I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering MIN

[Figure: single-link nested clusters of points 1-6 and the corresponding dendrogram (merge heights roughly 0.05 to 0.2)]

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters, i.e., it is determined by all pairs of points in the two clusters (using the same example similarity matrix as above)

Hierarchical Clustering MAX

[Figure: complete-link nested clusters of points 1-6 and the corresponding dendrogram (merge heights up to about 0.4)]

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters

• Biased towards globular clusters

Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

$$\mathrm{proximity}(\mathrm{Cluster}_i,\ \mathrm{Cluster}_j)=\frac{\sum_{p_i\in \mathrm{Cluster}_i,\ p_j\in \mathrm{Cluster}_j}\mathrm{proximity}(p_i,p_j)}{|\mathrm{Cluster}_i|\times|\mathrm{Cluster}_j|}$$

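A minimal sketch of the three proximity definitions applied to the example similarity matrix shown earlier (Python/NumPy; the two cluster index lists are illustrative). Note that with similarities, single link takes the maximum and complete link the minimum; with distances the roles swap.

import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

def min_link(S, a, b):       # single link: most similar pair
    return S[np.ix_(a, b)].max()

def max_link(S, a, b):       # complete link: least similar pair
    return S[np.ix_(a, b)].min()

def group_average(S, a, b):  # mean over all |a| * |b| pairs
    return S[np.ix_(a, b)].mean()

a, b = [0, 1], [2, 3, 4]       # clusters {I1, I2} and {I3, I4, I5}
print(min_link(S, a, b))       # 0.70 (the I2-I3 pair)
print(max_link(S, a, b))       # 0.10 (the I1-I3 pair)
print(group_average(S, a, b))  # about 0.458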

Hierarchical Clustering Group Average

[Figure: group-average nested clusters of points 1-6 and the corresponding dendrogram (merge heights up to about 0.25)]

Hierarchical Clustering: Group Average is a compromise between single and complete link

Strengths: less susceptible to noise and outliers

Limitations: biased towards globular clusters

Cluster Similarity: Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

[Figure: the same six points clustered by MIN, MAX, Group Average, and Ward's Method; the four panels show the different nested clusters each scheme produces]

Hierarchical Clustering Time and Space requirements

O(N²) space, since it uses the proximity matrix (N is the number of points)

O(N³) time in many cases: there are N steps, and at each step the proximity matrix of size N² must be searched and updated

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following: sensitivity to noise and outliers, difficulty handling clusters of different sizes and non-globular shapes, a tendency to break large clusters

MST Divisive Hierarchical Clustering: build the MST (Minimum Spanning Tree) of the data, starting from a proximity graph over all points

MST Divisive Hierarchical Clustering: use the MST to construct the hierarchy of clusters, splitting a cluster by breaking the largest remaining MST edge (a sketch follows)
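A minimal sketch of this idea with SciPy (assumed available; D is an assumed dense distance matrix): deleting the k-1 heaviest MST edges leaves k connected components, which are the clusters.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(D, k):
    """Split the data into k clusters by breaking the largest MST edges."""
    mst = minimum_spanning_tree(D).toarray()     # zeros everywhere off the tree
    for _ in range(k - 1):
        # Delete the heaviest remaining tree edge.
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0
    sym = mst + mst.T                            # make the graph undirected
    n_comp, labels = connected_components(sym > 0, directed=False)
    return labels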

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?

Clusters found in Random Data

[Figure: 100 points distributed uniformly at random in the unit square ("Random Points"), and the clusters that K-means, DBSCAN, and complete link each find in them]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).

4. Comparing the results of two different sets of cluster analyses to determine which is better.

5. Determining the "correct" number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy)

o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE)

o Relative Index: compares two different clusterings or clusters

Sometimes these are referred to as criteria instead of indices

Measuring Cluster Validity Via Correlation

Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column per data point, whose entry is 1 if the corresponding pair of points belongs to the same cluster and 0 otherwise

Compute the correlation between the two matrices; since both are symmetric, only the n(n-1)/2 entries above the diagonal need to be compared

High correlation (in magnitude) indicates that points that belong to the same cluster are close to each other

Not a good measure for some density- or contiguity-based clusters
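As a concrete illustration, a minimal sketch in Python/NumPy (X and labels are assumed inputs; Euclidean distance is an illustrative choice of proximity):

import numpy as np

def incidence_proximity_corr(X, labels):
    """Correlation between the distance matrix and the 0/1 incidence
    matrix, taken over each unordered pair of points once."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    inc = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(X), k=1)            # upper triangle only
    return np.corrcoef(D[iu], inc[iu])[0, 1]

Because D holds distances rather than similarities, a good clustering shows up as a strongly negative correlation, as in the Corr = -0.9235 example below.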

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: a well-clustered data set (Corr = -0.9235) and a random data set (Corr = -0.5810)]

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually

[Figure: a well-separated three-cluster data set and its sorted similarity matrix, showing three crisp blocks along the diagonal]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: sorted similarity matrix for a DBSCAN clustering of random data; the diagonal blocks are much fuzzier]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: sorted similarity matrix for a K-means clustering of random data]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: sorted similarity matrix for a complete-link clustering of random data]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complicated data set with seven labeled clusters, and its sorted similarity matrix]

Clusters in more complicated figures aren't well separated

Internal Measures: SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters (a sketch follows)
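As one way to use SSE for choosing the number of clusters, a minimal sketch (scikit-learn's KMeans and matplotlib are assumed available; X is an assumed data matrix): plot SSE against K and look for a knee in the curve.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = list(range(2, 11))
# inertia_ is scikit-learn's name for the SSE of the fitted clustering.
sses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(ks, sses, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("SSE")
plt.show()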

[Figure: the ten-cluster data set and its SSE-versus-K curve for K = 2 to 30]

Internal Measures: SSE

SSE curve for a more complicated data set

[Figure: the seven-cluster data set and the SSE of clusters found using K-means]

Framework for Cluster Validity

Need a framework to interpret any measure: if, say, our measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data

For comparing the results of two different sets of cluster analyses, a framework is less necessary, but there is still the question of whether the difference between two index values is significant

Statistical Framework for SSE

Example: compare the SSE of a clustering of well-separated points against the distribution of SSEs obtained by clustering random data

[Figure: histogram (count versus SSE, with SSE roughly 0.016 to 0.034) of the SSE of three clusters found in many sets of random data points, shown next to the three-cluster data set]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the well-clustered data set (Corr = -0.9235) and the random data set (Corr = -0.5810)]

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are; example: within-cluster sum of squares (WSS)

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters; example: between-cluster sum of squares (BSS)

Example: Squared Error

$$\mathrm{WSS}=\sum_i\sum_{x\in C_i}(x-m_i)^2$$

$$\mathrm{BSS}=\sum_i |C_i|\,(m-m_i)^2$$

where $C_i$ is the $i$-th cluster with centroid $m_i$ and size $|C_i|$, and $m$ is the overall mean of the data.

Internal Measures: Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a line (overall mean m = 3; cluster centroids m1 = 1.5, m2 = 4.5)

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 × (3-3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} and {4, 5}):
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 × (3-1.5)^2 + 2 × (4.5-3)^2 = 9
Total = 1 + 9 = 10

Note that WSS + BSS is the same for both clusterings: cohesion and separation trade off against each other
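A quick check of this bookkeeping in Python/NumPy (a sketch; the function simply mirrors the WSS/BSS definitions above):

import numpy as np

def wss_bss(X, labels):
    m = X.mean(axis=0)                        # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        mi = pts.mean(axis=0)                 # cluster centroid
        wss += ((pts - mi) ** 2).sum()
        bss += len(pts) * ((m - mi) ** 2).sum()
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])
print(wss_bss(X, np.array([0, 0, 0, 0])))     # (10.0, 0.0)
print(wss_bss(X, np.array([0, 0, 1, 1])))     # (1.0, 9.0): total is 10 either way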

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of the links within a cluster, and separation is the sum of the weights of the links between nodes in the cluster and nodes outside it

[Figure: a proximity graph illustrating cohesion (edges within a cluster) and separation (edges between clusters)]

Internal Measures: Silhouette Coefficient

Silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings

For an individual point i: let a = the average distance of i to the points in its own cluster, and b = the minimum, over the other clusters, of the average distance of i to the points in that cluster; the silhouette is s = (b - a) / max(a, b), which typically lies between 0 and 1 (the closer to 1, the better)

Can calculate the average silhouette width for a cluster or a clustering
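A minimal sketch of the per-point silhouette in Python/NumPy (X and labels are assumed inputs; every cluster is assumed to have at least two points):

import numpy as np

def silhouette(X, labels):
    """Silhouette s(i) = (b - a) / max(a, b) for every point i."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    s = np.zeros(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()                      # cohesion
        b = min(D[i, labels == c].mean()          # separation
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s      # average it for a cluster-level or clustering-level score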

External Measures of Cluster Validity: Entropy and Purity

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

Page 30: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 31: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 32: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

10 Clusters Example

Starting with some pairs of clusters having three initial centroids while other have only one

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering: Time and Space Requirements

O(N²) space, since it uses the proximity matrix (N is the number of points)

O(N³) time in many cases
• There are N merge steps, and at each step the N² proximity matrix must be searched and updated

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following:
• Sensitivity to noise and outliers
• Difficulty handling different sized clusters and non-globular shapes
• Breaking large clusters

MST: Divisive Hierarchical Clustering

Build MST (Minimum Spanning Tree)

Use MST for constructing hierarchy of clusters
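A sketch of the divisive idea (SciPy assumed, data made up): build the MST of the pairwise distances, then split by deleting the largest remaining MST edge until the desired number of components is reached.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(3, 0.3, (10, 2))])

mst = minimum_spanning_tree(squareform(pdist(X))).toarray()

k = 2
for _ in range(k - 1):                          # each deleted edge adds one component
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0                             # delete the largest remaining edge

n_comp, labels = connected_components(mst, directed=False)
print(n_comp, labels)                           # 2 components: one per blob
```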

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis, the analogous question is: how to evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?

Clusters found in Random Data

[Figure: points distributed uniformly at random in the unit square — the raw "Random Points", and the clusters found in them by DBSCAN, K-means, and Complete Link]

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information
   - Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the 'correct' number of clusters

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices:
o Proximity matrix
o "Incidence" matrix — one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two scatter plots in the unit square — well-separated clusters with Corr = -0.9235, and random points with Corr = -0.5810]
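A sketch of the computation (scikit-learn and SciPy assumed; the data is made up). Because proximity here is a distance, points in the same cluster have small entries exactly where the incidence matrix has 1s, so a good clustering shows up as a strongly negative correlation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(1, 0.2, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

iu = np.triu_indices(len(X), k=1)                 # entries above the diagonal
prox = squareform(pdist(X))[iu]                   # proximity (distance) entries
incid = (labels[:, None] == labels[None, :])[iu].astype(float)  # 1 if same cluster

print(np.corrcoef(prox, incid)[0, 1])             # strongly negative here
```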

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

[Figure: well-separated data in the unit square and its similarity matrix, sorted by cluster label — sharp block-diagonal structure]
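A sketch of the visual check (NumPy, scikit-learn, and Matplotlib assumed; data made up): sort the rows and columns of the similarity matrix by cluster label and look for block-diagonal structure.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.15, (40, 2)) for c in (0.2, 0.5, 0.8)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

D = squareform(pdist(X))
sim = 1.0 - D / D.max()                 # rescale distances into similarities in [0, 1]
order = np.argsort(labels)              # group points by cluster label
plt.imshow(sim[np.ix_(order, order)])
plt.colorbar(label='Similarity')
plt.show()                              # crisp blocks for real clusters, blur for noise
```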

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: DBSCAN clusters of the random data and the corresponding sorted similarity matrix]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: K-means clusters of the random data and the corresponding sorted similarity matrix]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: complete-link clusters of the random data and the corresponding sorted similarity matrix]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clusters (1–7) of a more complicated data set and the corresponding sorted similarity matrix]

Clusters in more complicated figures aren't well separated

Internal Index: Used to measure the goodness of a clustering structure without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures: SSE

[Figure: SSE versus number of clusters K (2–30) for K-means, next to the example data set]
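A sketch of using SSE to estimate the number of clusters (scikit-learn assumed, data made up with ten blobs): KMeans exposes the SSE as inertia_, and the curve flattens once K passes the true cluster count.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([i, (i % 2) * 3], 0.2, (30, 2)) for i in range(10)])

for k in (2, 5, 10, 15, 20, 25, 30):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 2))             # drops sharply up to k = 10, slowly after
```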

Internal Measures: SSE

SSE curve for a more complicated data set

[Figure: the more complicated data set (clusters 1–7) and the SSE of clusters found using K-means]

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses, a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

[Figure: histogram of SSE values (≈ 0.016–0.034) over many clusterings of random data, alongside a well-separated data set in the unit square]
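A sketch of the idea (scikit-learn assumed): cluster many purely random data sets, record the SSE of each, and compare the SSE observed on the real data against that reference distribution.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_sse(n=100, k=3, trials=100, seed=0):
    """SSE of K-means on `trials` uniform random data sets in the unit square."""
    rng = np.random.default_rng(seed)
    return np.array([
        KMeans(n_clusters=k, n_init=10, random_state=0)
        .fit(rng.uniform(0, 1, (n, 2))).inertia_
        for _ in range(trials)
    ])

sses = random_sse()
print(sses.min(), sses.mean(), sses.max())
# An observed SSE far below this range is unlikely to be an accident of randomness.
```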

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

[Figure: the same two data sets — Corr = -0.9235 for the well-separated clusters and Corr = -0.5810 for the random points]

Cluster Cohesion: Measures how closely related the objects in a cluster are

Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters

Example: Squared Error

Internal Measures Cohesion and Separation

Cohesion is measured by the within-cluster sum of squares:

\[ \mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2 \]

Separation is measured by the between-cluster sum of squares:

\[ \mathrm{BSS} = \sum_{i} |C_i| \, (m - m_i)^2 \]

where m_i is the centroid of cluster C_i and m is the overall mean.

Internal Measures: Cohesion and Separation

Example: SSE

[Figure: points 1, 2, 4, 5 on a number line, with overall centroid m = 3 and cluster centroids m1 = 1.5, m2 = 4.5]

K = 1 cluster:
\[ \mathrm{WSS} = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10 \]
\[ \mathrm{BSS} = 4 \times (3-3)^2 = 0 \]
\[ \mathrm{Total} = 10 + 0 = 10 \]

K = 2 clusters ({1, 2} and {4, 5}):
\[ \mathrm{WSS} = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1 \]
\[ \mathrm{BSS} = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9 \]
\[ \mathrm{Total} = 1 + 9 = 10 \]
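The arithmetic above can be checked in a few lines (NumPy assumed); note that WSS + BSS is the same for both clusterings, since their sum is the constant total sum of squares:

```python
import numpy as np

def wss_bss(points, labels):
    points, labels = np.asarray(points, float), np.asarray(labels)
    m = points.mean()                       # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        mi = cluster.mean()                 # cluster centroid
        wss += ((cluster - mi) ** 2).sum()
        bss += len(cluster) * (m - mi) ** 2
    return wss, bss

pts = [1, 2, 4, 5]
print(wss_bss(pts, [0, 0, 0, 0]))           # K=1: (10.0, 0.0)
print(wss_bss(pts, [0, 0, 1, 1]))           # K=2: (1.0, 9.0)
```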

A proximity graph based approach can also be used for cohesion and separation

Internal Measures: Cohesion and Separation

[Figure: a proximity graph — cohesion corresponds to within-cluster edges, separation to between-cluster edges]

Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings

For an individual point i:
o Calculate a = average distance of i to the points in its cluster
o Calculate b = min (average distance of i to points in another cluster)
o The silhouette coefficient of i is then

\[ s = \frac{b - a}{\max(a, b)} \]

o A value close to 1 is good, since it means a is small relative to b

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures: Silhouette Coefficient
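scikit-learn ships both the per-point values and the average; a sketch on made-up blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_samples(X, labels)[:5])    # silhouette of individual points
print(silhouette_score(X, labels))          # average silhouette width of the clustering
```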

External Measures of Cluster Validity: Entropy and Purity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

— Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function …
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering – Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids …
  • Importance of Choosing Initial Centroids …
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Ward's Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 33: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Solutions to Initial Centroids Problem

o Multiple runs

o Sample and use hierarchical clustering to determine initial centroids

o Select more than k initial centroids and then select among these initial centroids

o Postprocessing o Bisecting K-means

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 34: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Handling Empty Clusters

Basic K-means algorithm can yield empty clusters

Several strategies

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 35: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Updating Centers Incrementally

In the basic K-means algorithm centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

Cohesion is measured by the within-cluster sum of squares (SSE):

$$\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$$

Separation is measured by the between-cluster sum of squares:

$$\mathrm{BSS} = \sum_{i} |C_i| \, (m - m_i)^2$$

where $m$ is the overall mean and $m_i$ is the centroid of cluster $C_i$

Internal Measures Cohesion and Separation

Example SSE

Data points 1, 2, 4, 5 on a line; overall centroid m = 3; for K = 2, cluster {1, 2} has centroid m1 = 1.5 and cluster {4, 5} has centroid m2 = 4.5

K = 1 cluster:

$$\mathrm{WSS} = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$$
$$\mathrm{BSS} = 4 \times (3-3)^2 = 0$$
$$\mathrm{Total} = 10 + 0 = 10$$

K = 2 clusters:

$$\mathrm{WSS} = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$$
$$\mathrm{BSS} = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$$
$$\mathrm{Total} = 1 + 9 = 10$$

(The total, WSS + BSS, is constant regardless of the clustering)
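The same arithmetic as a short NumPy check (the point set and the split are taken from the example above):

```python
# Verify WSS, BSS, and their constant total for the 1-D example.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                          # overall centroid, 3.0

def wss_bss(clusters):
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))            # K=1: (10.0, 0.0)
print(wss_bss([x[:2], x[2:]])) # K=2: (1.0, 9.0); WSS + BSS = 10 in both cases
```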

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

[Figure: proximity graph with within-cluster edges (cohesion) and between-cluster edges (separation)]
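A small NumPy sketch of the graph-based view, summing within-cluster link weights (cohesion) versus between-cluster link weights (separation); the toy similarity matrix and labels are assumptions:

```python
# Graph-based cohesion and separation from a similarity (proximity) matrix.
import numpy as np

rng = np.random.default_rng(0)
sim = rng.random((6, 6))
sim = (sim + sim.T) / 2                  # symmetric toy similarity matrix
labels = np.array([0, 0, 0, 1, 1, 1])    # assumed cluster labels

same = labels[:, None] == labels[None, :]
iu = np.triu_indices_from(sim, k=1)      # count each pair of points once
cohesion = sim[iu][same[iu]].sum()       # total weight of links inside clusters
separation = sim[iu][~same[iu]].sum()    # total weight of links between clusters
print(f"cohesion = {cohesion:.3f}, separation = {separation:.3f}")
```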

Silhouette Coefficient: combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings

For an individual point i: calculate a = average distance of i to the points in its own cluster, and b = min over the other clusters of the average distance of i to points in that cluster; the silhouette coefficient is then s = 1 - a/b for a < b (equivalently s = (b - a)/max(a, b)), typically between 0 and 1, and the closer to 1 the better

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient
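A sketch via scikit-learn, which implements the closely related form s = (b - a)/max(a, b) (identical to 1 - a/b when a < b); the blob data is an illustrative assumption:

```python
# Per-point silhouette values and the average silhouette width.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)    # assumed data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)      # one coefficient per point, in [-1, 1]
print(f"average silhouette width: {silhouette_score(X, labels):.3f}")
print(f"least well-placed point:  {s.min():.3f}")   # near 1 is good, near 0 is borderline
```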

External Measures of Cluster Validity Entropy and Purity
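The entropy/purity table from this slide did not survive extraction, but here is a hedged sketch of the two measures themselves (toy cluster and class labels are assumptions): purity is the weighted fraction of points falling in each cluster's majority class, and entropy is the weighted class entropy per cluster.

```python
# Purity and entropy of a clustering, judged against external class labels.
import numpy as np
from scipy.stats import entropy

def purity_and_entropy(clusters, classes):
    n = len(classes)
    purity, ent = 0.0, 0.0
    for c in np.unique(clusters):
        counts = np.bincount(classes[clusters == c])         # class counts in cluster c
        purity += counts.max()                               # size of the majority class
        ent += counts.sum() * entropy(counts / counts.sum(), base=2)
    return purity / n, ent / n                               # weighted by cluster size

clusters = np.array([0, 0, 0, 1, 1, 1, 1])                   # assumed toy labels
classes  = np.array([0, 0, 1, 1, 1, 1, 0])
pur, ent = purity_and_entropy(clusters, classes)
print(f"purity = {pur:.3f}, entropy = {ent:.3f} bits")       # higher purity, lower entropy is better
```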

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

— Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 36: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Pre-processing and Post-processing

Pre-processing

Post-processing

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 37: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Bisecting K-means

Bisecting K-means algorithm

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 38: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Bisecting K-means Example

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 39: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Limitations of K-means

K-means has problems when clusters are of differing

K-means has problems when the data contains outliers

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix.

[Figure: five points p1–p5 and the corresponding 5 × 5 proximity matrix]

Intermediate Situation

After some merging steps, we have some clusters.

[Figure: clusters C1–C5 and the corresponding proximity matrix]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5, with C2 and C5 about to be merged]

After Merging

The question is "How do we update the proximity matrix?"

[Figure: clusters C1, C2 U C5, C3, C4 and the proximity matrix with the merged row and column to be filled in]

How to Define Inter-Cluster Similarity

[Figure: two clusters drawn over points p1–p5 and their proximity matrix; the slide repeats once for each definition below]

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function (Ward's Method uses squared error)

Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
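In symbols (this formula is implied by the slide's wording rather than printed on it; with similarities, "closest" means largest):

$$\mathrm{sim}(C_i, C_j) = \max_{p \in C_i,\; q \in C_j} \mathrm{sim}(p, q)$$

For example, from the table above, sim({I1, I2}, {I4, I5}) = max(0.65, 0.20, 0.60, 0.50) = 0.65.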


Hierarchical Clustering: MIN

[Figure: nested clusters for points 1–6 and the single-link dendrogram (leaf order 3, 6, 2, 5, 4, 1; merge heights up to 0.2)]

Strength of MIN

[Figure: Original Points vs. Two Clusters]

• Can handle non-elliptical shapes

Limitations of MIN

[Figure: Original Points vs. Two Clusters]

• Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters; it is determined by all pairs of points in the two clusters.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
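Again in symbols (implied by the slide; with similarities, "least similar" means smallest):

$$\mathrm{sim}(C_i, C_j) = \min_{p \in C_i,\; q \in C_j} \mathrm{sim}(p, q)$$

For example, sim({I1, I2}, {I4, I5}) = min(0.65, 0.20, 0.60, 0.50) = 0.20.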


Hierarchical Clustering: MAX

[Figure: nested clusters for points 1–6 and the complete-link dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights up to 0.4)]

Strength of MAX

[Figure: Original Points vs. Two Clusters]

• Less susceptible to noise and outliers

Limitations of MAX

[Figure: Original Points vs. Two Clusters]

• Tends to break large clusters
• Biased towards globular clusters

Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Need to use average connectivity for scalability, since total proximity favors large clusters.

$$\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \frac{\sum\limits_{p_i \in \mathrm{Cluster}_i,\; p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \ast |\mathrm{Cluster}_j|}$$

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
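A worked instance of the formula on the table above (my example, not the slide's):

$$\mathrm{proximity}(\{I_1, I_2\}, \{I_4, I_5\}) = \frac{0.65 + 0.20 + 0.60 + 0.50}{2 \ast 2} = 0.4875$$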


Hierarchical Clustering: Group Average

[Figure: nested clusters for points 1–6 and the group-average dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights up to 0.25)]

Hierarchical Clustering: Group Average

Compromise between Single and Complete Link.

Strengths: less susceptible to noise and outliers.

Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method

Similarity of two clusters is based on the increase in squared error when the two clusters are merged.

Less susceptible to noise and outliers.

Biased towards globular clusters.

Hierarchical analogue of K-means; can be used to initialize K-means.

Hierarchical Clustering: Comparison

[Figure: the same six points clustered by MIN, MAX, Group Average, and Ward's Method, showing the different nestings each scheme produces]

Hierarchical Clustering: Time and Space Requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N steps, and at each step a proximity matrix of size N² must be searched and updated.

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized.

Different schemes have problems with one or more of the following: sensitivity to noise and outliers, difficulty handling clusters of different sizes or non-convex shapes, and a tendency to break large clusters.

MST Divisive Hierarchical Clustering: build an MST (Minimum Spanning Tree) from the proximity graph.

MST Divisive Hierarchical Clustering: use the MST for constructing the hierarchy of clusters by repeatedly breaking the largest remaining edge, so that each cut splits one cluster in two (see the sketch below).
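A compact sketch of this divisive procedure, assuming SciPy is available and distances are Euclidean (my illustration, not the slides' code):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_divisive(points, num_clusters):
    """Top-down clustering: build the MST of the complete distance
    graph, then repeatedly break the largest remaining edge."""
    dist = squareform(pdist(np.asarray(points, dtype=float)))
    mst = minimum_spanning_tree(dist).toarray()   # zero entries = no edge
    # Each cut splits one component in two, so num_clusters - 1 cuts
    # leave exactly num_clusters connected components.
    for _ in range(num_clusters - 1):
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels
```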

Cluster Validity

For supervised classification, we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? For example, to avoid finding patterns in noise, to compare clustering algorithms, and to compare two sets of clusters.

Clusters found in Random Data

[Figure: four panels over the same uniformly random 2-D points — the original Random Points, and the clusters "found" by K-means, DBSCAN, and Complete Link]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
• External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy).
• Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE).
• Relative Index: compares two different clusterings or clusters; often an external or internal index is used for this purpose.

Sometimes these are referred to as criteria instead of indices.

Measuring Cluster Validity Via Correlation

Two matrices:
• Proximity matrix.
• "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise.

Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between n(n−1)/2 entries needs to be calculated.

High correlation indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two 2-D data sets — one with well-separated clusters, one of uniformly random points]

Corr = -0.9235    Corr = -0.5810

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: a data set with three well-separated clusters and its similarity matrix sorted by cluster label, showing a sharp block-diagonal structure]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: uniformly random points and their similarity matrix sorted by DBSCAN cluster labels; the block-diagonal structure is fuzzy]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: uniformly random points and their similarity matrix sorted by K-means cluster labels]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: uniformly random points and their similarity matrix sorted by Complete Link cluster labels]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clusters 1–7 in a more complicated data set, with the corresponding similarity matrix]

Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE).

Can also be used to estimate the number of clusters, as in the sketch below.
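A common recipe is to plot SSE against K and look for a "knee" in the curve. A minimal sketch assuming scikit-learn is available (its KMeans reports SSE as inertia_):

```python
from sklearn.cluster import KMeans

def sse_curve(X, max_k=10):
    """SSE of K-means for K = 1..max_k; plot it and look for a knee."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, max_k + 1)]
```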

[Figure: a data set with ten natural clusters and its SSE-versus-K curve (K from 2 to 30); the curve drops sharply and then flattens near the natural number of clusters]

Internal Measures: SSE

SSE curve for a more complicated data set.

[Figure: clusters 1–7 in the complicated data set, with the SSE curve of clusters found using K-means]

Framework for Cluster Validity

Need a framework to interpret any measure: if, say, our measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary; the question there is whether the difference between the two index values is significant.

Statistical Framework for SSE

Example: compare the SSE of a clustering against the SSEs obtained by clustering many sets of random data of the same size; an observed SSE well below that distribution is unlikely to have arisen by chance (see the sketch below).

[Figure: histogram of SSE values (roughly 0.016–0.034) from clusterings of random data, alongside the original clustered data set]
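A sketch of that comparison, assuming uniform random 2-D data and K-means SSE via scikit-learn's inertia_ (the trial count and other parameters are illustrative, not taken from the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

def random_sse_distribution(n_points, k=3, n_trials=500, seed=0):
    """Distribution of K-means SSE on uniform random 2-D data.
    An observed SSE far below this histogram is unlikely to be chance."""
    rng = np.random.default_rng(seed)
    return [KMeans(n_clusters=k, n_init=10, random_state=0)
            .fit(rng.random((n_points, 2))).inertia_
            for _ in range(n_trials)]
```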

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the two 2-D data sets again — well-separated clusters and uniformly random points]

Corr = -0.9235    Corr = -0.5810

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error, as given below.

$$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$$

$$\mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2$$

(where $m_i$ is the centroid of cluster $C_i$ and $m$ is the overall mean of the data)

Internal Measures: Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a number line, with centroids m1 = 1.5 (for {1, 2}), m2 = 4.5 (for {4, 5}), and overall mean m = 3.

K = 2 clusters:
WSS = (1 - 1.5)² + (2 - 1.5)² + (4 - 4.5)² + (5 - 4.5)² = 1
BSS = 2 × (3 - 1.5)² + 2 × (4.5 - 3)² = 9
Total = WSS + BSS = 10

K = 1 cluster:
WSS = (1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)² = 10
BSS = 4 × (3 - 3)² = 0
Total = WSS + BSS = 10

(Note that WSS + BSS is constant: cohesion can be traded for separation, but their sum is fixed for a given data set.)
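A small sketch that reproduces these numbers (my code, following the WSS and BSS formulas above):

```python
import numpy as np

def wss_bss(points, labels):
    """Within-cluster (WSS, cohesion) and between-cluster (BSS,
    separation) sums of squares for a labeled clustering."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    m = points.mean(axis=0)                       # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        mi = cluster.mean(axis=0)                 # cluster centroid
        wss += ((cluster - mi) ** 2).sum()
        bss += len(cluster) * ((m - mi) ** 2).sum()
    return wss, bss

# Reproduces the example above (points 1, 2, 4, 5 on a line):
print(wss_bss([[1], [2], [4], [5]], [0, 0, 0, 0]))  # (10.0, 0.0)  K = 1
print(wss_bss([[1], [2], [4], [5]], [0, 0, 1, 1]))  # (1.0, 9.0)   K = 2
```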

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of all links within a cluster, and separation is the sum of the weights of links between the cluster and other clusters.

[Figure: proximity-graph view of cohesion (edges within a cluster) and separation (edges between clusters)]

Internal Measures: Silhouette Coefficient

The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point i:
• a = average distance of i to the points in its own cluster.
• b = the minimum, over the other clusters, of the average distance of i to the points in that cluster.
• s = (b - a) / max(a, b).

Typically s is between 0 and 1; the closer to 1, the better (see the sketch below).

Can calculate the average silhouette width for a cluster or a clustering.
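A direct transcription of this definition into code (my sketch; it assumes Euclidean distance and that point i's cluster contains at least two points):

```python
import numpy as np

def silhouette(i, points, labels):
    """Silhouette coefficient of point i: s = (b - a) / max(a, b)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    d = np.sqrt(((points - points[i]) ** 2).sum(axis=1))  # distances from i
    same = (labels == labels[i]) & (np.arange(len(points)) != i)
    a = d[same].mean()                    # avg distance within own cluster
    b = min(d[labels == c].mean()         # smallest avg distance to
            for c in set(labels) if c != labels[i])  # any other cluster
    return (b - a) / max(a, b)
```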

External Measures of Cluster Validity: Entropy and Purity
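The slide's table of values did not survive extraction; the standard definitions behind these two measures are as follows, where m_ij is the number of points of class i in cluster j, m_j is the size of cluster j, and m is the total number of points:

$$p_{ij} = \frac{m_{ij}}{m_j}, \qquad e_j = -\sum_i p_{ij} \log_2 p_{ij}, \qquad e = \sum_j \frac{m_j}{m}\, e_j$$

$$\mathrm{purity}_j = \max_i p_{ij}, \qquad \mathrm{purity} = \sum_j \frac{m_j}{m}\, \mathrm{purity}_j$$

Lower entropy and higher purity indicate better agreement with the externally given class labels.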

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

— Algorithms for Clustering Data, Jain and Dubes

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 40: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Limitations of K-means Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 41: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Limitations of K-means Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 42: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Limitations of K-means Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

(Figure: the reordered similarity matrix for the K-means clustering of the same random points, again without crisp blocks.)

(Figure: the reordered similarity matrix for the complete-link clustering of the random points; here too the blocks are weak.)

Using Similarity Matrix for Cluster Validation

(Figure: DBSCAN finds seven clusters in a more complicated data set; the reordered similarity matrix, roughly 3000 x 3000, shows diagonal blocks that are not sharply separated.)

Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE).

Can also be used to estimate the number of clusters.

(Figure: the ten-cluster example data set, and its SSE curve for K = 2 to 30; the SSE drops steeply for small K and flattens out once K reaches the natural number of clusters.)
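A sketch of how such a curve is produced, assuming scikit-learn (whose inertia_ attribute is exactly the within-cluster SSE); the ten-blob synthetic data set stands in for the figure's data.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)
ks = range(2, 31)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]
plt.plot(list(ks), sse, marker="o")   # look for the knee in this curve
plt.xlabel("K"); plt.ylabel("SSE")
plt.show()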

Internal Measures: SSE

SSE curve for a more complicated data set. (Figure: the seven-cluster data set from above, with the SSE of clusters found using K-means.)

Framework for Cluster Validity

Need a framework to interpret any measure: if a validity index comes out at, say, 10, is that good or bad?

Statistics provide a framework for cluster validity: compare the value of an index on the actual data against the distribution of values obtained from random data; if the observed value is unlikely under randomness, the clustering likely reflects real structure.

For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Statistical Framework for SSE

Example:

(Figure: histogram of the SSE values obtained by clustering many random data sets, ranging from roughly 0.016 to 0.034, together with a scatter plot of one such random data set; an SSE on real data far below this range indicates genuinely tight clusters.)
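A sketch of the resampling behind such a histogram, assuming scikit-learn; the number of trials, the data-set size and range, and the "observed" SSE are all illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
null_sse = []
for _ in range(500):                       # cluster 500 random data sets
    X = rng.uniform(0.2, 0.8, size=(100, 2))
    null_sse.append(KMeans(n_clusters=3, n_init=10).fit(X).inertia_)

observed = 0.005                           # hypothetical SSE from the real data
# Empirical p-value: how often random data looks at least this good.
print(np.mean(np.array(null_sse) <= observed))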

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the two data sets shown earlier: Corr = -0.9235 for the well-separated clusters versus Corr = -0.5810 for the random points. (Figure: the two scatter plots.)

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error.

Cohesion is measured by the within-cluster sum of squares (SSE):

    WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares:

    BSS = \sum_i |C_i| (m - m_i)^2

where m is the overall mean of the data and m_i is the centroid of cluster C_i.

Internal Measures: Cohesion and Separation

Example: SSE for the one-dimensional data points 1, 2, 4, 5, with overall mean m = 3 and cluster centroids m1 = 1.5 (for {1, 2}) and m2 = 4.5 (for {4, 5}).

K = 1 cluster:
    WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
    BSS = 4 x (3-3)^2 = 0
    Total = 10 + 0 = 10

K = 2 clusters:
    WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
    BSS = 2 x (3-1.5)^2 + 2 x (4.5-3)^2 = 9
    Total = 1 + 9 = 10

WSS + BSS equals the total sum of squares, which is constant for a given data set.
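The same arithmetic in code, as a small NumPy sketch; because WSS + BSS is constant, K-means' minimization of WSS is at the same time a maximization of BSS.

import numpy as np

def wss_bss(x, labels):
    x, labels = np.asarray(x, dtype=float), np.asarray(labels)
    m = x.mean()                               # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        xc = x[labels == c]
        wss += ((xc - xc.mean()) ** 2).sum()   # cohesion of cluster c
        bss += len(xc) * (m - xc.mean()) ** 2  # separation of cluster c
    return wss, bss

x = [1, 2, 4, 5]
print(wss_bss(x, [0, 0, 0, 0]))   # K=1: (10.0, 0.0)
print(wss_bss(x, [0, 0, 1, 1]))   # K=2: (1.0, 9.0)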

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of the links within a cluster, and separation is the sum of the weights of the links from nodes inside the cluster to nodes outside it. (Figure: a graph whose within-cluster edges are labeled "cohesion" and whose between-cluster edges are labeled "separation".)

Internal Measures: Silhouette Coefficient

The silhouette coefficient combines the ideas of cohesion and separation, but for individual points as well as for clusters and clusterings.

For an individual point i:
o a = the average distance of i to the points in its own cluster
o b = the minimum, over the other clusters, of the average distance of i to the points in that cluster
o s = (b - a) / max(a, b); s typically lies between 0 and 1, and the closer to 1, the better

Can calculate the average silhouette width for a cluster or a clustering.
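In practice this is available directly in scikit-learn (a sketch; the blob data set is an illustrative assumption): silhouette_samples returns the per-point values s_i, and silhouette_score their average over the whole clustering.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_samples(X, labels)[:5])   # s_i for the first few points
print(silhouette_score(X, labels))         # average silhouette width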

External Measures of Cluster Validity: Entropy and Purity

Entropy measures the degree to which each cluster consists of objects of a single class (lower is better); purity measures the fraction of each cluster taken up by its majority class (higher is better). Both are averaged over clusters, weighted by cluster size, as in the sketch below.
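A small sketch of both measures, assuming NumPy; the class and cluster labels at the bottom are made-up toy data.

import numpy as np

def entropy_and_purity(classes, clusters):
    cls, clu = np.asarray(classes), np.asarray(clusters)
    n, ent, pur = len(cls), 0.0, 0.0
    for c in np.unique(clu):
        counts = np.bincount(cls[clu == c])        # class counts in cluster c
        p = counts[counts > 0] / counts.sum()
        ent += counts.sum() / n * -(p * np.log2(p)).sum()  # weighted entropy
        pur += counts.max() / n                    # majority-class share
    return ent, pur

classes  = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
print(entropy_and_purity(classes, clusters))       # about (0.541, 0.833)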

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

- Algorithms for Clustering Data, Jain and Dubes

Page 43: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters Find parts of clusters but need to put together

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 44: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 45: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Overcoming K-means Limitations

Original Points K-means Clusters

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 46: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram

1 3 2 5 4 60

005

01

015

02

1

2

3

4

5

6

1

23 4

5

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

They may correspond to meaningful taxonomies

Hierarchical Clustering

Two main types of hierarchical clustering

Traditional hierarchical algorithms use a similarity or distance matrix

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1 to C5, with the two closest clusters C2 and C5 highlighted, and their proximity matrix]

After Merging

The question is: "How do we update the proximity matrix?"

[Figure: clusters C1, C2 ∪ C5, C3, and C4, and the proximity matrix with a merged row and column for C2 ∪ C5]

How to Define Inter-Cluster Similarity

[Figure: two clusters drawn over points p1 to p5, with their proximity matrix]

Candidate definitions: MIN, MAX, Group Average, Distance Between Centroids, and other methods driven by an objective function (Ward's Method uses squared error).


Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters; i.e., it is determined by a single pair of points, one from each cluster.

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

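As a quick check (the split into {I1, I2} and {I3, I4, I5} is my illustrative choice, not from the slides), single link takes the largest cross-cluster similarity:

```python
# Cross-cluster similarities between {I1,I2} and {I3,I4,I5}, read off the table
cross = [0.10, 0.65, 0.20,   # I1 vs I3, I4, I5
         0.70, 0.60, 0.50]   # I2 vs I3, I4, I5
print(max(cross))            # single link -> 0.70, via the pair (I2, I3)
```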

Hierarchical Clustering: MIN

[Figure: nested clusters and dendrogram produced by single link on the six-point example; dendrogram leaf order 3, 6, 2, 5, 4, 1, with merge heights up to about 0.2]

Strength of MIN

[Figure: original points and the two clusters found by single link]

• Can handle non-elliptical shapes.

Limitations of MIN

[Figure: original points and the two clusters found by single link]

• Sensitive to noise and outliers.

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters.

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00


Hierarchical Clustering: MAX

[Figure: nested clusters and dendrogram produced by complete link on the six-point example; dendrogram leaf order 3, 6, 4, 1, 2, 5, with merge heights up to about 0.4]

Strength of MAX

[Figure: original points and the two clusters found by complete link]

• Less susceptible to noise and outliers.

Limitations of MAX

[Figure: original points and the two clusters found by complete link]

• Tends to break large clusters.
• Biased towards globular clusters.

Cluster Similarity: Group Average

Proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

$$\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i,\ p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \times |\text{Cluster}_j|}$$

Need to use average connectivity for scalability, since total proximity favors large clusters.

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

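Reading the table into NumPy makes it easy to compare what the three linkages report for the same illustrative split into {I1, I2} and {I3, I4, I5} (again my example, not the slides'):

```python
import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

A, B = [0, 1], [2, 3, 4]        # clusters {I1,I2} and {I3,I4,I5}
cross = S[np.ix_(A, B)]         # 2x3 block of cross-cluster similarities
print(cross.max())              # MIN / single link   -> 0.70
print(cross.min())              # MAX / complete link -> 0.10
print(round(cross.mean(), 3))   # group average       -> 0.458
```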

Hierarchical Clustering: Group Average

[Figure: nested clusters and dendrogram produced by group average on the six-point example; dendrogram leaf order 3, 6, 4, 1, 2, 5, with merge heights up to about 0.25]

Hierarchical Clustering: Group Average

Compromise between single and complete link.

Strengths: less susceptible to noise and outliers.

Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method

Similarity of two clusters is based on the increase in squared error when the two clusters are merged.

Less susceptible to noise and outliers. Biased towards globular clusters. Hierarchical analogue of K-means, and can be used to initialize K-means.
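All of the linkages discussed here are available in SciPy, so the methods are easy to compare side by side. A small sketch on synthetic data (the blob parameters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)),      # two tight blobs
               rng.normal(1, 0.1, (10, 2))])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                # (n-1, 4) merge history
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)                        # each recovers the two blobs
```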

Hierarchical Clustering: Comparison

[Figure: the six-point example clustered by MIN, MAX, Group Average, and Ward's Method; the four linkages produce different nested clusterings of the same points]

Hierarchical Clustering: Time and Space Requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N steps, and at each step a proximity matrix of size O(N²) must be searched and updated. The complexity can be reduced to O(N² log N) for some approaches.

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized.

Different schemes have problems with one or more of the following: sensitivity to noise and outliers, difficulty handling clusters of different sizes and non-globular shapes, and a tendency to break large clusters.

MST Divisive Hierarchical Clustering

Build an MST (Minimum Spanning Tree): start with a tree consisting of any single point; in successive steps, find the closest pair of points (p, q) with p in the current tree and q outside it, and add q and the edge (p, q) to the tree.

Use the MST to construct a hierarchy of clusters: repeatedly split by removing the remaining MST edge with the largest distance.
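A compact sketch of this divisive scheme using SciPy's graph routines; the helper below is mine, and it relies on the fact that cutting the k-1 heaviest MST edges yields k single-link clusters:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(D, k):
    """Cut the k-1 heaviest MST edges; the components are the clusters."""
    T = minimum_spanning_tree(D).toarray()      # weighted edges of the MST
    edges = np.argwhere(T > 0)
    heaviest = np.argsort(T[T > 0])[::-1][:k - 1]
    for i, j in edges[heaviest]:                # break the longest edges
        T[i, j] = 0
    _, labels = connected_components(T > 0, directed=False)
    return labels

pts = np.array([[0.0], [1.0], [2.0], [8.0], [9.0]])
D = np.abs(pts - pts.T)                         # pairwise distances on a line
print(mst_divisive(D, 2))                       # -> [0 0 0 1 1]
```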

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is, e.g., accuracy, precision, and recall.

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, and to compare two clusterings or two clusters.

Clusters found in Random Data

[Figure: four panels over the same 100 uniformly random points in the unit square: the raw Random Points, and the clusters found by K-means, DBSCAN, and Complete Link]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

• External Index: used to measure the extent to which cluster labels match externally supplied class labels.
• Internal Index: used to measure the goodness of a clustering structure without respect to external information.
• Relative Index: used to compare two different clusterings or clusters.

Sometimes these are referred to as criteria instead of indices.

Measuring Cluster Validity Via Correlation

Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column per data point, whose entry is 1 if the associated pair of points belongs to the same cluster and 0 otherwise.

Compute the correlation between the two matrices. Since both matrices are symmetric, only the correlation between the n(n-1)/2 entries above the diagonal needs to be calculated.

High correlation indicates that points that belong to the same cluster are close to each other. Not a good measure for some density- or contiguity-based clusters.

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two scatter plots in the unit square; Corr = -0.9235 for the data set with well-separated clusters, Corr = -0.5810 for the random points]
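This computation is straightforward to reproduce. A sketch using scikit-learn and SciPy (the correlation comes out negative because the proximity matrix holds distances, so points in the same cluster have small entries):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.random((100, 2))                       # random points in the unit square
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

P = squareform(pdist(X))                       # proximity (distance) matrix
I = (labels[:, None] == labels[None, :]).astype(float)   # incidence matrix

iu = np.triu_indices_from(P, k=1)              # symmetric: upper triangle only
print(np.corrcoef(P[iu], I[iu])[0, 1])         # weakly negative for random data
```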

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: scatter plot of three well-separated clusters and the corresponding 100 by 100 similarity matrix sorted by cluster label; the three clusters appear as sharp bright blocks on the diagonal]
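A sketch of how such a plot can be produced with scikit-learn and matplotlib (the data generation and the rescaling of distance into similarity are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

order = np.argsort(labels)                 # group points by cluster label
D = squareform(pdist(X))
sim = 1 - D / D.max()                      # rescale distance into [0, 1] similarity
plt.imshow(sim[np.ix_(order, order)])      # crisp diagonal blocks appear
plt.colorbar(label="Similarity")
plt.show()
```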

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp:

[Figure: sorted similarity matrix for the DBSCAN clustering of the random points; the diagonal blocks are much fuzzier]

[Figure: the same visualization for the K-means clustering of the random points; again the blocks are fuzzy]

[Figure: the same visualization for the complete-link clustering of the random points; again the blocks are fuzzy]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN finds seven clusters in a more complicated data set of roughly 3000 points; the sorted similarity matrix shows diagonal block structure, but the blocks are not sharply separated]

Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information. SSE is an example.

SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters: plot SSE against the number of clusters K and look for the knee in the curve.

[Figure: a data set with ten well-separated clusters, and its SSE-versus-K curve for K from 2 to 30; SSE drops steeply and then flattens around K = 10]
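The curve is easy to reproduce; a sketch with scikit-learn (KMeans exposes the SSE of a fitted model as inertia_; the data set here is synthetic):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=10, random_state=0)

for k in [2, 5, 10, 15, 20, 25, 30]:
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))    # SSE drops sharply, then flattens past k = 10
```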

Internal Measures: SSE

SSE curve for a more complicated data set:

[Figure: a data set with subclusters labeled 1 to 7, and the SSE of the clusters found using K-means]

Framework for Cluster Validity

Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good or bad?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data. Compare the value of an index on the actual data with its distribution over random data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary, although there is still the question of whether the difference between two index values is significant.

Statistical Framework for SSE

Example: compare the SSE obtained on the actual data against the distribution of SSE values obtained by repeatedly clustering random data.

[Figure: histogram of the SSE of three K-means clusters over many sets of random points in the unit square; the SSE values fall roughly between 0.016 and 0.034, so a markedly lower SSE on real data is unlikely to arise by chance]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the two data sets shown earlier: Corr = -0.9235 for the well-separated clusters and Corr = -0.5810 for the random points. A correlation as strong as the first is unlikely to occur for random data.

[Figure: the two scatter plots]

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are. Example: the within-cluster sum of squares (WSS), i.e., SSE.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: the between-cluster sum of squares (BSS).

$$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2 \qquad\qquad \mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2$$

where $m_i$ is the mean of cluster $C_i$ and $m$ is the overall mean of the data.

Internal Measures: Cohesion and Separation

Example (SSE): the points 1, 2, 4, 5 on a line, with overall mean m = 3.

K = 1 cluster:
WSS = (1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)² = 10
BSS = 4 × (3 - 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters, {1, 2} with mean m1 = 1.5 and {4, 5} with mean m2 = 4.5:
WSS = (1 - 1.5)² + (2 - 1.5)² + (4 - 4.5)² + (5 - 4.5)² = 1
BSS = 2 × (3 - 1.5)² + 2 × (4.5 - 3)² = 9
Total = 1 + 9 = 10

Note that WSS + BSS is constant: the total is the SSE about the overall mean, which does not depend on the clustering.
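A few lines of NumPy confirm the decomposition for both values of K:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                    # overall mean: 3.0
for clusters in ([x], [x[:2], x[2:]]):          # K = 1, then K = 2 as above
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    print(wss, bss, wss + bss)                  # 10 0 10, then 1 9 10
```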

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of all links within a cluster, while separation is the sum of the weights of links from nodes in the cluster to nodes outside it.

[Figure: a proximity graph illustrating cohesion (within-cluster edges) and separation (between-cluster edges)]

Internal Measures: Silhouette Coefficient

The silhouette coefficient combines the ideas of cohesion and separation, but for individual points as well as for clusters and clusterings.

For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to the points in that cluster. Then

$$s = \frac{b - a}{\max(a, b)}$$

which typically lies between 0 and 1; the closer to 1, the better.

Can calculate the average silhouette width for a cluster or a clustering.
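scikit-learn computes the average silhouette directly, which makes it another quick way to pick K (sketch on synthetic data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # peaks near k = 4
```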

External Measures of Cluster Validity: Entropy and Purity
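The standard definitions, stated here for reference (with $p_{ij}$ the fraction of the members of cluster $j$ that belong to class $i$, $n_j = |C_j|$, and $n$ the total number of points):

$$e_j = -\sum_i p_{ij} \log_2 p_{ij}, \qquad e = \sum_j \frac{n_j}{n}\, e_j$$

$$\text{purity}_j = \max_i p_{ij}, \qquad \text{purity} = \sum_j \frac{n_j}{n}\, \text{purity}_j$$

Lower entropy and higher purity indicate better agreement with the externally supplied class labels.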

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

From Algorithms for Clustering Data, Jain and Dubes.

Page 49: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Agglomerative Clustering Algorithm More popular hierarchical clustering technique

Basic algorithm is straightforward

Key operation is the computation of the proximity of two clusters

Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

Intermediate Situation

After some merging steps we have some clusters

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

C1

C4

C2 C5

C3

C2 C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Internal Measures: Silhouette Coefficient

Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings.

For an individual point i, the standard formulation is

$$s(i) = \frac{b - a}{\max(a, b)}$$

where $a$ is the average distance from $i$ to the other points in its own cluster, and $b$ is the smallest average distance from $i$ to the points of any other cluster.

Can calculate the average silhouette width for a cluster or a clustering.
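A minimal sketch, assuming NumPy and scikit-learn are available, whose `silhouette_samples` and `silhouette_score` compute the per-point coefficient and the average width; the two-blob data set is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),      # two synthetic blobs
               rng.normal(1.0, 0.1, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)                  # one coefficient per point
print("average silhouette width:", silhouette_score(X, labels))
print("per-point range: %.3f to %.3f" % (s.min(), s.max()))
```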

External Measures of Cluster Validity: Entropy and Purity
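Purity can be sketched in a few lines; the `purity` helper below is a hypothetical illustration, assuming NumPy is available, with purity taken as the fraction of points falling in the majority class of their cluster.

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Fraction of points assigned to the majority class of their cluster."""
    hits = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        hits += np.bincount(members).max()         # size of the majority class
    return hits / len(labels_true)

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])        # known class labels
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2])        # cluster assignments
print(purity(y_true, y_pred))                      # 0.875
```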

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

(Algorithms for Clustering Data, Jain and Dubes)

Page 53: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

After Merging

The question is ldquoHow do we update the proximity matrixrdquo

C1

C4

C2 U C5

C3

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

[Figure: well-separated clusters in the unit square and the corresponding reordered similarity matrix; the clusters appear as sharp diagonal blocks of high similarity.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: DBSCAN clustering of random points and its reordered similarity matrix; the diagonal blocks are faint and noisy.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: K-means clustering of random points; the reordered similarity matrix again shows only weak block structure.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: complete-link clustering of random points and its reordered similarity matrix; again only weak block structure.]

Using Similarity Matrix for Cluster Validation

[Figure: a more complicated data set clustered into seven groups by DBSCAN, with its reordered similarity matrix.] Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE).

SSE can also be used to estimate the number of clusters, as in the sketch below.
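A sketch of this estimate with scikit-learn (assumed; KMeans exposes the within-cluster SSE as inertia_): compute SSE for a range of K and look for the knee.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=10, random_state=0)

for k in (2, 5, 10, 15, 20, 25, 30):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"K={k:2d}  SSE={sse:9.1f}")
# SSE always decreases as K grows; the knee in the curve (here near
# K = 10, the number of generated blobs) suggests the natural K.
```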

[Figure: the example data set and its K-means SSE-vs-K curve for K from 2 to 30; the knee in the curve suggests the natural number of clusters.]

Internal Measures: SSE

SSE curve for a more complicated data set. [Figure: the seven-cluster data set from before, with the SSE of clusters found using K-means.]

Framework for Cluster Validity

We need a framework to interpret any measure: if our evaluation measure has the value, say, 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary; the question there is whether the difference between two index values is significant.

Statistical Framework for SSE

Example: compare the SSE of a clustering of real data against the distribution of SSEs obtained by clustering many random data sets of the same size and range; see the sketch below.

[Figure: histogram of SSE values (roughly 0.016 to 0.034) for clusterings of random data, shown next to the well-separated three-cluster data set; its much lower SSE indicates non-random structure.]
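A sketch of such a comparison (NumPy and scikit-learn assumed; the sizes and counts are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def kmeans_sse(X, k=3):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

X, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.3, random_state=0)
real_sse = kmeans_sse(X)

# SSE distribution over random data with the same size and range.
rng = np.random.default_rng(0)
lo, hi = X.min(axis=0), X.max(axis=0)
random_sses = [kmeans_sse(rng.uniform(lo, hi, size=X.shape))
               for _ in range(200)]

print(f"real SSE = {real_sse:.2f}")
print(f"random SSE: min = {min(random_sses):.2f}, "
      f"mean = {np.mean(random_sses):.2f}")
# A real SSE far below anything seen on random data indicates that
# the cluster structure is not an artifact of the algorithm.
```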

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the well-separated data set (Corr = -0.9235) and the random data set (Corr = -0.5810).]

Internal Measures: Cohesion and Separation

Cluster cohesion: measures how closely related the objects in a cluster are.

Cluster separation: measures how distinct or well-separated a cluster is from other clusters.

Example: squared error.

Cohesion is measured by the within-cluster sum of squares (WSS, i.e., SSE):

$$\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$$

Separation is measured by the between-cluster sum of squares (BSS):

$$\mathrm{BSS} = \sum_{i} |C_i| \, (m - m_i)^2$$

where $m$ is the overall mean, $m_i$ is the mean of cluster $C_i$, and $|C_i|$ is its size.

Internal Measures: Cohesion and Separation

Example: SSE, for the points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster means m1 = 1.5, m2 = 4.5.

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 * (3-3)^2 = 0
Total = WSS + BSS = 10 + 0 = 10

K = 2 clusters, {1, 2} and {4, 5}:
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 * (3-1.5)^2 + 2 * (4.5-3)^2 = 9
Total = WSS + BSS = 1 + 9 = 10

WSS + BSS is constant; it equals the total sum of squares about the overall mean.
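A quick numeric check of this example (NumPy assumed):

```python
import numpy as np

def wss_bss(X, labels):
    """Within- and between-cluster sums of squares."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                       # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                 # cluster mean
        wss += ((Xc - mc) ** 2).sum()
        bss += len(Xc) * ((mc - m) ** 2).sum()
    return wss, bss

X = [[1.0], [2.0], [4.0], [5.0]]
print(wss_bss(X, np.array([0, 0, 0, 0])))    # (10.0, 0.0) -> K = 1
print(wss_bss(X, np.array([0, 0, 1, 1])))    # (1.0, 9.0)  -> K = 2
```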

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of links from nodes inside the cluster to nodes outside it.

[Figure: a proximity graph with within-cluster edges marked as cohesion and between-cluster edges marked as separation.]

Internal Measures: Silhouette Coefficient

The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to that cluster's points. Then

$$s = \frac{b - a}{\max(a, b)}$$

which typically lies between 0 and 1; the closer to 1, the better.

[Figure: a point with arrows illustrating a (within-cluster distances) and b (distances to the nearest other cluster).]

We can calculate the average silhouette width for a cluster or for a clustering.
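A sketch with scikit-learn (an assumed library choice): per-point silhouette values, their averages per cluster, and the overall average silhouette width.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)     # one coefficient per point i
for c in np.unique(labels):
    print(f"cluster {c}: mean silhouette = {s[labels == c].mean():.3f}")
print("average silhouette width:", silhouette_score(X, labels))
```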

External Measures of Cluster Validity: Entropy and Purity
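Both measures are computed from the cluster-by-class contingency table. A minimal sketch of the standard definitions (NumPy assumed; the counts are hypothetical):

```python
import numpy as np

def purity_entropy(contingency):
    """contingency[j, i] = number of points of class i in cluster j."""
    C = np.asarray(contingency, dtype=float)
    n = C.sum()
    sizes = C.sum(axis=1)                    # cluster sizes
    purity = C.max(axis=1).sum() / n         # majority-class fraction
    p = C / sizes[:, None]                   # class mix inside each cluster
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(C > 0, p * np.log2(p), 0.0)
    entropy = (sizes / n) @ (-plogp.sum(axis=1))   # size-weighted average
    return purity, entropy

# Hypothetical counts: 3 clusters x 2 classes.
table = [[45, 5], [10, 40], [25, 25]]
print(purity_entropy(table))   # purity ~ 0.733, entropy ~ 0.730
```

Higher purity and lower entropy indicate clusters that align better with the external class labels.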

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

(Algorithms for Clustering Data, Jain and Dubes)

Similarity matrix used in the MIN, MAX, and group-average examples:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
Page 54: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Similarity

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

Proximity Matrix

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 55: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 56: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 57: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

[Figure: two data sets, one with three well-separated clusters and one of uniformly random points]

Corr = -0.9235 (well-separated clusters) vs. Corr = -0.5810 (random points)
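A minimal sketch of this check; the three blobs are a stand-in for the clustered data set, and the correlation is negative here because proximity is stored as distance:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    prox = squareform(pdist(X))          # proximity (distance) matrix
    incidence = (labels[:, None] == labels[None, :]).astype(float)

    # The matrices are symmetric: correlate only the n(n-1)/2 distinct pairs.
    iu = np.triu_indices_from(prox, k=1)
    print(np.corrcoef(prox[iu], incidence[iu])[0, 1])   # strongly negative here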

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

[Figure: three well-separated clusters (left) and their similarity matrix with points ordered by cluster label (right); a crisp block-diagonal structure appears, similarity scale 0-1]
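A minimal sketch of producing such a plot, with distances rescaled to similarities and points reordered by cluster label; the blob data is illustrative:

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    dist = squareform(pdist(X))
    sim = 1 - dist / dist.max()             # rescale distances to similarities in [0, 1]
    order = np.argsort(labels)              # group the points by cluster label

    plt.imshow(sim[np.ix_(order, order)])   # block-diagonal if clusters are crisp
    plt.colorbar(label="Similarity")
    plt.xlabel("Points"); plt.ylabel("Points")
    plt.show()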

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: for the random points, the label-sorted similarity matrices produced by DBSCAN, K-means, and Complete Link; block structure appears but is much fuzzier than for well-separated clusters]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clusters (labeled 1-7) on a more complicated data set, with the corresponding label-sorted similarity matrix]

Clusters in more complicated figures aren't well separated.

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

[Figure: a data set with ten clusters and the K-means SSE-vs-K curve for K = 2 to 30; the SSE drops steeply up to about K = 10 and flattens afterward]
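A minimal sketch of estimating the number of clusters this way; a ten-blob data set stands in for the figure's, and KMeans's inertia_ attribute is the within-cluster SSE:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=10, random_state=0)

    ks = range(2, 31)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in ks]                   # inertia_ is the within-cluster SSE
    plt.plot(list(ks), sse, marker="o")   # look for the knee near the true K
    plt.xlabel("K"); plt.ylabel("SSE")
    plt.show()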

Internal Measures SSE

SSE curve for a more complicated data set

[Figure: the more complicated data set with clusters labeled 1-7; SSE of clusters found using K-means]

Need a framework to interpret any measure: if a measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Framework for Cluster Validity

Example: compare the SSE of a clustering of the given data against the distribution of SSEs obtained by clustering random data.

Statistical Framework for SSE

[Figure: histogram of the SSE values of three clusters found in repeated sets of random points (Count vs. SSE, roughly 0.016-0.034), alongside the original well-separated data set]
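A minimal sketch of the procedure behind such a histogram; the 500 trials, 100 points per trial, and the 0.2-0.8 range are assumptions mirroring the figure, not values stated in the text:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    sse_random = []
    for _ in range(500):                           # 500 sets of random points (assumed)
        R = rng.uniform(0.2, 0.8, size=(100, 2))   # 100 points, no real structure
        km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(R)
        sse_random.append(km.inertia_)

    # An observed SSE far below this distribution suggests non-random structure.
    print(np.percentile(sse_random, [1, 50, 99]))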

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

[Figure: the two data sets again, three well-separated clusters and uniformly random points]

Corr = -0.9235 (well-separated clusters) vs. Corr = -0.5810 (random points)

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: squared error.

Internal Measures Cohesion and Separation

Cohesion is measured by the within-cluster sum of squares (SSE):

\[ \mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2 \]

Separation is measured by the between-cluster sum of squares:

\[ \mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2 \]

where m_i is the centroid of cluster C_i, |C_i| is its size, and m is the overall mean.

Internal Measures Cohesion and Separation

Example: SSE

Data points at positions 1, 2, 4, 5 on a line (the x marks in the original figure are the centroids m1, m2, and the overall centroid m = 3).

K = 1 cluster:
WSS = (1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)² = 10
BSS = 4 x (3 - 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters, {1, 2} with centroid m1 = 1.5 and {4, 5} with centroid m2 = 4.5:
WSS = (1 - 1.5)² + (2 - 1.5)² + (4 - 4.5)² + (5 - 4.5)² = 1
BSS = 2 x (3 - 1.5)² + 2 x (4.5 - 3)² = 9
Total = 1 + 9 = 10

WSS + BSS is constant; only the split between cohesion and separation changes with K.
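A minimal sketch verifying the arithmetic above:

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 5.0])
    m = x.mean()                                   # overall centroid, 3.0

    def wss_bss(clusters):
        wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
        bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
        return wss, bss

    print(wss_bss([x]))                 # K=1: (10.0, 0.0)  -> total 10
    print(wss_bss([x[:2], x[2:]]))      # K=2: (1.0, 9.0)   -> total 10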

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of all links within a cluster, while separation is the sum of the weights of links between nodes in the cluster and nodes outside it.

Internal Measures Cohesion and Separation

[Figure: a proximity-graph view, with within-cluster edges illustrating cohesion and between-cluster edges illustrating separation]

Silhouette coefficient combines the ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to the points in that cluster. Then

s = (b - a) / max(a, b)

which is typically between 0 and 1; the closer to 1, the better.

Can calculate the average silhouette width for a cluster or a clustering.

Internal Measures Silhouette Coefficient
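A minimal sketch using scikit-learn's implementation of this coefficient; the blob data is illustrative:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_samples, silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    print(silhouette_samples(X, labels)[:5])   # per-point s = (b - a) / max(a, b)
    print(silhouette_score(X, labels))         # average silhouette width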

External Measures of Cluster Validity Entropy and Purity
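The slide's worked table is not reproduced here, but the two indices are easy to state in code; a minimal sketch, assuming ground-truth class labels are available (the small label lists at the end are hypothetical):

    import numpy as np
    from collections import Counter

    def purity(labels, classes):
        """Fraction of points that fall in their cluster's majority class."""
        n = len(classes)
        hits = 0
        for c in set(labels):
            members = [classes[i] for i in range(n) if labels[i] == c]
            hits += Counter(members).most_common(1)[0][1]
        return hits / n

    def entropy(labels, classes):
        """Size-weighted average of each cluster's class-distribution entropy."""
        n = len(classes)
        h = 0.0
        for c in set(labels):
            members = [classes[i] for i in range(n) if labels[i] == c]
            p = np.array(list(Counter(members).values())) / len(members)
            h -= len(members) / n * (p * np.log2(p)).sum()
        return h

    labels  = [0, 0, 0, 1, 1, 1]                  # cluster assignments (hypothetical)
    classes = ["a", "a", "b", "b", "b", "b"]      # true classes (hypothetical)
    print(purity(labels, classes), entropy(labels, classes))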

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis.

Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function …
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering – Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids …
  • Importance of Choosing Initial Centroids …
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Ward's Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 58: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5

Proximity Matrix

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function ndash Wardrsquos Method uses squared error

times times

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 59: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 60: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 61: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Sheet2

Sheet3

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

Consider four points at 1, 2, 4, 5 on a line, with overall centroid $m = 3$.

K = 1 cluster:
$$WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$$
$$BSS = 4 \times (3-3)^2 = 0$$
$$Total = WSS + BSS = 10 + 0 = 10$$

K = 2 clusters, $\{1, 2\}$ with centroid $m_1 = 1.5$ and $\{4, 5\}$ with centroid $m_2 = 4.5$:
$$WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$$
$$BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$$
$$Total = WSS + BSS = 1 + 9 = 10$$

Note that $WSS + BSS$ equals the same total (the total sum of squares) for either clustering.
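The bookkeeping in this example is easy to verify with a few lines of Python:

```python
# Sketch: WSS and BSS for the points 1, 2, 4, 5; their sum is 10 for any K.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()  # overall centroid, m = 3

def wss_bss(clusters):
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))              # K=1: (10.0, 0.0)
print(wss_bss([x[:2], x[2:]]))   # K=2: (1.0, 9.0)
```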

A proximity-graph-based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside it.

Internal Measures Cohesion and Separation

[Figure: a proximity graph illustrating cohesion (edges within a cluster) and separation (edges between clusters).]

Silhouette Coefficient: combines the ideas of cohesion and separation, but for individual points as well as for clusters and clusterings.

For an individual point i:
- a = average distance of i to the points in its own cluster
- b = minimum, over the other clusters, of the average distance of i to the points in that cluster
- silhouette coefficient $s = (b - a) / \max(a, b)$, typically between 0 and 1; the closer to 1, the better

Can calculate the average silhouette width for a cluster or a clustering.

Internal Measures Silhouette Coefficient
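A sketch using scikit-learn's silhouette utilities (the three-cluster data set is a stand-in): per-point coefficients, per-cluster averages, and the average silhouette width of the whole clustering.

```python
# Sketch: silhouette coefficients at the point, cluster, and clustering level.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)                  # one coefficient per point
print([s[labels == c].mean() for c in range(3)])   # average width per cluster
print(silhouette_score(X, labels))                 # average width of the clustering
```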

External Measures of Cluster Validity Entropy and Purity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function …
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering – Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids …
  • Importance of Choosing Initial Centroids …
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Ward's Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Similarity matrix used in the hierarchical clustering examples:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 63: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Hierarchical Clustering MIN

Nested Clusters Dendrogram

1

2 3

4

5

6

1 2

3

4

5

3 6 2 5 4 10

005

01

015

02

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 64: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Strength of MIN

Original Points Two Clusters

bull Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 65: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Limitations of MIN

Original Points Two Clusters

bull Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 66: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
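As an illustration (a minimal sketch, not part of the original slides), the same complete-link clustering can be reproduced with SciPy by treating distance as 1 - similarity:

    # Sketch: complete-link (MAX) clustering of the five points above.
    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    sim = np.array([
        [1.00, 0.90, 0.10, 0.65, 0.20],
        [0.90, 1.00, 0.70, 0.60, 0.50],
        [0.10, 0.70, 1.00, 0.40, 0.30],
        [0.65, 0.60, 0.40, 1.00, 0.80],
        [0.20, 0.50, 0.30, 0.80, 1.00],
    ])
    dist = 1.0 - sim                                  # similarity -> distance
    Z = linkage(squareform(dist), method='complete')  # MAX / complete linkage
    print(Z)  # each row: the two clusters merged, merge height, new size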


Hierarchical Clustering MAX

[Figure: nested clusters and dendrogram for complete-link (MAX) clustering of points 1–6; dendrogram leaf order 3, 6, 4, 1, 2, 5, with merge heights up to about 0.4]

Strength of MAX

[Figure: original points and the two clusters found by complete link]

• Less susceptible to noise and outliers

Limitations of MAX

[Figure: original points and the two clusters found by complete link]

• Tends to break large clusters
• Biased towards globular clusters

Cluster Similarity Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

proximity(Cluster_i, Cluster_j) = ( Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) ) / ( |Cluster_i| × |Cluster_j| )

For example, with the similarity matrix above, proximity({I1, I2}, {I4, I5}) = (0.65 + 0.20 + 0.60 + 0.50) / (2 × 2) = 0.4875.

Need to use average connectivity for scalability, since total proximity favors large clusters


Hierarchical Clustering Group Average

[Figure: nested clusters and dendrogram for group-average clustering of points 1–6; dendrogram leaf order 3, 6, 4, 1, 2, 5, with merge heights up to about 0.25]

Hierarchical Clustering Group Average

Compromise between single and complete link

Strengths: less susceptible to noise and outliers

Limitations: biased towards globular clusters

Cluster Similarity Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means
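In SciPy, the four schemes discussed here differ only in the method argument; a small sketch (the six sample points are made up for illustration):

    # Sketch: one call covers MIN, MAX, group average, and Ward's method.
    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)
    X = rng.random((6, 2))                   # six assumed 2-D points
    for method in ('single', 'complete', 'average', 'ward'):
        Z = linkage(X, method=method)        # Euclidean distances by default
        print(method, Z[-1, 2])              # height of the final merge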

Hierarchical Clustering Comparison

[Figure: the same six points clustered by MIN, MAX, Group Average, and Ward's Method, shown side by side with matching labels]

Hierarchical Clustering Time and Space requirements

O(N²) space, since it uses the proximity matrix (N is the number of points)

O(N³) time in many cases: there are N steps, and at each step the size-N² proximity matrix must be searched and updated

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following: sensitivity to noise and outliers, difficulty handling different-sized clusters and non-globular shapes, and a tendency to break large clusters

MST Divisive Hierarchical Clustering: Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering: Use MST for constructing hierarchy of clusters
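A minimal sketch of the divisive idea (assumed random points and target cluster count, not from the slides): build the MST, then cut the longest edges so that the remaining connected components are the clusters.

    # Sketch: MST-based divisive clustering into k components.
    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(0)
    X = rng.random((30, 2))                    # 30 assumed 2-D points
    mst = minimum_spanning_tree(squareform(pdist(X))).toarray()

    k = 3                                      # assumed number of clusters
    i, j = np.nonzero(mst)                     # the 29 MST edges
    order = np.argsort(mst[i, j])[::-1]        # edges by decreasing length
    for e in order[:k - 1]:                    # cut the k-1 longest edges
        mst[i[e], j[e]] = 0.0
    n, labels = connected_components(mst, directed=False)
    print(n, labels)                           # k components = k clusters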

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder." Then why do we want to evaluate them?

Clusters found in Random Data

[Figure: four panels over uniformly random 2-D points: Random Points, plus the clusters found by DBSCAN, K-means, and Complete Link]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. Uses only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index

Sometimes these are referred to as criteria instead of indices

Measuring Cluster Validity Via Correlation

Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column per data point, whose entry is 1 if the corresponding pair of points belongs to the same cluster and 0 otherwise

Compute the correlation between the two matrices; since both are symmetric, only n(n-1)/2 entries need to be compared

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density- or contiguity-based clusters

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: a data set with well-separated clusters (Corr = -0.9235) and a uniformly random data set (Corr = -0.5810)]
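A sketch of this computation (illustrative; the data are generated with scikit-learn rather than taken from the slides):

    # Sketch: correlation between proximity and incidence matrices.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    prox = squareform(pdist(X))                # proximity (distance) matrix
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices_from(prox, k=1)       # symmetric: upper triangle only
    print(np.corrcoef(prox[iu], incidence[iu])[0, 1])  # strongly negative here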

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually

[Figure: a scatter plot of three well-separated clusters and the corresponding 100×100 similarity matrix sorted by cluster label; three crisp blocks appear on the diagonal]
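A sketch of the sorted-matrix visualization (assumed random data and labels; for well-separated data the diagonal blocks become crisp):

    # Sketch: display a similarity matrix ordered by cluster label.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(0)
    X = rng.random((100, 2))                   # 100 random 2-D points
    labels = rng.integers(0, 3, size=100)      # assumed cluster labels

    d = squareform(pdist(X))
    sim = 1.0 - d / d.max()                    # similarity in [0, 1]
    order = np.argsort(labels, kind='stable')  # group points by cluster
    plt.imshow(sim[np.ix_(order, order)], vmin=0, vmax=1)
    plt.colorbar(label='Similarity')
    plt.xlabel('Points'); plt.ylabel('Points')
    plt.show()                                 # look for diagonal blocks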

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: random points clustered by DBSCAN, with the similarity matrix sorted by cluster label; the diagonal blocks are much fuzzier]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: the same random points clustered by K-means, with the similarity matrix sorted by cluster label]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: the same random points clustered by complete link, with the similarity matrix sorted by cluster label]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complicated data set with seven labeled clusters, and its similarity matrix (about 3000 points) sorted by cluster label]

Clusters in more complicated figures aren't well separated

Internal Measures SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

[Figure: the ten-cluster example data and the SSE-vs-K curve for K = 2 to 30; the knee near K = 10 suggests the natural number of clusters]
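A sketch of this elbow heuristic (synthetic ten-cluster data for illustration, not the slides' data):

    # Sketch: estimate the number of clusters from the SSE curve.
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=10, cluster_std=0.5, random_state=0)

    ks = range(2, 16)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in ks]                        # inertia_ is the within-cluster SSE
    plt.plot(ks, sse, marker='o')
    plt.xlabel('K'); plt.ylabel('SSE')
    plt.show()                                 # the knee suggests K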

Internal Measures SSE

SSE curve for a more complicated data set

[Figure: a seven-cluster data set and the SSE of clusters found using K-means]

Framework for Cluster Validity

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses, a framework is less necessary

Statistical Framework for SSE

Example: compare the SSE of a clustering against the distribution of SSE values obtained by clustering random data

[Figure: histogram of SSE values (x-axis SSE, roughly 0.016 to 0.034; y-axis Count) for clusterings of random points in the unit square]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the well-separated data set (Corr = -0.9235) and the random data set (Corr = -0.5810)]

Internal Measures Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters

Example: Squared Error

Cohesion is measured by the within-cluster sum of squares: WSS = Σ_i Σ_{x ∈ C_i} (x - m_i)²

Separation is measured by the between-cluster sum of squares: BSS = Σ_i |C_i| (m - m_i)²

(here m is the overall mean and m_i is the mean of cluster C_i)

Internal Measures Cohesion and Separation

Example: SSE for four points at 1, 2, 4, and 5 on a line. The overall mean is m = 3; splitting into {1, 2} and {4, 5} gives cluster means m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)² = 10
BSS = 4 × (3 - 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters:
WSS = (1 - 1.5)² + (2 - 1.5)² + (4 - 4.5)² + (5 - 4.5)² = 1
BSS = 2 × (3 - 1.5)² + 2 × (4.5 - 3)² = 9
Total = 1 + 9 = 10

Note that WSS + BSS is constant, so cohesion and separation trade off against each other.
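A few lines of Python to check these numbers (a sketch, not part of the slides):

    # Sketch: WSS/BSS for the four 1-D points above; the total is constant.
    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 5.0])
    m = x.mean()                               # overall mean = 3

    def wss_bss(clusters):
        wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
        bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
        return wss, bss

    print(wss_bss([x]))                        # K=1: (10.0, 0.0)
    print(wss_bss([x[:2], x[2:]]))             # K=2: (1.0, 9.0)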

Internal Measures Cohesion and Separation

A proximity graph based approach can also be used for cohesion and separation

[Figure: proximity-graph illustration; cohesion corresponds to edge weights within a cluster, separation to edge weights between clusters]

Internal Measures Silhouette Coefficient

Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings

For an individual point i:
o a = average distance of i to the points in its own cluster
o b = minimum, over the other clusters, of the average distance of i to the points in that cluster
o s = (b - a) / max(a, b), typically between 0 and 1; the closer to 1, the better

Can calculate the average silhouette width for a cluster or a clustering
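A sketch with scikit-learn (synthetic data; illustrative only):

    # Sketch: average and per-point silhouette values.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_samples, silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    print(silhouette_score(X, labels))         # average width for the clustering
    print(silhouette_samples(X, labels)[:5])   # per-point coefficients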

External Measures of Cluster Validity Entropy and Purity

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 67: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Sheet1

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 68: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Sheet2

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 69: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Sheet3

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 70: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Hierarchical Clustering MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

03

035

04

1

2

3

4

5

6 1

2 5

3

4

Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error

Internal Measures Cohesion and Separation

Cohesion is measured by the within-cluster sum of squares (SSE):

$$\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$$

Separation is measured by the between-cluster sum of squares:

$$\mathrm{BSS} = \sum_{i} |C_i| \, (m - m_i)^2$$

where $m$ is the overall mean of the data, $m_i$ is the mean of cluster $C_i$, and $|C_i|$ is the number of points in it.

Internal Measures Cohesion and Separation

Example SSE

Four points 1, 2, 4, 5 on a line; the overall centroid is m = 3, and splitting into {1, 2} and {4, 5} gives cluster centroids m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 x (3-3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters:
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 x (3-1.5)^2 + 2 x (4.5-3)^2 = 9
Total = 1 + 9 = 10

Note that WSS + BSS is the same for both clusterings: the total sum of squares is constant, so lower cohesion error is compensated by higher separation.
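The arithmetic above can be checked with a few lines of Python (NumPy assumed; the points and the two-cluster split come from the example):

```python
# Hedged sketch: WSS, BSS, and their constant total for the 1-D example.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                        # overall centroid, 3.0

def wss_bss(clusters):
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))                                 # K=1: (10.0, 0.0)
print(wss_bss([x[:2], x[2:]]))                      # K=2: (1.0, 9.0)
# WSS + BSS = 10 in both cases: the total sum of squares is constant.
```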

A proximity graph based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of links between the cluster and other clusters.

Internal Measures Cohesion and Separation

[Figure: two clusters in a proximity graph, with within-cluster edges marked "cohesion" and between-cluster edges marked "separation".]

Silhouette coefficient: combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point i:

o Calculate a = the average distance of i to the points in its own cluster.

o Calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster.

o The silhouette coefficient of i is s = (b - a) / max(a, b); it typically lies between 0 and 1, and the closer to 1 the better.

Can calculate the average silhouette width for a cluster or a clustering.

Internal Measures Silhouette Coefficient
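scikit-learn provides this measure directly; a short sketch (toy data assumed):

```python
# Hedged sketch: silhouette per point and average silhouette for a clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)     # s_i = (b_i - a_i) / max(a_i, b_i)
print(s[:5])                          # per-point coefficients
print(silhouette_score(X, labels))    # average silhouette width; closer to 1 is better
```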

External Measures of Cluster Validity Entropy and Purity
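Only the slide title survives here; as an illustrative sketch (the label arrays are made up), purity and entropy can both be computed from the cluster-versus-class counts:

```python
# Hedged sketch: purity and entropy of a clustering against true class labels.
import numpy as np

def purity_entropy(true_labels, cluster_labels):
    n, purity, entropy = len(true_labels), 0.0, 0.0
    for c in np.unique(cluster_labels):
        counts = np.bincount(true_labels[cluster_labels == c])
        p = counts / counts.sum()              # class distribution in cluster c
        purity += counts.max() / n             # weighted majority-class fraction
        nz = p[p > 0]
        entropy += (counts.sum() / n) * -(nz * np.log2(nz)).sum()
    return purity, entropy

true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # hypothetical class labels
pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0])   # hypothetical cluster labels
print(purity_entropy(true, pred))   # higher purity / lower entropy is better
```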

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity


Strength of MAX

Original Points Two Clusters

bull Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 72: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Limitations of MAX

Original Points Two Clusters

bullTends to break large clusters

bullBiased towards globular clusters

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 73: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters

Need to use average connectivity for scalability since total proximity favors large clusters

||Cluster||Cluster

)ppproximity(

)ClusterClusterproximity(ji

ClusterpClusterp

ji

jijjii

lowast=

sumisinisin

I1 I2 I3 I4 I5I1 100 090 010 065 020I2 090 100 070 060 050I3 010 070 100 040 030I4 065 060 040 100 080I5 020 050 030 080 100 1 2 3 4 5

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 74: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Sheet1

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
I1 I2 I3 I4 I5
I1 100 090 010 065 020
I2 090 100 070 060 050
I3 010 070 100 040 030
I4 065 060 040 100 080
I5 020 050 030 080 100
Page 75: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Sheet2

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: for the random data set, the sorted similarity matrices of the clusterings produced by DBSCAN, K-means, and complete link all show only weak, fuzzy block structure along the diagonal]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complicated data set with seven labeled clusters (1-7), and its sorted similarity matrix]

Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE).

Can also be used to estimate the number of clusters, as sketched after the figure below.

[Figure: an example data set and its SSE-vs-K curve for K from 2 to 30; the sharp knee in the curve suggests the natural number of clusters]
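A sketch of estimating the number of clusters from the SSE curve, assuming scikit-learn is available (K-means' `inertia_` attribute is exactly the SSE of the fitted clustering); the three-cluster toy data is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2) * 0.2 + c
               for c in [(0, 0), (4, 0), (2, 3)]])   # 3 true clusters

for k in range(1, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"K={k}  SSE={sse:.2f}")   # look for the knee (here at K=3)
```

SSE always decreases as K grows, so the point of diminishing returns, not the minimum, is what matters.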

Internal Measures: SSE

SSE curve for a more complicated data set:

[Figure: the seven-cluster data set (clusters 1-7) from above and the SSE of the clusters found using K-means]

Framework for Cluster Validity

Need a framework to interpret any measure: if our evaluation measure has the value 10, is that good, fair, or poor?

Statistics provides a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Statistical Framework for SSE

Example: compare the SSE of a clustering against the SSEs obtained by clustering random data.

[Figure: histogram (Count vs. SSE, with SSE ranging roughly from 0.016 to 0.034) of the SSE of clusterings of many random data sets, next to the clustered data set under test; an observed SSE well below this range is unlikely to arise by chance]
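The histogram can be reproduced by clustering many random data sets and recording each SSE, then comparing the observed SSE against this null distribution. A sketch assuming scikit-learn; the trial count, data size, range, and the observed value are all illustrative assumptions, not taken from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
null_sse = [KMeans(n_clusters=3, n_init=5, random_state=0)
            .fit(rng.uniform(0.2, 0.8, (100, 2))).inertia_
            for _ in range(500)]             # SSE of 3-means on random data

observed = 0.005                             # illustrative SSE of the real clustering
p = np.mean(np.array(null_sse) <= observed)  # fraction of random runs this good
print(f"null SSE range: [{min(null_sse):.4f}, {max(null_sse):.4f}], p = {p:.3f}")
```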

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the same two data sets as before:

[Figure: the clustered data set, Corr = -0.9235, versus the random data set, Corr = -0.5810]

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error. Cohesion is measured by the within-cluster sum of squares (WSS, i.e., SSE) and separation by the between-cluster sum of squares (BSS):

$$\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$$

$$\mathrm{BSS} = \sum_{i} |C_i|\,(m - m_i)^2$$

where $m_i$ is the centroid of cluster $C_i$, $|C_i|$ its size, and $m$ the overall mean.

Internal Measures: Cohesion and Separation

Example: SSE for the one-dimensional data set {1, 2, 4, 5} with overall mean $m = 3$.

[Figure: number line from 1 to 5 with the four points marked, the overall mean m, and the two-cluster centroids m1 = 1.5 and m2 = 4.5]

K = 1 cluster:
$$\mathrm{WSS} = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$$
$$\mathrm{BSS} = 4 \times (3-3)^2 = 0$$
$$\mathrm{Total} = 10 + 0 = 10$$

K = 2 clusters, {1, 2} and {4, 5}:
$$\mathrm{WSS} = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$$
$$\mathrm{BSS} = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$$
$$\mathrm{Total} = 1 + 9 = 10$$

WSS + BSS is constant, so minimizing cohesion (WSS) is equivalent to maximizing separation (BSS).
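The arithmetic above can be checked directly. A short NumPy sketch implementing the WSS/BSS formulas for the 1-D data {1, 2, 4, 5}:

```python
import numpy as np

def wss_bss(x, labels):
    """Within- and between-cluster sums of squares (1-D data)."""
    m = x.mean()
    wss = bss = 0.0
    for c in np.unique(labels):
        xc = x[labels == c]
        wss += ((xc - xc.mean()) ** 2).sum()      # cohesion
        bss += len(xc) * (m - xc.mean()) ** 2     # separation
    return wss, bss

x = np.array([1.0, 2.0, 4.0, 5.0])
print(wss_bss(x, np.array([0, 0, 0, 0])))  # K=1: (10.0, 0.0)
print(wss_bss(x, np.array([0, 0, 1, 1])))  # K=2: (1.0, 9.0); total is 10 either way
```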

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of the links within a cluster, and separation is the sum of the weights of the links between the cluster and points outside it.

[Figure: a proximity graph with within-cluster edges labeled "cohesion" and between-cluster edges labeled "separation"]

Internal Measures: Silhouette Coefficient

The Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings.

For an individual point $i$: let $a$ be the average distance of $i$ to the points in its own cluster, and $b$ the minimum, over the other clusters, of the average distance of $i$ to that cluster's points. Then

$$s_i = \frac{b - a}{\max(a, b)}$$

which is typically between 0 and 1; the closer to 1, the better.

Can calculate the average silhouette width for a cluster or a clustering.
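A sketch using scikit-learn, which implements exactly this definition: `silhouette_samples` gives the per-point values $s_i$ and `silhouette_score` the average width over the whole clustering; the toy data is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.vstack([np.random.randn(40, 2) * 0.3 + c for c in [(0, 0), (4, 4)]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)   # one s_i per point
print("worst point:", s.min(), "average width:", silhouette_score(X, labels))
```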

External Measures of Cluster Validity: Entropy and Purity

For a single cluster, purity is the fraction of its points that belong to the majority class, and entropy measures how mixed the class labels within it are; both are averaged over all clusters, weighted by cluster size.
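A sketch of both external measures computed from a contingency (cluster x class) table, assuming NumPy; the helper name and the example table are illustrative:

```python
import numpy as np

def purity_entropy(contingency):
    """Size-weighted purity and entropy of a clustering vs. class labels.
    Rows of `contingency` are clusters, columns are true classes."""
    n = contingency.sum()
    sizes = contingency.sum(axis=1)                 # points per cluster
    p = contingency / sizes[:, None]                # class mix per cluster
    purity = (sizes / n * p.max(axis=1)).sum()      # majority-class fraction
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -np.where(p > 0, p * np.log2(p), 0.0).sum(axis=1)
    entropy = (sizes / n * h).sum()
    return purity, entropy

# 2 clusters vs. 2 classes: cluster 0 is pure, cluster 1 is mixed.
C = np.array([[10, 0],
              [ 3, 7]])
print(purity_entropy(C))   # (0.85, ~0.44): high purity, low entropy is good
```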

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 76: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Sheet3

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 77: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Hierarchical Clustering Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

005

01

015

02

025

1

2

3

4

5

6 1

2

5

3

4

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 78: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths

Limitations

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 79: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Cluster Similarity Wardrsquos Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 80: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Hierarchical Clustering Comparison

Group Average

Wardrsquos Method

1

2 3

4

5

6 1

2

5

3

4

MIN MAX

1

2 3

4

5

6 1

2

5

3 4

1

2 3

4

5

6 1

2 5

3

4 1

2 3

4

5

6 1

2

3

4

5

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

[Figure: a data set with ten well-separated clusters (x roughly 5 to 15, y roughly -6 to 6) and its SSE-vs-K curve for K = 2 to 30; SSE drops sharply up to the natural number of clusters and flattens afterwards.]
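A minimal sketch of producing such an SSE-vs-K curve (assuming scikit-learn; make_blobs stands in for the ten-cluster data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

for k in range(2, 31):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # inertia_ is the within-cluster SSE; look for the elbow
```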

Internal Measures: SSE

SSE curve for a more complicated data set.

[Figure: the seven-cluster data set (clusters labeled 1-7) and the SSE of clusters found using K-means.]

Framework for Cluster Validity

Need a framework to interpret any measure: if a validity measure evaluates to, say, 10, is that good or poor on its own? Statistics provide a framework for cluster validity: compare the measured value to its distribution over random data. For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Statistical Framework for SSE

Example: compare the SSE of a proposed clustering against the histogram of SSE values obtained by repeatedly clustering random data.

[Figure: histogram of SSE values (roughly 0.016 to 0.034, counts up to 50) for clusterings of random points, next to a scatter plot of the clustered points.]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the well-separated data set, Corr = -0.9235, and the random data set, Corr = -0.5810.]

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are. Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: squared error.

Cohesion is measured by the within-cluster sum of squares (SSE):

WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares:

BSS = \sum_i |C_i| (m - m_i)^2

where m is the overall mean, m_i is the centroid of cluster C_i, and |C_i| is the number of points in it.

Internal Measures: Cohesion and Separation

Example: SSE

Points 1, 2, 4, 5 on a line, with overall centroid m = 3; for K = 2 the clusters are {1, 2} with centroid m1 = 1.5 and {4, 5} with centroid m2 = 4.5.

K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 \times (3 - 3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters:
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 \times (3 - 1.5)^2 + 2 \times (4.5 - 3)^2 = 9
Total = 1 + 9 = 10

In both cases WSS + BSS equals the same total.
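A small sketch (NumPy only) that reproduces these numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                          # overall centroid, 3.0

def wss_bss(clusters):
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))              # K = 1: (10.0, 0.0)
print(wss_bss([x[:2], x[2:]]))   # K = 2: (1.0, 9.0); WSS + BSS = 10 in both cases
```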

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of the links within a cluster, and separation is the sum of the weights of the links from nodes in the cluster to nodes outside it.

[Figure: proximity graph illustrating cohesion (within-cluster edges) and separation (between-cluster edges).]

Internal Measures: Silhouette Coefficient

The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings.

For an individual point i:
- a = average distance of i to the points in its own cluster
- b = minimum, over the other clusters, of the average distance of i to the points in that cluster
- s = (b - a) / max(a, b), typically between 0 and 1; the closer to 1, the better

Can calculate the average silhouette width for a cluster or a clustering, as in the sketch below.
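A minimal sketch (assuming scikit-learn, whose silhouette functions implement this per-point and average computation):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)    # one coefficient per point
print(s.min(), s.max())
print(silhouette_score(X, labels))   # average silhouette width of the clustering
```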

External Measures of Cluster Validity: Entropy and Purity

Both are computed from the class distribution within each cluster: the purity of a cluster is the fraction of its points belonging to the majority class, and its entropy is the entropy of that class distribution; cluster-level values are combined into a size-weighted average.
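A small sketch (NumPy only; the helper name is illustrative) of both measures, weighting each cluster by its size:

```python
import numpy as np

def purity_and_entropy(clusters, classes):
    purity, entropy, n = 0.0, 0.0, len(classes)
    for c in np.unique(clusters):
        cls = classes[clusters == c]
        p = np.bincount(cls) / len(cls)     # class distribution inside cluster c
        p = p[p > 0]                        # drop empty classes before log
        purity  += len(cls) / n * p.max()
        entropy += len(cls) / n * -(p * np.log2(p)).sum()
    return purity, entropy

clusters = np.array([0, 0, 0, 1, 1, 1])     # cluster labels
classes  = np.array([0, 0, 1, 1, 1, 1])     # externally given class labels
print(purity_and_entropy(clusters, classes))
```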

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 81: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Hierarchical Clustering Time and Space requirements

O(N2) space since it uses the proximity matrix

O(N3) time in many cases

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 82: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Hierarchical Clustering Problems and Limitations

Once a decision is made to combine two clusters it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 83: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

MST Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 84: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 85: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is

For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them

Clusters found in Random Data

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Random Points

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Complete Link

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

Clusters found in Random Data

[Figure: four plots of the same points drawn uniformly at random (x vs. y, both on [0, 1]): the raw Random Points, and the clusterings imposed on them by K-means, DBSCAN, and Complete Link.]

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (using only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

- External Index
- Internal Index
- Relative Index

Sometimes these are referred to as criteria instead of indices.

Measuring Cluster Validity Via Correlation

Two matrices: the proximity matrix, and the incidence matrix (one row and one column per data point, with an entry of 1 if the pair of points belongs to the same cluster and 0 otherwise).

Compute the correlation between the two matrices. High correlation indicates that points that belong to the same cluster are close to each other. This is not a good measure for some density- or contiguity-based clusters.

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two scatter plots (x vs. y on [0, 1]); Corr = -0.9235 for the first data set and Corr = -0.5810 for the second.]
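To make the procedure concrete, the sketch below (not from the slides; it assumes NumPy, SciPy, and scikit-learn are available) computes this correlation for a K-means clustering. Because the proximity matrix here holds distances rather than similarities, a good clustering yields a strongly negative correlation, as in the figure.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.stats import pearsonr
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Illustrative data: three well-separated 2-D clusters.
    X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Proximity matrix: pairwise Euclidean distances.
    prox = squareform(pdist(X))

    # Incidence matrix: entry (i, j) is 1 iff points i and j share a cluster.
    inc = (labels[:, None] == labels[None, :]).astype(float)

    # Both matrices are symmetric, so correlate the n(n-1)/2 upper entries.
    iu = np.triu_indices_from(prox, k=1)
    corr, _ = pearsonr(prox[iu], inc[iu])
    print(f"correlation = {corr:.4f}")   # strongly negative for a good clustering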

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: a scatter plot of the data (x vs. y on [0, 1]) and its 100x100 point-to-point similarity matrix ordered by cluster label, with a similarity colorbar from 0 to 1; well-separated clusters show up as sharp blocks along the diagonal.]
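A minimal sketch of this visual check (assuming NumPy, SciPy, scikit-learn, and matplotlib; the data set is illustrative): sort the points by their cluster label, turn pairwise distances into similarities, and display the matrix as an image.

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Reorder the points so that members of the same cluster are adjacent.
    order = np.argsort(labels)
    dist = squareform(pdist(X[order]))

    # Map distances to [0, 1] similarities; well-separated clusters then
    # show up as bright blocks along the diagonal.
    sim = 1.0 - dist / dist.max()

    plt.imshow(sim, cmap="jet")
    plt.colorbar(label="Similarity")
    plt.xlabel("Points")
    plt.ylabel("Points")
    plt.show()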

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp:

[Figure: DBSCAN clustering of 100 random points (x vs. y on [0, 1]) and the corresponding similarity matrix ordered by cluster label; the diagonal blocks are much fuzzier.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp:

[Figure: the same comparison for K-means on the random points.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp:

[Figure: the same comparison for Complete Link on the random points.]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complicated data set, with clusters labeled 1-7, and the corresponding similarity matrix (about 3000x3000) ordered by cluster label.]

Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal Index: used to measure the goodness of a clustering structure without respect to external information. SSE is good for comparing two clusterings or two clusters (average SSE), and can also be used to estimate the number of clusters.

[Figure: a data set with ten natural clusters (x from 0 to 15, y from -6 to 6) and its SSE-versus-K curve for K from 2 to 30; the SSE drops steeply and then levels off around the true number of clusters.]
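A sketch of this "elbow" estimate (assuming scikit-learn, whose KMeans exposes the final SSE as inertia_; the ten-cluster data set is illustrative):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Data with ten natural clusters, mirroring the figure above.
    X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

    ks = list(range(2, 31))
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in ks]

    plt.plot(ks, sse, marker="o")
    plt.xlabel("K")
    plt.ylabel("SSE")
    plt.show()   # the curve levels off near the natural number of clusters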

Internal Measures: SSE

SSE curve for a more complicated data set:

[Figure: a data set with clusters labeled 1-7, and the SSE of clusters found using K-means.]

Framework for Cluster Validity

We need a framework to interpret any measure. Statistics provide such a framework for cluster validity. For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Statistical Framework for SSE

Example:

[Figure: histogram (Count from 0 to 50) of SSE values, ranging from about 0.016 to 0.034, obtained by clustering sets of random points, alongside a scatter plot of the data (x vs. y on [0, 1]).]
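A sketch of the idea (the 500 trials, 100 points, and K = 3 below are illustrative choices, not taken from the slide; assumes NumPy and scikit-learn): cluster many sets of purely random points, histogram the resulting SSE values, and ask whether the SSE observed on the real data could plausibly come from that distribution.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Reference distribution: SSE of K-means (K=3) on random uniform data.
    sse_random = []
    for _ in range(500):
        X = rng.random((100, 2))     # 100 random points in the unit square
        sse_random.append(KMeans(n_clusters=3, n_init=10).fit(X).inertia_)

    plt.hist(sse_random, bins=20)
    plt.xlabel("SSE")
    plt.ylabel("Count")
    plt.show()
    # An SSE on real data far below the range of this histogram is unlikely
    # to be an artifact of clustering random noise.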

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the two scatter plots again (x vs. y on [0, 1]); Corr = -0.9235 and Corr = -0.5810.]

Internal Measures: Cohesion and Separation

Cluster Cohesion: measures how closely related the objects in a cluster are.

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.

Example: Squared Error.

Cohesion is measured by the within-cluster sum of squares (WSS):

    WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares (BSS):

    BSS = \sum_{i} |C_i| (m - m_i)^2

where m_i is the centroid of cluster C_i, m is the overall mean, and |C_i| is the number of points in C_i.

Internal Measures: Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a line, with overall mean m = 3.

K = 1 cluster:

    WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
    BSS = 4 × (3-3)^2 = 0
    Total = 10 + 0 = 10

K = 2 clusters, {1, 2} with centroid m_1 = 1.5 and {4, 5} with centroid m_2 = 4.5:

    WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
    BSS = 2 × (3-1.5)^2 + 2 × (4.5-3)^2 = 9
    Total = 1 + 9 = 10

The total (WSS + BSS) is the same in both cases: cohesion and separation trade off against each other.
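The arithmetic above is easy to verify in a few lines (a plain-NumPy sketch):

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 5.0])
    m = x.mean()                                  # overall mean: 3.0

    def wss_bss(clusters):
        """Within- and between-cluster sums of squares for a 1-D partition."""
        wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
        bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
        return wss, bss

    print(wss_bss([x]))                           # K=1: (10.0, 0.0)
    print(wss_bss([x[:2], x[2:]]))                # K=2: (1.0, 9.0)
    # In both cases WSS + BSS = 10, the total sum of squares.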

Internal Measures: Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation:

[Figure: a proximity graph, with cohesion illustrated by the edges within a cluster and separation by the edges between clusters.]

Internal Measures: Silhouette Coefficient

The silhouette coefficient combines the ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings.

For an individual point i, let a be the average distance of i to the points in its own cluster, and b the smallest average distance of i to the points of any other cluster; the coefficient is then typically defined as s = (b - a) / max(a, b).

The average silhouette width can then be calculated for a cluster or for a clustering.
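In practice the coefficient is usually obtained from a library; a minimal sketch assuming scikit-learn (the data set is illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_samples, silhouette_score

    X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    s = silhouette_samples(X, labels)        # one coefficient per point
    print("point 0:", s[0])
    for k in range(3):                       # average silhouette width per cluster
        print(f"cluster {k}: {s[labels == k].mean():.3f}")
    print("clustering:", silhouette_score(X, labels))   # overall average width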

External Measures of Cluster Validity: Entropy and Purity
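As a brief illustration of the two measures named here (a sketch using their standard definitions, assuming NumPy; the tiny label arrays are made up): purity is the size-weighted fraction of points that fall in their cluster's majority class, and entropy is the size-weighted average of the class-label entropy within each cluster.

    import numpy as np

    def contingency(cluster_labels, class_labels):
        """Counts n[k, j] of points in cluster k that carry class label j."""
        ks, cs = np.unique(cluster_labels), np.unique(class_labels)
        n = np.zeros((len(ks), len(cs)))
        for i, k in enumerate(ks):
            for j, c in enumerate(cs):
                n[i, j] = np.sum((cluster_labels == k) & (class_labels == c))
        return n

    def purity(n):
        # Fraction of points that belong to their cluster's majority class.
        return n.max(axis=1).sum() / n.sum()

    def entropy(n):
        # Size-weighted average entropy of the class distribution per cluster.
        p = n / n.sum(axis=1, keepdims=True)
        logp = np.log2(p, out=np.zeros_like(p), where=p > 0)
        h = -(p * logp).sum(axis=1)
        return (n.sum(axis=1) / n.sum()) @ h

    clusters = np.array([0, 0, 0, 1, 1, 1])      # hypothetical cluster labels
    classes = np.array([0, 0, 1, 1, 1, 1])       # hypothetical class labels
    n = contingency(clusters, classes)
    print(purity(n), entropy(n))    # higher purity and lower entropy are better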

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

-- Algorithms for Clustering Data, Jain and Dubes

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 87: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data

2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels

3 Evaluating how well the results of a cluster analysis fit the data without reference to external information

- Use only the data 4 Comparing the results of two different sets of cluster analyses to

determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to

evaluate the entire clustering or just individual clusters

Different Aspects of Cluster Validation

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 88: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index

o Internal Index

o Relative Index

Sometimes these are referred to as criteria instead of indices

Measures of Cluster Validity

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 89: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Two matrices

Compute the correlation between the two matrices

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density or contiguity based clusters

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 90: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 91: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Order the similarity matrix with respect to cluster labels and inspect visually

Using Similarity Matrix for Cluster Validation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 92: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

DBSCAN

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Internal Measures: Silhouette Coefficient

The Silhouette Coefficient combines the ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings.

For an individual point i:
  • calculate a = the average distance of i to the points in its own cluster;
  • calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster;
  • the silhouette coefficient of i is s = (b - a) / max(a, b), which lies in [-1, 1]; the closer to 1, the better.

The Average Silhouette Width can then be calculated for a cluster or for an entire clustering.
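
scikit-learn implements exactly this measure, which gives a quick sketch of its use (the two-blob data set below is an assumption): silhouette_samples returns the per-point coefficients and silhouette_score the average silhouette width of the whole clustering.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(2, 0.2, (50, 2))])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

    s = silhouette_samples(data, labels)   # s_i = (b_i - a_i) / max(a_i, b_i)
    print(s.min(), s.max())                # per-point values in [-1, 1]
    print(silhouette_score(data, labels))  # average silhouette width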

External Measures of Cluster Validity: Entropy and Purity

Entropy and purity evaluate a clustering against externally supplied class labels: entropy measures how mixed the classes are within each cluster, and purity measures the fraction of each cluster belonging to its majority class.

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

(Algorithms for Clustering Data, Jain and Dubes)

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 93: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 94: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Points

Poin

ts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

01

02

03

04

05

06

07

08

09

1

Complete Link

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 95: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Using Similarity Matrix for Cluster Validation

1 2

3

5

6

4

7

DBSCAN

0

01

02

03

04

05

06

07

08

09

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 96: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure

without respect to external information

SSE is good for comparing two clusterings or two clusters (average SSE)

Can also be used to estimate the number of clusters

Internal Measures SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SSE

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 97: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Internal Measures SSE

SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 98: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Need a framework to interpret any measure

Statistics provide a framework for cluster validity

For comparing the results of two different sets of cluster analyses a framework is less necessary

Framework for Cluster Validity

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 99: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Example

Statistical Framework for SSE

0016 0018 002 0022 0024 0026 0028 003 0032 00340

5

10

15

20

25

30

35

40

45

50

SSE

Cou

nt

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 100: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

Statistical Framework for Correlation

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

x

y

Corr = -09235 Corr = -05810

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis

Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo

Algorithms for Clustering Data Jain and Dubes

Final Comment on Cluster Validity

  • Slide Number 1
  • What is Cluster Analysis
  • Applications of Cluster Analysis
  • What is not Cluster Analysis
  • Notion of a Cluster can be Ambiguous
  • Types of Clusterings
  • Partitional Clustering
  • Hierarchical Clustering
  • Other Distinctions Between Sets of Clusters
  • Types of Clusters
  • Types of Clusters Well-Separated
  • Types of Clusters Center-Based
  • Types of Clusters Contiguity-Based
  • Types of Clusters Density-Based
  • Types of Clusters Conceptual Clusters
  • Types of Clusters Objective Function
  • Types of Clusters Objective Function hellip
  • Characteristics of the Input Data Are Important
  • Clustering Algorithms
  • K-means Clustering
  • K-means Clustering ndash Details
  • Two different K-means Clusterings
  • Importance of Choosing Initial Centroids
  • Importance of Choosing Initial Centroids
  • Evaluating K-means Clusters
  • Importance of Choosing Initial Centroids hellip
  • Importance of Choosing Initial Centroids hellip
  • Problems with Selecting Initial Points
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • 10 Clusters Example
  • Solutions to Initial Centroids Problem
  • Handling Empty Clusters
  • Updating Centers Incrementally
  • Pre-processing and Post-processing
  • Bisecting K-means
  • Bisecting K-means Example
  • Limitations of K-means
  • Limitations of K-means Differing Sizes
  • Limitations of K-means Differing Density
  • Limitations of K-means Non-globular Shapes
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Overcoming K-means Limitations
  • Hierarchical Clustering
  • Strengths of Hierarchical Clustering
  • Hierarchical Clustering
  • Agglomerative Clustering Algorithm
  • Starting Situation
  • Intermediate Situation
  • Intermediate Situation
  • After Merging
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • How to Define Inter-Cluster Similarity
  • Cluster Similarity MIN or Single Link
  • Hierarchical Clustering MIN
  • Strength of MIN
  • Limitations of MIN
  • Cluster Similarity MAX or Complete Linkage
  • Hierarchical Clustering MAX
  • Strength of MAX
  • Limitations of MAX
  • Cluster Similarity Group Average
  • Hierarchical Clustering Group Average
  • Hierarchical Clustering Group Average
  • Cluster Similarity Wardrsquos Method
  • Hierarchical Clustering Comparison
  • Hierarchical Clustering Time and Space requirements
  • Hierarchical Clustering Problems and Limitations
  • MST Divisive Hierarchical Clustering
  • MST Divisive Hierarchical Clustering
  • Cluster Validity
  • Clusters found in Random Data
  • Different Aspects of Cluster Validation
  • Measures of Cluster Validity
  • Measuring Cluster Validity Via Correlation
  • Measuring Cluster Validity Via Correlation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Using Similarity Matrix for Cluster Validation
  • Internal Measures SSE
  • Internal Measures SSE
  • Framework for Cluster Validity
  • Statistical Framework for SSE
  • Statistical Framework for Correlation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Cohesion and Separation
  • Internal Measures Silhouette Coefficient
  • External Measures of Cluster Validity Entropy and Purity
  • Final Comment on Cluster Validity
Page 101: CSE4334/5334 Data Mining Clustering - University of …crystal.uta.edu/~cli/cse5334/slides/cse5334-fall17-13-clustering.pdf · CSE4334/5334 Data Mining. Clustering . ... Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

Cluster Cohesion Measures how closely related are objects in a cluster

Cluster Separation Measure how distinct or well-separated a cluster is from other clusters

Example Squared Error

Internal Measures Cohesion and Separation

sum sumisin

minus=i Cx

ii

mxWSS 2)(

sum minus=i

ii mmCBSS 2)(

Internal Measures Cohesion and Separation

Example SSE

1 2 3 4 5 times times times m1 m2

m

10919)354(2)513(2

1)545()544()512()511(22

2222

=+==minustimes+minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=2 clusters

100100)33(4

10)35()34()32()31(2

2222

=+==minustimes=

=minus+minus+minus+minus=

TotalBSSWSSK=1 cluster

A proximity graph based approach can also be used for cohesion and separation

Internal Measures Cohesion and Separation

cohesion separation

Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings

For an individual point i

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient

ab

External Measures of Cluster Validity Entropy and Purity

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes
