CLUSTERING
Komate Amphawan
1
Introduction
• Until now, we've assumed our training samples are “labeled” by their category membership.
• Methods that use labeled samples are said to be supervised; otherwise, they're said to be unsupervised.
• However:
  – Why would one even be interested in learning with unlabeled samples?
  – Is it even possible in principle to learn anything of value from unlabeled samples?
2
Why Unsupervised Learning? [1]
• Collecting and labeling a large set of sample
patterns can be surprisingly costly.
  – E.g., videos are virtually free, but accurately labeling the video pixels is expensive and time consuming.
3
Why Unsupervised Learning? [2]
• Extend to a larger training set by using semi-supervised learning.
  – Train a classifier on a small set of samples, then tune it up to make it run without supervision on a large, unlabeled set.
  – Or, in the reverse direction, let a large set of unlabeled data group automatically, then label the groupings found.
4
Why Unsupervised Learning? [3]
• To detect the gradual change of patterns over time.
• To find features that will then be useful for categorization.
• To gain insight into the nature or structure of the data during the early stages of an investigation.
5
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
6
Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets
7
What is not Cluster Analysis?
• Supervised classification
  – Has class label information
• Simple segmentation
  – Dividing students into different registration groups alphabetically, by last name
• Results of a query
  – Groupings are a result of an external specification
• Graph partitioning
  – Some mutual relevance and synergy, but the areas are not identical
8
Why data clustering?
• Natural Classification: degree of similarity among forms.
• Data exploration: discover underlying structure, generate hypotheses, detect anomalies.
• Compression: for organizing data.
• Applications: can be used by any scientific field that collects data!
• Google Scholar: 1500 clustering papers in 2007 alone!
9
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988
E.g.: Structure Discovery via Clustering
10
http://clusty.com
E.g.: Topic Discovery
• 800,000 scientific papers clustered into 776 topics based on how often the papers were cited together by authors of other papers
11
Map of Science, Nature, 2006
Clustering for Data Understanding and Applications [1]
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation database
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
12
Clustering for Data Understanding and Applications [2]
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
• Climate: understanding earth climate, find patterns of atmospheric and ocean
• Economic science: market research
13
Clustering as a Preprocessing Tool (Utility)
• Summarization
  – Preprocessing for regression, PCA, classification, and association analysis
• Compression
  – Image processing: vector quantization
• Finding K-nearest neighbors
  – Localizing search to one or a small number of clusters
• Outlier detection
  – Outliers are often viewed as those "far away" from any cluster
14
The Problem of Clustering
• Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.
15
16
Notion of a Cluster can be Ambiguous
17
Similarity/Dissimilarity metric
18
Properties of distance metrics
19
TYPES OF CLUSTERING
20
Types of Clusterings
21
Partitional Clustering
22
Hierarchical Clustering
23
Other Distinctions Between Sets of Clusters [1]
• Exclusive versus non-exclusive
  – In non-exclusive clusterings, points may belong to multiple clusters.
  – Can represent multiple classes or 'border' points
• Fuzzy versus non-fuzzy
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  – Weights must sum to 1
  – Probabilistic clustering has similar characteristics
24
Other Distinctions Between Sets of Clusters [2]
• Partial versus complete
  – In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, and densities
25
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
• Described by an Objective Function
26
Types of Clusters: Well-Separated
• Well-Separated Clusters:
  – A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
27
Types of Clusters: Center-Based
• Center-based
  – A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster.
  – The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.
28
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or Transitive)
  – A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
29
Types of Clusters: Density-Based
• Density-based
  – A cluster is a dense region of points, which is separated from other regions of high density by low-density regions.
  – Used when the clusters are irregular or intertwined, and when noise and outliers are present.
30
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
  – Finds clusters that share some common property or represent a particular concept.
31
Types of Clusters: Objective Function
• Clusters Defined by an Objective Function
  – Finds clusters that minimize or maximize an objective function.
  – Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function. (NP-hard)
32
Types of Clusters: Objective Function …
  – Can have global or local objectives.
    • Hierarchical clustering algorithms typically have local objectives
    • Partitional algorithms typically have global objectives
  – A variation of the global objective function approach is to fit the data to a parameterized model.
    • Parameters for the model are determined from the data.
    • Mixture models assume that the data is a 'mixture' of a number of statistical distributions.
33
Types of Clusters: Objective Function …
• Map the clustering problem to a different domain and solve a related problem in that domain
  – The proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points
  – Clustering is equivalent to breaking the graph into connected components, one for each cluster.
  – Want to minimize the edge weight between clusters and maximize the edge weight within clusters
34
Characteristics of the Input Data Are Important [1]
• Type of proximity or density measure
  – This is a derived measure, but central to clustering
• Sparseness
  – Dictates the type of similarity
  – Adds to efficiency
• Attribute type
  – Dictates the type of similarity
35
Characteristics of the Input Data Are Important [2]
• Type of Data
  – Dictates the type of similarity
  – Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution
36
CLUSTERING ALGORITHMS
37
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering
38
39
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
40
• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
41
K-means Clustering – Details
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to 'Until relatively few points change clusters'
• Complexity is O(n * K * I * d)
  – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
42
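To make the basic algorithm concrete, here is a minimal NumPy sketch of the loop described above: pick random initial centroids, assign each point to its closest centroid, recompute each centroid as the mean of its points, and stop when the centroids no longer move. The function name and defaults are illustrative choices, not part of the original slides.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means on an (n, d) array X with k clusters (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(max_iter):
        # assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return labels, centroids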
EXAMPLE (ASK TEACHER TO EXPLAIN)
43
44
45
46
47
Example: image segmentation
• K-means cluster all pixels using their colour vectors (3D)
• assign pixels to their clusters
• colour pixels by their cluster assignment
48
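As an illustration of the recipe above (not from the slides), the following sketch uses scikit-learn's KMeans to cluster the pixel colour vectors and recolour each pixel with its cluster centroid; the function name and k = 4 are assumed for the example.

import numpy as np
from sklearn.cluster import KMeans

def segment_by_colour(img, k=4, seed=0):
    """img: (H, W, 3) uint8 RGB image; returns the recoloured image and the label map."""
    h, w, _ = img.shape
    pixels = img.reshape(-1, 3).astype(float)                      # one 3-D colour vector per pixel
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    recoloured = km.cluster_centers_[km.labels_].reshape(h, w, 3)  # colour pixels by their cluster
    return recoloured.astype(np.uint8), km.labels_.reshape(h, w)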
Example application 2: face clustering
• Determine the principal cast of a feature film
• Approach: view this as a clustering problem on faces
• Algorithm outline
  – Detect faces for every fifth frame in the movie
  – Describe the face by a vector of intensities
  – Cluster using a K-means algorithm
49
Example – "Groundhog Day", 2000 frames
50
51
Two different K-means Clusterings
52
Importance of Choosing Initial Centroids
53
Importance of Choosing Initial Centroids
54
Evaluating K-means Clusters [1]
• The most common measure is the Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster centroid
  – To get SSE, we square these errors and sum them:
    SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(mi, x)²
  – x is a data point in cluster Ci and mi is the representative point for cluster Ci
    • One can show that mi corresponds to the center (mean) of the cluster
55
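The SSE above translates directly into a few lines of NumPy (a sketch; the function name is illustrative):

import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of each point to the centroid of its assigned cluster."""
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centroids))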
Evaluating K-means Clusters [2]
  – Given two clusterings, we can choose the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
    • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
56
Importance of Choosing Initial Centroids …
57
Importance of Choosing Initial Centroids …
58
Importance of Choosing Initial Centroids …
• For example, if K = 10, then probability = 10!/10^10 ≈ 0.00036
• Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't
• Consider an example of five pairs of clusters
59
Problems with Selecting Initial Points
• If there are K 'real' clusters then the chance of selecting one centroid from each cluster is small.
  – The chance is relatively small when K is large
  – If the clusters are the same size, n, then
    P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids)
      = (K! · n^K) / (Kn)^K = K! / K^K
60
10 Clusters Example
61
10 Clusters Example
62
10 Clusters Example: another initialization
63
10 Clusters Example: another initialization
64
Solutions to Initial Centroids Problem
• Multiple runs
  – Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than K initial centroids and then select among these initial centroids
65
Solutions to Initial Centroids Problem
• Select the most widely separated
• Postprocessing
• Bisecting K-means
  – Not as susceptible to initialization issues
66
Handling Empty Clusters
• The basic K-means algorithm can yield empty clusters
• Several strategies:
  – Choose the point that contributes most to SSE
  – Choose a point from the cluster with the highest SSE
  – If there are several empty clusters, the above can be repeated several times.
67
Updating Centers Incrementally
68
Pre-processing and Post-processing
69
Bisecting K-means
• Extension of the basic K-means algorithm
• Basic idea: Initially split the data into two clusters, then further split one of the clusters, and so on, until there are K clusters
• Side-product: results in hierarchical clusters
70
Bisecting K-Means Algorithm
• Initialization: Set of clusters contains one cluster with all points
• Repeat until the list of clusters contains K clusters:
    Remove a cluster from the list
    For a number of trials do:
        Bisect the cluster with basic K-means
    Select the bisection with the lowest total SSE
    Add both clusters to the list of clusters
71
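A rough sketch of this loop, using scikit-learn's KMeans for the trial bisections and the largest-SSE rule (discussed on a later slide) for picking the cluster to split; the function name and the number of trials are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters remain."""
    clusters = [X]                                     # start with one cluster holding all points
    while len(clusters) < k:
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))    # bisect the cluster with the largest SSE
        best_labels, best_sse = None, np.inf
        for t in range(n_trials):                      # several trial bisections with basic K-means
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(target)
            if km.inertia_ < best_sse:                 # keep the bisection with the lowest total SSE
                best_labels, best_sse = km.labels_, km.inertia_
        clusters.append(target[best_labels == 0])
        clusters.append(target[best_labels == 1])
    return clusters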
Bisecting K-means Example
72
Bisecting K-Means Example
73
Bisecting K-Means
• Which cluster should be selected for bisection?
  – The cluster with the largest SSE
  – The largest cluster (in terms of number of points)
• The 'trials' in the bisecting K-means algorithm try different seed initializations (see basic K-means)
74
Limitations of K-means
75
Limitations of K-means: Differing Sizes
76
Limitations of K-means: Differing Density
77
Limitations of K-means: Non-globular Shapes
78
Overcoming K-means Limitations
79
Overcoming K-means Limitations
80
Overcoming K-means Limitations
81
K-Means: Pros and Cons
• K-means is a simple clustering algorithm
• Runs efficiently
• Susceptible to initialization (but bisecting K-means overcomes this to some extent)
• Limitations with respect to differing sizes and densities and non-globular shapes
• Difficult to predict the best value for K in advance
82
Hierarchical Clustering
83
Dendrogram (ctd)
• While merging clusters one by one, we can draw a tree diagram known as a dendrogram
• Dendrograms are used to represent agglomerative clustering
• From dendrograms, we can get any number of clusters
• E.g.: say we wish to have 2 clusters, then cut the top one link
  – Cluster 1: q, r
  – Cluster 2: x, y, z, p
• Similarly, for 3 clusters, cut the 2 top links
  – Cluster 1: q, r
  – Cluster 2: x, y, z
  – Cluster 3: p
84
A dendrogram example
Strengths of Hierarchical Clustering
85
Hierarchical Clustering
86
Hierarchical clustering - algorithm
• This hierarchical clustering algorithm is a type of agglomerative clustering
• Given a set of N items to be clustered, the hierarchical clustering algorithm:
  1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item
  2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster
  3. Compute pairwise distances between the new cluster and each of the old clusters
  4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N
  5. Draw the dendrogram; with the complete hierarchical tree, if you want K clusters you just have to cut the K-1 top links
Note: any distance measure can be used: Euclidean, Manhattan, etc.
Hierarchical clustering – an example
• Assume X = [3 7 10 17 18 20]
1. There are 6 items, so create 6 clusters initially
2. Compute pairwise distances of clusters (assume Manhattan distance)
   The closest clusters are 17 and 18 (with distance = 1), so merge these two clusters together
3. Repeat step 2 (assume single linkage):
   The closest clusters are cluster{17, 18} and cluster{20} (with distance |18-20| = 2), so merge these two clusters together
Hierarchical clustering – an example (ctd)
• Go on repeating cluster mergers until one big cluster remains
• Draw the dendrogram (draw it in reverse of the cluster mergers) – remember the height of each cluster corresponds to the manner of cluster agglomeration
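The same example can be reproduced with SciPy's hierarchical clustering routines (SciPy is not mentioned in the slides; this is just an illustrative sketch using single linkage and Manhattan distance).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([3, 7, 10, 17, 18, 20], dtype=float).reshape(-1, 1)

# single-linkage agglomerative clustering with Manhattan (cityblock) distance
Z = linkage(X, method='single', metric='cityblock')

# cut the tree into, e.g., 3 clusters (equivalent to cutting the 2 top links)
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)          # cluster label for each of the 6 items

# dendrogram(Z) draws the merge tree (requires matplotlib)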
89
Starting Situation
90
Intermediate Situation
91
Intermediate Situation
92
After Merging
93
How to Define Inter-Cluster Similarity
94
How to Define Inter-Cluster Similarity
95
How to Define Inter-Cluster Similarity
96
How to Define Inter-Cluster Similarity
97
How to Define Inter-Cluster Similarity
98
Cluster Similarity: MIN or Single Link
99
Hierarchical Clustering: MIN
100
Strength of MIN
101
Limitations of MIN
102
Cluster Similarity: MAX or Complete Linkage
103
Hierarchical Clustering: MAX
104
Strength of MAX
105
Limitations of MAX
106
Cluster Similarity: Group Average
107
Hierarchical Clustering: Group Average
108
Hierarchical Clustering: Group Average
• Compromise between single and complete link
• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters
109
Comparing agglomerative vs exclusive (partitioning) clustering
• Agglomerative – advantages
  – Preferable for detailed data analysis
  – Provides more information than exclusive clustering
  – We can decide on any number of clusters without the need to redo the algorithm – in exclusive clustering, K has to be decided first; if a different K is used, then the whole exclusive clustering algorithm needs to be redone
  – One unique answer
• Disadvantages
  – Less efficient than exclusive clustering
  – No backtracking, i.e. can never undo previous steps
Cluster Similarity: Ward’s Method
• Similarity of two clusters is based on the increase in squared error when the two clusters are merged
  – Similar to group average if the distance between points is the squared distance
• Less susceptible to noise and outliers
• Biased towards globular clusters
• Hierarchical analogue of K-means
  – Can be used to initialize K-means
111
Hierarchical Clustering: Comparison
112
Hierarchical Clustering: Time and Space requirements
113
Hierarchical Clustering: Problems and Limitations
114
MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
  – Start with a tree that consists of any point
  – In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not
  – Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters
Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  – Discover clusters of arbitrary shape
  – Handle noise
  – One scan
  – Need density parameters as a termination condition
• Several interesting studies:
  – DBSCAN: Ester, et al. (KDD'96)
  – OPTICS: Ankerst, et al. (SIGMOD'99)
  – DENCLUE: Hinneburg & Keim (KDD'98)
  – CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
117
Density-Based Clustering: Basic Concepts
• Two parameters:
  – Eps: maximum radius of the neighbourhood
  – MinPts: minimum number of points in an Eps-neighbourhood of that point
• N_Eps(q) = {p belongs to D | dist(p, q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  – p belongs to N_Eps(q)
  – core point condition: |N_Eps(q)| ≥ MinPts
(Figure: example with MinPts = 5, Eps = 1 cm)
118
Density-Reachable and Density-Connected
• Density-reachable:
  – A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i
• Density-connected:
  – A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
119
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5)
120
DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps
    • These are points that are in the interior of a cluster
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
• If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, the complexity is O(n²).
123
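A compact NumPy sketch of this procedure, using the full O(n²) distance matrix (the no-spatial-index case); function and variable names are illustrative rather than a reference implementation.

import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN over an (n, d) array X; returns labels, with -1 meaning noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # all pairwise distances
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]     # Eps-neighbourhood of each point
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:        # not a core point: stays noise unless claimed as border
            continue
        labels[i] = cluster_id                 # i is a core point: start a new cluster
        seeds = list(neighbors[i])
        while seeds:                           # expand through all density-reachable points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id         # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    seeds.extend(neighbors[j]) # j is itself a core point: keep expanding from it
        cluster_id += 1
    return labels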
DBSCAN: Sensitive to Parameters
124
DBSCAN: Core, Border and Noise Points
(Figure: original points and the point types – core, border, and noise – with Eps = 10, MinPts = 4)
When DBSCAN Works Well
Original Points Clusters
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
(Figure: original points and the DBSCAN clusterings for MinPts = 4 with Eps = 9.75 and Eps = 9.92)
• Varying densities
• High-dimensional data
DBSCAN: Determining EPS and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor
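A small sketch of that plot (assuming NumPy and matplotlib; k = 4 is just an example value).

import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    """Plot the sorted distance of every point to its k-th nearest neighbour (used to pick Eps)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)                        # column 0 is the distance of each point to itself (0)
    kth = np.sort(D[:, k])                # distance to the k-th nearest neighbour, sorted
    plt.plot(kth)
    plt.xlabel('points sorted by distance')
    plt.ylabel(str(k) + '-th nearest neighbour distance')
    plt.show()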
Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
• But "clusters are in the eye of the beholder"!
• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters
Clusters found in Random Data
(Figure: random points, and the clusters found in them by DBSCAN, K-means, and complete link)
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
   – Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
  – External Index: Used to measure the extent to which cluster labels match externally supplied class labels.
    • Entropy
  – Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
    • Sum of Squared Error (SSE)
  – Relative Index: Used to compare two different clusterings or clusters.
    • Often an external or internal index is used for this function, e.g., SSE or entropy
• Sometimes these are referred to as criteria instead of indices
  – However, sometimes criterion is the general strategy and index is the numerical measure that implements the criterion.
Measuring Cluster Validity Via Correlation
• Two matrices
  – Proximity matrix
  – "Incidence" matrix
    • One row and one column for each data point
    • An entry is 1 if the associated pair of points belongs to the same cluster
    • An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
  – Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated.
• High correlation indicates that points that belong to the same cluster are close to each other.
• Not a good measure for some density- or contiguity-based clusters.
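A sketch of this computation in NumPy (function name illustrative). Note that when the proximity matrix holds distances, a good clustering gives a strongly negative correlation, as in the example values on the next slide.

import numpy as np

def incidence_proximity_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the cluster incidence matrix."""
    prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    incidence = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster, else 0
    iu = np.triu_indices(len(X), k=1)       # only the n(n-1)/2 entries above the diagonal
    return np.corrcoef(prox[iu], incidence[iu])[0, 1]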
Measuring Cluster Validity Via Correlation
• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
(Figure: two scatter plots with Corr = -0.9235 and Corr = -0.5810)
Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and inspect visually.
(Figure: a scatter plot of well-separated clusters and the corresponding similarity matrix ordered by cluster label)
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
(Figure: DBSCAN clusters in random data and the corresponding similarity matrix)
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
(Figure: K-means clusters in random data and the corresponding similarity matrix)
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
(Figure: complete-link clusters in random data and the corresponding similarity matrix)
Using Similarity Matrix for Cluster Validation
(Figure: DBSCAN clusters and the corresponding similarity matrix for a more complicated data set)
• Clusters in more complicated figures aren’t well separated
Internal Measures: SSE
• Internal Index: Used to measure the goodness of a clustering structure without respect to external information
  – SSE
• SSE is good for comparing two clusterings or two clusters (average SSE).
• Can also be used to estimate the number of clusters
(Figure: a data set and the curve of SSE versus the number of clusters K)
Internal Measures: SSE
• SSE curve for a more complicated data set
(Figure: a more complicated data set and the SSE of the clusters found using K-means)
Framework for Cluster Validity
• Need a framework to interpret any measure.
  – For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity
  – The more "atypical" a clustering result is, the more likely it represents valid structure in the data
  – Can compare the values of an index that result from random data or clusterings to those of a clustering result.
    • If the value of the index is unlikely, then the cluster results are valid
  – These approaches are more complicated and harder to understand.
• For comparing the results of two different sets of cluster analyses, a framework is less necessary.
  – However, there is the question of whether the difference between two index values is significant
Statistical Framework for SSE
• Example
  – Compare an SSE of 0.005 against three clusters in random data
  – The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2 – 0.8 for x and y values
(Figure: histogram of the SSE values for the random data sets, and the original data set)
Statistical Framework for Correlation
• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
(Figure: two scatter plots with Corr = -0.9235 and Corr = -0.5810)
Internal Measures: Cohesion and Separation
• Cluster Cohesion: Measures how closely related the objects in a cluster are
  – Example: SSE
• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
  – Cohesion is measured by the within-cluster sum of squares (SSE):
    WSS = Σ_i Σ_{x ∈ Ci} (x - mi)²
  – Separation is measured by the between-cluster sum of squares:
    BSS = Σ_i |Ci| (m - mi)²
  – where |Ci| is the size of cluster i, mi is the centroid of cluster i, and m is the overall mean
Internal Measures: Cohesion and Separation
• Example: SSE
  – BSS + WSS = constant
• Data: the points 1, 2, 4, 5 on a line (overall mean m = 3)
• K = 2 clusters ({1, 2} with centroid m1 = 1.5 and {4, 5} with centroid m2 = 4.5):
  WSS = (1-1.5)² + (2-1.5)² + (4-4.5)² + (5-4.5)² = 1
  BSS = 2 × (3-1.5)² + 2 × (4.5-3)² = 9
  Total = 1 + 9 = 10
• K = 1 cluster (centroid m = 3):
  WSS = (1-3)² + (2-3)² + (4-3)² + (5-3)² = 10
  BSS = 4 × (3-3)² = 0
  Total = 10 + 0 = 10
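A short NumPy sketch that reproduces the worked example above (function and variable names are illustrative).

import numpy as np

def wss_bss(X, labels):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squares."""
    m = X.mean(axis=0)                               # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        mi = pts.mean(axis=0)                        # cluster centroid
        wss += ((pts - mi) ** 2).sum()
        bss += len(pts) * ((m - mi) ** 2).sum()
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])
print(wss_bss(X, np.array([0, 0, 1, 1])))    # K = 2: (1.0, 9.0)
print(wss_bss(X, np.array([0, 0, 0, 0])))    # K = 1: (10.0, 0.0)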
Internal Measures: Cohesion and Separation
• A proximity-graph-based approach can also be used for cohesion and separation.
  – Cluster cohesion is the sum of the weights of all links within a cluster.
  – Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
(Figure: cohesion and separation illustrated on a proximity graph)
Internal Measures: Silhouette Coefficient
• The silhouette coefficient combines the ideas of cohesion and separation, but for individual points, as well as clusters and clusterings
• For an individual point i:
  – Calculate a = average distance of i to the points in its cluster
  – Calculate b = min (average distance of i to points in another cluster)
  – The silhouette coefficient for the point is then given by
    s = 1 - a/b if a < b (or s = b/a - 1 if a ≥ b, not the usual case)
  – Typically between 0 and 1.
  – The closer to 1, the better.
• Can calculate the average silhouette width for a cluster or a clustering
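A per-point sketch of this definition in NumPy (illustrative only; it assumes the point's own cluster has at least two members).

import numpy as np

def silhouette(X, labels, i):
    """Silhouette coefficient of point i: s = 1 - a/b if a < b, else b/a - 1."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = (labels == labels[i])
    own[i] = False
    a = d[own].mean()                               # average distance within the point's own cluster
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return 1 - a / b if a < b else b / a - 1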
External Measures of Cluster Validity: Entropy and Purity
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
Algorithms for Clustering Data, Jain and Dubes
USER'S DILEMMA
151
User's Dilemma
• What is a cluster?
• How to define pair-wise similarity?
• Which features and normalization scheme?
• How many clusters?
• Which clustering method?
• Are the discovered clusters and partition valid?
• Does the data have any clustering tendency?
152
R. Dubes and A. K. Jain, Clustering Techniques: The User's Dilemma, Pattern Recognition, 1976
Cluster Similarity?
• Compact Clusters
  – Within-cluster distance < between-cluster distance
• Connected Clusters
  – Within-cluster connectivity > between-cluster connectivity
• Ideal cluster: compact and isolated.
153
Representation (features)?
• There is no universal representation; representations are domain dependent.
154
Good Representation
• A good representation leads to compact and isolated clusters.
155
How do we weigh the features?
• Two different meaningful groupings produced by different weighting schemes.
156
How do we decide the Number of Clusters?
• The samples are generated by 6 independent classes, yet:
157
Cluster Validity
• Clustering algorithms find clusters, even if
there are no natural clusters in the data.
158
Comparing Clustering Methods
• Which clustering algorithm is the best?
159
There's no best Clustering Algorithm!
• Each algorithm imposes a structure on data.
• Good fit between model and data ⇒ success.
160
MORE ON CLUSTERING
161
Overlapping clustering – Fuzzy C-means algorithm
• Both agglomerative and exclusive clustering allow one data point to be in one cluster only
• Fuzzy C-means (FCM) is a method of clustering which allows one piece of data to belong to more than one cluster
• In other words, each data point is a member of every cluster, but with a certain degree known as its membership value
• This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition
• Fuzzy partitioning is carried out through an iterative procedure that updates the memberships uij and the cluster centroids cj using the formulas below, where m > 1 represents the degree of fuzziness (typically, m = 2)
162
c_j = ( Σ_{i=1}^{N} u_ij^m · x_i ) / ( Σ_{i=1}^{N} u_ij^m )

u_ij = 1 / Σ_{k=1}^{C} ( ||x_i - c_j|| / ||x_i - c_k|| )^( 2/(m-1) )
Overlapping clusters?
• Using both agglomerative and exclusive clustering methods, data X1 will be a member of cluster 1 only, while X2 will be a member of cluster 2 only
• However, using FCM, a data point X can be a member of both clusters
• FCM uses a distance measure too, so the further a data point is from a cluster centroid, the smaller its membership value will be
• For example, the membership value of X1 for cluster 1 is u11 = 0.73 and the membership value of X1 for cluster 2 is u12 = 0.27
• Similarly, the membership value of X2 for cluster 2 is u22 = 0.82 and the membership value of X2 for cluster 1 is u21 = 0.18
• Note: membership values are in the range 0 to 1, and the membership values of each data point across all the clusters add to 1
163
Fuzzy C-means algorithm
• Choose the number of clusters C, and m (typically 2)
1. Initialise all membership values uij randomly – matrix U(0)
2. At step k: compute the centroids cj using
   c_j = ( Σ_{i=1}^{N} u_ij^m · x_i ) / ( Σ_{i=1}^{N} u_ij^m )
3. Compute new membership values uij using
   u_ij = 1 / Σ_{k=1}^{C} ( ||x_i - c_j|| / ||x_i - c_k|| )^( 2/(m-1) )
4. Update the membership matrix: U(k) → U(k+1)
5. Repeat steps 2-4 until the change in membership values is very small, ||U(k+1) - U(k)|| < ε, where ε is some small value, typically 0.01
Note: || · || means Euclidean distance, | · | means Manhattan distance.
However, if the data is one dimensional (like the examples here), Euclidean distance = Manhattan distance.
164
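A NumPy sketch of this iteration (the function name, the random initialisation, and the small constant added to the distances to avoid division by zero are illustrative choices, not part of the slides).

import numpy as np

def fuzzy_c_means(X, C=2, m=2.0, eps=0.01, max_iter=100, seed=0):
    """Fuzzy C-means sketch: X is (N, d); returns centroids (C, d) and memberships U (C, N)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    U = rng.random((C, N))
    U = U / U.sum(axis=0)                            # step 1: random memberships, columns sum to 1
    centroids = np.zeros((C, X.shape[1]))
    for _ in range(max_iter):
        W = U ** m
        centroids = (W @ X) / W.sum(axis=1, keepdims=True)               # step 2: centroids
        dist = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=-1) + 1e-12
        U_new = dist ** (-2.0 / (m - 1.0))
        U_new = U_new / U_new.sum(axis=0)                                # step 3: memberships
        if np.abs(U_new - U).max() < eps:                                # step 5: stop on small change
            return centroids, U_new
        U = U_new                                                        # step 4: update U
    return centroids, U

# one-dimensional data like the example on the next slide
X = np.array([3.0, 7.0, 10.0, 17.0, 18.0, 20.0]).reshape(-1, 1)
centroids, U = fuzzy_c_means(X, C=2, m=2.0)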
Fuzzy C-means algorithm – an example
• X = [3 7 10 17 18 20] and assume C = 2
• Initially, set U randomly, e.g.
  U = [ 0.1 0.2 0.6 0.3 0.1 0.5
        0.9 0.8 0.4 0.7 0.9 0.5 ]
• Compute the centroids cj using the centroid formula above, with m = 2:
  c1 = 13.16; c2 = 11.81
• Compute the new membership values uij using the membership formula above, giving a new U
• Repeat the centroid and membership computations until the changes in membership values are smaller than, say, 0.01
165
Fuzzy C-means algorithm – using MATLAB
• Using the 'fcm' function in MATLAB
• The final membership values U give an indication of the similarity of each item to the clusters
• For e.g.: item 3 (value 10) is more similar to cluster 1 than to cluster 2, but item 2 (value 7) is even more similar to cluster 1
166
Fuzzy C-means algorithm – using MATLAB
• The 'fcm' function requires the Fuzzy Logic toolbox
• So, using MATLAB but without the 'fcm' function, implement the two updates directly:
  c_j = ( Σ_{i=1}^{N} u_ij^m · x_i ) / ( Σ_{i=1}^{N} u_ij^m )
  u_ij = 1 / Σ_{k=1}^{C} ( ||x_i - c_j|| / ||x_i - c_k|| )^( 2/(m-1) )
167
Clustering validity problem
• Problem 1
  – A problem we face in clustering is to decide the optimal number of clusters that fits a data set
• Problem 2
  – The various clustering algorithms behave in different ways depending on
    • the features of the data set (geometry and density distribution of clusters)
    • the input parameter values (e.g.: for K-means, the initial cluster choices influence the result)
• So, how do we know which clustering method is better/suitable?
• We need a clustering quality criterion
Clustering validity problem
• In general, good clusters should have
  – High intra-cluster similarity, i.e. low variance among intra-cluster members, where the variance of x is defined by
    var(x) = ( 1 / (N - 1) ) Σ_{i=1}^{N} (x_i - x̄)²
    with x̄ as the mean of x
• For e.g.: if x = [2 4 6 8], then x̄ = 5, so var(x) = 6.67
• Computing intra-cluster similarity is simple
• For e.g.: for the clusters shown,
  var(cluster1) = 2.33 while var(cluster2) = 12.33
• So, cluster 1 is better than cluster 2
• Note: use the 'var' function in MATLAB to compute variance
Clustering Quality Criteria
• But this does not tell us anything about how good the overall clustering is, or about the suitable number of clusters needed!
• To solve this, we also need to compute inter-cluster variance
• Good clusters will also have low inter-cluster similarity, i.e. high variance among inter-cluster members, in addition to high intra-cluster similarity, i.e. low variance among intra-cluster members
• One good measure of clustering quality is the Davies-Bouldin index
• The others are
  – Dunn's Validity Index
  – Silhouette method
  – C-index
  – Goodman-Kruskal index
• So, we compute the DB index for different numbers of clusters K, and the best value of the DB index tells us the appropriate K value, or how good the clustering method is
170
Davies-Bouldin index
• It is a function of the ratio of the sum of within-cluster (i.e. intra-cluster) scatter to between-cluster (i.e. inter-cluster) separation
• Because low scatter and a high distance between clusters lead to low values of Rij, a minimization of the DB index is desired
• Let C = {C1, ..., Ck} be a clustering of a set of N objects:
  DB = (1/k) Σ_{i=1}^{k} R_i
  with R_i = max_{j=1..k, j≠i} R_ij
  and R_ij = ( var(C_i) + var(C_j) ) / ||c_i - c_j||
  where Ci is the ith cluster and ci is the centroid of cluster i
• The numerator of Rij is a measure of intra-cluster similarity, while the denominator is a measure of inter-cluster separation
• Note: Rij = Rji
171
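A sketch of this computation for one-dimensional clusters, using the sample variance as the scatter term (function name illustrative). Applied to the two clusters whose variances and centroids appear in the continued example below, it gives DB ≈ 1.26.

import numpy as np

def davies_bouldin(clusters):
    """DB index for a list of 1-D clusters (NumPy arrays), using sample variance as the scatter."""
    k = len(clusters)
    var = [np.var(c, ddof=1) if len(c) > 1 else 0.0 for c in clusters]  # variance of one element is 0
    cen = [np.mean(c) for c in clusters]
    R = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                R[i, j] = (var[i] + var[j]) / abs(cen[i] - cen[j])
    return float(np.mean([R[i].max() for i in range(k)]))

print(davies_bouldin([np.array([3.0, 7.0, 10.0]), np.array([17.0, 18.0, 20.0])]))   # about 1.26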
Davies-Bouldin index example
• For e.g.: for the clusters shown
• Compute R_ij = ( var(C_i) + var(C_j) ) / ||c_i - c_j||
  – var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33
  – The centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33
  – Note: the variance of one element is zero, and its centroid is simply the element itself
  – So, R12 = 1, R13 = 0.152, R23 = 0.797
• Now, compute R_i = max_{j≠i} R_ij
  – R1 = 1 (max of R12 and R13); R2 = 1 (max of R21 and R23); R3 = 0.797 (max of R31 and R32)
• Finally, compute DB = (1/k) Σ_i R_i
  – DB = 0.932
172
Davies-Bouldin index example (ctd)
• For e.g.: for the clusters shown
• Compute R_ij = ( var(C_i) + var(C_j) ) / ||c_i - c_j||
  – Only 2 clusters here
  – var(C1) = 12.33 while var(C2) = 2.33; c1 = 6.67 while c2 = 18.33
  – R12 = 1.26
• Now compute R_i = max_{j≠i} R_ij
  – Since we have only 2 clusters here, R1 = R12 = 1.26; R2 = R21 = 1.26
• Finally, compute DB = (1/k) Σ_i R_i
  – DB = 1.26
173
Davies-Bouldin index example (ctd)
• DB with 2 clusters = 1.26; with 3 clusters = 0.932
• So, K = 3 is better than K = 2 (since a smaller DB means better clusters)
• In general, we will repeat the DB index computation for all cluster sizes from 2 to N-1
• So, if we have 10 data items, we will do clustering with K = 2, ..., 9 and then compute DB for each value of K
  – K = 10 is not done since each item would be its own cluster
• Then, we decide the best clustering size (and the best set of clusters) would be the one with the minimum value of the DB index
174
Study Guide
At the end of this lecture, you should be able to
• Define clustering
• Name major clustering approaches and differentiate between them
• State the K-means algorithm and apply it on a given data set
• State the hierarchical algorithm and apply it on a given data set
• Compare exclusive (partitioning) and agglomerative clustering methods
• Identify major problems with clustering techniques
• Define and use cluster validity measures such as the DB index on a given data set
Q & A
176