
CLUSTERING

Komate Amphawan


Introduction

• Until now, we've assumed our training samples are “labeled” by their category membership.

• Methods that use labeled samples are said to be supervised; otherwise, they are said to be unsupervised.

• However:

– Why would one even be interested in learning with unlabeled samples?

– Is it even possible in principle to learn anything of value from unlabeled samples?


Why Unsupervised Learning? [1]

• Collecting and labeling a large set of sample patterns can be surprisingly costly.

– E.g., videos are virtually free, but accurately labeling the video pixels is expensive and time consuming.


Why Unsupervised Learning? [2]

• Extend to a larger training set by using semi-supervised learning.

– Train a classifier on a small set of labeled samples, then tune it up to run without supervision on a large, unlabeled set.

– Or, in the reverse direction, let a large set of unlabeled data group automatically, then label the groupings found.


Why Unsupervised Learning? [3]

• To detect the gradual change of patterns over time.

• To find features that will then be useful for categorization.

• To gain insight into the nature or structure of the data during the early stages of an investigation.


What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.


Applications of Cluster Analysis

• Understanding

– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.

• Summarization

– Reduce the size of large data sets.


What is not Cluster Analysis?

• Supervised classification

– Has class label information

• Simple segmentation

– Dividing students into different registration groups alphabetically, by last name

• Results of a query

– Groupings are the result of an external specification

• Graph partitioning

– Some mutual relevance and synergy, but the areas are not identical


Why data clustering?

• Natural Classification: degree of similarity among forms.

• Data exploration: discover underlying structure, generate hypotheses, detect anomalies.

• Compression: for organizing data.

• Applications: can be used by any scientific field that collects data!

• Google Scholar: 1500 clustering papers in 2007 alone!

A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988


E.g.: Structure Discovering via Clustering

http://clusty.com


E.g.: Topic Discovery

• 800,000 scientific papers clustered into 776 topics based on how often the papers were cited together by authors of other papers.

Map of Science, Nature, 2006


Clustering for Data Understanding and Applications [1]

• Information retrieval: document clustering

• Land use: identification of areas of similar land use in an earth observation database

• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs


Clustering for Data Understanding and Applications [2]

• City-planning: Identifying groups of houses according to their house type, value, and geographical location

• Earthquake studies: observed earthquake epicenters should be clustered along continent faults

• Climate: understanding earth climate, finding patterns in atmospheric and ocean data

• Economic science: market research


Clustering as a Preprocessing Tool (Utility)

• Summarization:

– Preprocessing for regression, PCA, classification, and association analysis

• Compression:

– Image processing: vector quantization

• Finding K-nearest neighbors

– Localizing the search to one or a small number of clusters

• Outlier detection

– Outliers are often viewed as those "far away" from any cluster


The Problem of Clustering

• Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.


Notion of a Cluster can be Ambiguous


Similarity/Dissimilarity metric


Properties of distance metrics
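The figure for this slide is not reproduced here. For reference, the properties usually required of a distance metric d are the standard axioms (listed for completeness; they are not recoverable from the extracted slide itself):

d(x, y) ≥ 0 (non-negativity)
d(x, y) = 0 if and only if x = y (identity)
d(x, y) = d(y, x) (symmetry)
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)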


TYPES OF CLUSTERING


Types of Clusterings


Partitional Clustering


Hierarchical Clustering


Other Distinctions Between Sets of Clusters [1]

• Exclusive versus non-exclusive

– In non-exclusive clusterings, points may belong to multiple clusters.

– Can represent multiple classes or 'border' points.

• Fuzzy versus non-fuzzy

– In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1.

– Weights must sum to 1.

– Probabilistic clustering has similar characteristics.


Other Distinctions Between Sets of Clusters [2]

• Partial versus complete

– In some cases, we only want to cluster some of the data.

• Heterogeneous versus homogeneous

– Clusters of widely different sizes, shapes, and densities.


Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual

• Described by an Objective Function


Types of Clusters: Well-Separated

• Well-Separated Clusters:

– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.


Types of Clusters: Center-Based

• Center-based

– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of that cluster than to the center of any other cluster.

– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of the cluster.


Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive)

– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.


Types of Clusters: Density-Based

• Density-based

– A cluster is a dense region of points, which is separated from other regions of high density by low-density regions.

– Used when the clusters are irregular or intertwined, and when noise and outliers are present.


Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters

– Finds clusters that share some common property or represent a particular concept.


Types of Clusters: Objective Function

• Clusters Defined by an Objective Function

– Finds clusters that minimize or maximize an objective function.

– Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters using the given objective function. (NP-hard)


Types of Clusters: Objective Function …

– Can have global or local objectives.

• Hierarchical clustering algorithms typically have local objectives.

• Partitional algorithms typically have global objectives.

– A variation of the global objective function approach is to fit the data to a parameterized model.

• Parameters for the model are determined from the data.

• Mixture models assume that the data is a 'mixture' of a number of statistical distributions.


Types of Clusters: Objective Function …

• Map the clustering problem to a different domain and solve a related problem in that domain.

– The proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points.

– Clustering is equivalent to breaking the graph into connected components, one for each cluster.

– We want to minimize the edge weight between clusters and maximize the edge weight within clusters.


Characteristics of the Input Data Are Important [1]

• Type of proximity or density measure

– This is a derived measure, but central to clustering.

• Sparseness

– Dictates (controls) the type of similarity.

– Adds to efficiency.

• Attribute type

– Dictates the type of similarity.


Characteristics of the Input Data Are Important [2]

• Type of Data

– Dictates the type of similarity

– Other characteristics, e.g., autocorrelation

• Dimensionality

• Noise and Outliers

• Type of Distribution


CLUSTERING ALGORITHMS


Clustering Algorithms

• K-means and its variants

• Hierarchical clustering

• Density-based clustering


K-means Clustering

• Partitional clustering approach

• Each cluster is associated with a centroid (center point)

• Each point is assigned to the cluster with the closest centroid

• Number of clusters, K, must be specified

• The basic algorithm is very simple


• Initial centroids are often chosen randomly.

– Clusters produced vary from one run to another.

• The centroid is (typically) the mean of the points in the cluster.

• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.


K-means Clustering – Details

• K-means will converge for the common similarity measures mentioned above.

• Most of the convergence happens in the first few iterations.

– Often the stopping condition is changed to 'until relatively few points change clusters'.

• Complexity is O(n * K * I * d)

– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

(A minimal code sketch of the basic algorithm follows below.)
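As a concrete reference, here is a minimal NumPy sketch of the basic algorithm described above (random initial centroids, Euclidean distance, stop when assignments stop changing). It is an illustrative sketch, not the original course code; the function and variable names are ours.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # assignment step: each point goes to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # assignments stopped changing
        labels = new_labels
        # update step: each centroid becomes the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):                # leave an empty cluster's centroid unchanged
                centroids[j] = X[labels == j].mean(axis=0)
    sse = ((X - centroids[labels]) ** 2).sum()     # sum of squared error of the final clustering
    return labels, centroids, sse

Each iteration costs O(n * K * d), consistent with the O(n * K * I * d) complexity quoted above.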


EXAMPLE (ASK TEACHER TO EXPLAIN)


Example: image segmentation

• K-means clusters all pixels using their colour vectors (3D)

• Assign pixels to their clusters

• Colour each pixel by its cluster assignment

(A short scikit-learn sketch of this procedure follows below.)
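A minimal sketch of the segmentation step, assuming img is an H x W x 3 RGB array that has already been loaded (e.g. with matplotlib.image.imread); it uses scikit-learn's KMeans rather than the course's own implementation.

import numpy as np
from sklearn.cluster import KMeans

def segment_image(img, k=5):
    """Cluster pixels by colour and recolour each pixel with its cluster centre."""
    h, w, c = img.shape
    pixels = img.reshape(-1, c).astype(float)          # one colour vector per pixel
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)   # cluster the colour vectors
    segmented = km.cluster_centers_[km.labels_]        # recolour by cluster assignment
    return segmented.reshape(h, w, c)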


Example application 2: face clustering

• Determine the principal cast of a feature film

• Approach: view this as a clustering problem on faces

• Algorithm outline

– Detect faces in every fifth frame of the movie

– Describe each face by a vector of intensities

– Cluster using a K-means algorithm


Example – "Groundhog Day", 2000 frames


Two different K-means Clusterings


Importance of Choosing Initial Centroids


Evaluating K-means Clusters [1]

• The most common measure is the Sum of Squared Error (SSE).

– For each point, the error is the distance to the nearest cluster centroid.

– To get the SSE, we square these errors and sum them:

  SSE = Σ_i Σ_{x ∈ Ci} dist(x, m_i)²

– x is a data point in cluster Ci and m_i is the representative point for cluster Ci.

• One can show that m_i corresponds to the center (mean) of the cluster. (A one-line code version is sketched below.)
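For reference, the same SSE can be computed directly; a minimal sketch (the array names are illustrative), assuming labels and centroids come from a K-means run such as the one sketched earlier:

import numpy as np

def sse(X, labels, centroids):
    # sum, over clusters, of the squared distances from each point to its centroid
    return sum(((X[labels == i] - m) ** 2).sum() for i, m in enumerate(centroids))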


Evaluating K-means Clusters [2]

– Given two clusterings, we can choose the one with the smallest error.

– One easy way to reduce the SSE is to increase K, the number of clusters.

• A good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.


Importance of Choosing Initial Centroids …

• For example, if K = 10, then the probability of selecting one initial centroid from each cluster is 10!/10^10 ≈ 0.00036.

• Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't.

• Consider an example of five pairs of clusters.


Problems with Selecting Initial Points

• If there are K 'real' clusters then the chance of selecting one centroid from each cluster is small.

– The chance is relatively small when K is large.

– If the clusters are all of the same size, n, then the probability is

  P = (K! n^K) / (Kn)^K = K!/K^K

  (consistent with the 10!/10^10 example on the previous slide).


10 Clusters Example


10 Clusters Example: another initialization


Solutions to Initial Centroids Problem

• Multiple runs

– Helps, but probability is not on your side.

• Sample the data and use hierarchical clustering to determine initial centroids.

• Select more than K initial centroids and then select among these initial centroids.


Solutions to Initial Centroids Problem

• Select the most widely separated centroids.

• Postprocessing

• Bisecting K-means

– Not as susceptible (sensitive) to initialization issues.


Handling Empty Clusters

• The basic K-means algorithm can yield empty clusters.

• Several strategies:

– Choose the point that contributes most to the SSE as the replacement centroid.

– Choose a point from the cluster with the highest SSE.

– If there are several empty clusters, the above can be repeated several times.


Updating Centers Incrementally


Pre-processing and Post-processing


Bisecting K-means

• Extension of the basic K-means algorithm.

• Basic idea: initially split the data into two clusters, then further split one of the clusters, and so on, until there are K clusters.

• Side product: results in hierarchical clusters.


Bisecting K-Means Algorithm

• Initialization: the set of clusters contains one cluster with all points.

• Repeat until the list of clusters contains K clusters:

  Remove a cluster from the list

  For the chosen number of trials:

    Bisect the cluster with basic K-means

  Select the bisection with the lowest total SSE

  Add both resulting clusters to the list of clusters

(A code sketch of this procedure follows below.)
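A minimal sketch of the procedure, reusing the kmeans function sketched earlier (which returns labels, centroids, and SSE); the cluster-selection rule and the trials parameter are illustrative choices, not the original course code.

import numpy as np

def bisecting_kmeans(X, K, trials=5, seed=0):
    clusters = [np.arange(len(X))]                 # start with one cluster holding all point indices
    while len(clusters) < K:
        # pick the cluster with the largest SSE to bisect (largest size is another common rule)
        def cluster_sse(idx):
            pts = X[idx]
            return ((pts - pts.mean(axis=0)) ** 2).sum()
        i = max(range(len(clusters)), key=lambda j: cluster_sse(clusters[j]))
        idx = clusters.pop(i)
        best = None
        for t in range(trials):                    # several trial bisections with different seeds
            labels, _, sse = kmeans(X[idx], 2, seed=seed + t)
            if best is None or sse < best[0]:
                best = (sse, labels)               # keep the bisection with the lowest total SSE
        clusters.append(idx[best[1] == 0])
        clusters.append(idx[best[1] == 1])
    return clusters                                # list of index arrays, one per cluster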


Bisecting K-means Example


Bisecting K-Means

• Which cluster should be selected for bisection?

– The cluster with the largest SSE

– The largest cluster (in terms of number of points)

• The 'trials' in the bisecting K-means algorithm try different seed initializations (see basic K-means).


Limitations of K-means


Limitations of K-means: Differing Sizes


Limitations of K-means: Differing Density


Limitations of K-means: Non-globular Shapes


Overcoming K-means Limitations


K-Means: Pros and Cons

• K-means is a simple clustering algorithm.

• Runs efficiently.

• Susceptible to initialization (but bisecting K-means overcomes this to some extent).

• Limitations with respect to differing sizes and densities and non-globular shapes.

• Difficult to predict the best value for K in advance.


Hierarchical Clustering


Dendrogram (ctd)

• While merging clusters one by one, we can draw a tree diagram known as a dendrogram.

• Dendrograms are used to represent agglomerative clustering.

• From a dendrogram, we can get any number of clusters.

• E.g., say we wish to have 2 clusters; then cut the top link:

– Cluster 1: q, r

– Cluster 2: x, y, z, p

• Similarly, for 3 clusters, cut the 2 top links:

– Cluster 1: q, r

– Cluster 2: x, y, z

– Cluster 3: p

(Figure: a dendrogram example)


Strengths of Hierarchical Clustering


Hierarchical clustering - algorithm

• This hierarchical clustering algorithm is of the agglomerative (bottom-up) type.

• Given a set of N items to be clustered, the algorithm:

1. Starts by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.

2. Finds the closest (most similar) pair of clusters and merges them into a single cluster, so that now you have one less cluster.

3. Computes the pairwise distances between the new cluster and each of the old clusters.

4. Repeats steps 2 and 3 until all items are clustered into a single cluster of size N.

5. Draws the dendrogram; with the complete hierarchical tree, if you want K clusters you just have to cut the K-1 top links.

Note: any distance measure can be used: Euclidean, Manhattan, etc.

Lecture 7 slides for CC282 Machine Learning, R. Palaniappan, 2008


Hierarchical clustering – an example

• Assume X = [3 7 10 17 18 20]

1. There are 6 items, so create 6 clusters initially.

2. Compute pairwise distances between clusters (assume Manhattan distance). The closest clusters are 17 and 18 (distance = 1), so merge these two clusters together.

3. Repeat step 2 (assume single linkage): the closest clusters are now cluster {17, 18} and cluster {20} (distance |18 - 20| = 2), so merge these two clusters together.


Hierarchical clustering – an example (ctd)

• Go on repeating cluster mergers until one big cluster remains.

• Draw the dendrogram (draw it in reverse order of the cluster mergers); remember that the height of each merge corresponds to the distance at which the clusters were agglomerated.

(A SciPy sketch reproducing this example follows below.)
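The same example can be reproduced with SciPy (single linkage; for 1-D data the Manhattan and Euclidean distances coincide). This is an illustrative sketch, not part of the original lecture code.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([3, 7, 10, 17, 18, 20], dtype=float).reshape(-1, 1)
Z = linkage(X, method='single')                  # agglomerative clustering, single linkage
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree to get 2 clusters
print(labels)                                    # {3, 7, 10} in one cluster, {17, 18, 20} in the other

(scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram.)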


Starting Situation


Intermediate Situation


Page 93: Komate Amphawan CLUSTERINGkomate/886464... · Why Unsupervised Learning? [2] • Extend to a larger training set by using semi-supervised learning. Train a classier on a small set

After Merging


How to Define Inter-Cluster Similarity


Cluster Similarity: MIN or Single Link


Hierarchical Clustering: MIN


Strength of MIN


Limitations of MIN


Cluster Similarity: MAX or Complete Linkage


Hierarchical Clustering: MAX


Strength of MAX


Limitations of MAX


Cluster Similarity: Group Average


Hierarchical Clustering: Group Average

• Compromise between single and complete link.

• Strengths

– Less susceptible to noise and outliers.

• Limitations

– Biased towards globular clusters.


Comparing agglomerative vs exclusive (partitioning) clustering

• Agglomerative – advantages

– Preferable for detailed data analysis.

– Provides more information than exclusive clustering.

– We can decide on any number of clusters without the need to redo the algorithm; in exclusive clustering, K has to be decided first, and if a different K is used, the whole exclusive clustering algorithm has to be redone.

– One unique answer.

• Disadvantages

– Less efficient than exclusive clustering.

– No backtracking, i.e., previous steps can never be undone.

Lecture 7 slides for CC282 Machine Learning, R. Palaniappan, 2008


Cluster Similarity: Ward’s Method

• Similarity of two clusters is based on the increase in squared error when the two clusters are merged.

– Similar to group average if the distance between points is the squared distance.

• Less susceptible to noise and outliers.

• Biased towards globular clusters.

• Hierarchical analogue of K-means

– Can be used to initialize K-means.


Hierarchical Clustering: Comparison


Hierarchical Clustering: Time and Space requirements


Hierarchical Clustering: Problems and Limitations


MST: Divisive Hierarchical Clustering

• Build an MST (Minimum Spanning Tree)

– Start with a tree that consists of any point.

– In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not.

– Add q to the tree and put an edge between p and q.


• Use MST for constructing hierarchy of clusters


Density-Based Clustering Methods

• Clustering based on density (a local cluster criterion), such as density-connected points.

• Major features:

– Discover clusters of arbitrary shape

– Handle noise

– One scan

– Need density parameters as a termination condition

• Several interesting studies:

– DBSCAN: Ester, et al. (KDD'96)

– OPTICS: Ankerst, et al. (SIGMOD'99)

– DENCLUE: Hinneburg & D. Keim (KDD'98)

– CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)


Density-Based Clustering: Basic Concepts

• Two parameters:

– Eps: maximum radius of the neighbourhood

– MinPts: minimum number of points in an Eps-neighbourhood of that point

• N_Eps(q) = {p in D | dist(p, q) ≤ Eps}

• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if

– p belongs to N_Eps(q), and

– q satisfies the core point condition: |N_Eps(q)| ≥ MinPts

(Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm)


Density-Reachable and Density-Connected

• Density-reachable:

– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i.

• Density-connected:

– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

(Figures: a chain q = p1, …, p illustrating density-reachability; a common point o illustrating density-connectedness)


DBSCAN: Density-Based Spatial Clustering of Applications with Noise

• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.

• Discovers clusters of arbitrary shape in spatial databases with noise.

(Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5)


DBSCAN

• DBSCAN is a density-based algorithm.

– Density = number of points within a specified radius (Eps).

– A point is a core point if it has more than a specified number of points (MinPts) within Eps.

• These are points that are in the interior of a cluster.

– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.

– A noise point is any point that is not a core point or a border point.


DBSCAN: Core, Border, and Noise Points


DBSCAN: The Algorithm

• Arbitrarily select a point p.

• Retrieve all points density-reachable from p w.r.t. Eps and MinPts.

• If p is a core point, a cluster is formed.

• If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.

• Continue the process until all of the points have been processed.

• If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, the complexity is O(n²).

(A from-scratch sketch of this procedure follows below.)
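A compact from-scratch sketch of the procedure described above (an illustration, not the original DBSCAN implementation; it uses a full distance matrix, so it shows the O(n²) behaviour mentioned for the no-index case). Points still labelled -1 at the end are noise.

import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    # Eps-neighbourhood of every point (each point is in its own neighbourhood)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.where(dists[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)                     # -1 = unassigned / noise
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbours[p]) < min_pts:
            continue                            # already assigned, or p is not a core point
        labels[p] = cluster                     # start a new cluster from the core point p
        queue = list(neighbours[p])
        while queue:                            # grow the cluster by density-reachability
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbours[q]) >= min_pts:   # q is itself a core point: expand further
                    queue.extend(neighbours[q])
        cluster += 1
    return labels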


DBSCAN: Sensitive to Parameters


DBSCAN: Core, Border and Noise Points

(Figure: original points and their point types (core, border, noise); Eps = 10, MinPts = 4)


When DBSCAN Works Well

(Figure: original points and the clusters found by DBSCAN)

• Resistant to Noise

• Can handle clusters of different shapes and sizes


When DBSCAN Does NOT Work Well

(Figure: original points and the DBSCAN results for MinPts = 4, Eps = 9.75 and for MinPts = 4, Eps = 9.92)

• Varying densities

• High-dimensional data


DBSCAN: Determining EPS and MinPts

• The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance.

• Noise points have their k-th nearest neighbor at a farther distance.

• So, plot the sorted distance of every point to its k-th nearest neighbor; the 'knee' of this curve suggests a value for Eps (a short sketch follows below).
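A minimal sketch of this k-distance computation using scikit-learn (an illustrative helper, not from the slides):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(X, k=4):
    """Sorted distance of every point to its k-th nearest neighbour."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own 0-th neighbour
    dists, _ = nn.kneighbors(X)
    return np.sort(dists[:, k])                      # plot this curve and look for the 'knee'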


Cluster Validity

• For supervised classification we have a variety of measures to evaluate how good our model is:

– Accuracy, precision, recall

• For cluster analysis, the analogous question is: how do we evaluate the 'goodness' of the resulting clusters?

• But 'clusters are in the eye of the beholder'!

• Then why do we want to evaluate them?

– To avoid finding patterns in noise

– To compare clustering algorithms

– To compare two sets of clusters

– To compare two clusters


Clusters found in Random Data

(Figure: random points, and the clusterings found on them by DBSCAN, K-means, and complete link)


Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (using only the data).

4. Comparing the results of two different sets of cluster analyses to determine which is better.

5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.


Measures of Cluster Validity

• Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

– External Index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy).

– Internal Index: used to measure the goodness of a clustering structure without respect to external information (e.g., Sum of Squared Error, SSE).

– Relative Index: used to compare two different clusterings or clusters; often an external or internal index is used for this function, e.g., SSE or entropy.

• Sometimes these are referred to as criteria instead of indices.

– However, sometimes 'criterion' is the general strategy and 'index' is the numerical measure that implements the criterion.


Measuring Cluster Validity Via Correlation

• Two matrices:

– Proximity Matrix

– 'Incidence' Matrix

• One row and one column for each data point

• An entry is 1 if the associated pair of points belong to the same cluster

• An entry is 0 if the associated pair of points belong to different clusters

• Compute the correlation between the two matrices.

– Since the matrices are symmetric, only the correlation between the n(n-1)/2 off-diagonal entries needs to be calculated.

• High correlation indicates that points that belong to the same cluster are close to each other.

• Not a good measure for some density- or contiguity-based clusters.

(A short code sketch follows below.)
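A minimal sketch of this computation (illustrative names; it assumes a distance-based proximity matrix, so a good clustering produces a strongly negative correlation, as in the Corr = -0.9235 example on the next slide):

import numpy as np

def cluster_label_correlation(X, labels):
    labels = np.asarray(labels)
    n = len(X)
    prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)    # proximity (distance) matrix
    incid = (labels[:, None] == labels[None, :]).astype(float)      # incidence matrix: 1 if same cluster
    iu = np.triu_indices(n, k=1)             # symmetric matrices: use the n(n-1)/2 upper entries
    return np.corrcoef(prox[iu], incid[iu])[0, 1]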


Measuring Cluster Validity Via Correlation

• Correlation of the incidence and proximity matrices for the K-means clusterings of the following two data sets:

(Figure: the two data sets and their K-means clusterings)

Corr = -0.9235 Corr = -0.5810


Using Similarity Matrix for Cluster Validation

• Order the similarity matrix with respect to cluster labels and inspect visually.

(Figure: a clustered data set and its similarity matrix sorted by cluster label)


Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

(Figure: random points clustered with DBSCAN and the corresponding sorted similarity matrix)


Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

(Figure: random points clustered with K-means and the corresponding sorted similarity matrix)


Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

(Figure: random points clustered with complete link and the corresponding sorted similarity matrix)


Using Similarity Matrix for Cluster Validation

(Figure: a more complicated data set with seven clusters found by DBSCAN, and the corresponding sorted similarity matrix)


• Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

• Internal Index: used to measure the goodness of a clustering structure without respect to external information.

– SSE

• SSE is good for comparing two clusterings or two clusters (average SSE).

• SSE can also be used to estimate the number of clusters.

(Figure: a clustered data set and the corresponding curve of SSE versus the number of clusters K)


Internal Measures: SSE

• SSE curve for a more complicated data set

(Figure: a more complicated data set and the SSE curve for the clusters found using K-means)


Framework for Cluster Validity

• Need a framework to interpret any measure.

– For example, if our measure of evaluation has the value 10, is that good, fair, or poor?

• Statistics provide a framework for cluster validity.

– The more 'atypical' a clustering result is, the more likely it represents valid structure in the data.

– Can compare the values of an index that result from random data or random clusterings to those of the actual clustering result.

• If the value of the index is unlikely under randomness, then the cluster results are valid.

– These approaches are more complicated and harder to understand.

• For comparing the results of two different sets of cluster analyses, a framework is less necessary.

– However, there is the question of whether the difference between two index values is significant.

Page 143: Komate Amphawan CLUSTERINGkomate/886464... · Why Unsupervised Learning? [2] • Extend to a larger training set by using semi-supervised learning. Train a classier on a small set

Statistical Framework for SSE

• Example: compare an SSE of 0.005 against three clusters in random data.
– The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2 – 0.8 for the x and y values (a sketch of this test follows).

[Figure: the well-separated three-cluster data set in the (x, y) plane, and the histogram of SSE values for three clusters in random data; the histogram spans roughly 0.016 to 0.034, so an observed SSE of 0.005 is highly atypical.]
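A hedged MATLAB sketch of this randomisation test is given below; it assumes the Statistics and Machine Learning Toolbox 'kmeans' function, and the number of replicates is an arbitrary choice. It follows the setup described above: 500 random sets of 100 points on [0.2, 0.8] x [0.2, 0.8], each clustered into 3 groups.

% Randomisation test for SSE (assumes the Statistics and Machine Learning Toolbox)
nTrials = 500;
sseRand = zeros(nTrials, 1);
for t = 1:nTrials
    R = 0.2 + 0.6 * rand(100, 2);                   % 100 random points in [0.2, 0.8] x [0.2, 0.8]
    [~, ~, sumd] = kmeans(R, 3, 'Replicates', 3);   % sumd: within-cluster sums of squared distances
    sseRand(t) = sum(sumd);                         % total within-cluster SSE for this trial
end
histogram(sseRand); xlabel('SSE of 3 clusters in random data'); ylabel('Count');
% An observed SSE of 0.005 would fall far below this distribution, suggesting real structure.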

Statistical Framework for Correlation

• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: the two data sets in the (x, y) plane; the well-separated data set gives Corr = -0.9235 and the random data set gives Corr = -0.5810.]

Internal Measures: Cohesion and Separation

• Cluster Cohesion: measures how closely related the objects in a cluster are.
– Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
• Example: Squared Error
– Cohesion is measured by the within-cluster sum of squares (SSE):
  WSS = Σi Σ(x in Ci) (x − mi)²
– Separation is measured by the between-cluster sum of squares:
  BSS = Σi |Ci| · (m − mi)²
– where |Ci| is the size of cluster i, mi is the centroid (mean) of cluster i, and m is the overall mean of the data.

Internal Measures: Cohesion and Separation

• Example: SSE
– BSS + WSS = constant (checked in the sketch below)
• Data points 1, 2, 4 and 5 on a line, with overall mean m = 3 and cluster centroids m1 = 1.5 (for {1, 2}) and m2 = 4.5 (for {4, 5}).

K=1 cluster:
  WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
  BSS = 4 × (3 − 3)² = 0
  Total = 10 + 0 = 10

K=2 clusters:
  WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
  BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
  Total = 1 + 9 = 10
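The arithmetic above can be reproduced with a few lines of MATLAB; this is only a verification sketch, with the cluster assignment {1, 2} and {4, 5} taken from the example.

% Check of the WSS/BSS example for the points 1, 2, 4, 5
X = [1 2 4 5];
m = mean(X);                                     % overall mean = 3
% K = 1: a single cluster holding all points
WSS1 = sum((X - m).^2);                          % (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS1 = numel(X) * (m - m)^2;                     % 4 x (3-3)^2 = 0
% K = 2: clusters {1, 2} and {4, 5}
clusters = {X(1:2), X(3:4)};
WSS2 = 0; BSS2 = 0;
for i = 1:numel(clusters)
    mi = mean(clusters{i});                      % 1.5 and 4.5
    WSS2 = WSS2 + sum((clusters{i} - mi).^2);    % 0.5 + 0.5 = 1
    BSS2 = BSS2 + numel(clusters{i}) * (m - mi)^2;   % 2 x 1.5^2 + 2 x 1.5^2 = 9
end
[WSS1 + BSS1, WSS2 + BSS2]                       % both equal 10: BSS + WSS is constant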

Internal Measures: Cohesion and Separation

• A proximity graph based approach can also be used for cohesion and separation.
– Cluster cohesion is the sum of the weights of all links within a cluster.
– Cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.

[Figure: a proximity graph illustrating cohesion (edge weights within a cluster) and separation (edge weights between clusters).]

Internal Measures: Silhouette Coefficient

• The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
• For an individual point i:
– Calculate a = average distance of i to the points in its cluster.
– Calculate b = min (average distance of i to the points in another cluster).
– The silhouette coefficient for the point is then
  s = 1 − a/b if a < b (or s = b/a − 1 if a ≥ b, not the usual case)
– Typically between 0 and 1; the closer to 1 the better.
• Can calculate the average silhouette width for a cluster or a clustering (a small sketch follows).
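Below is a minimal per-point silhouette sketch in MATLAB (not the toolbox 'silhouette' function). The data points and labels are made-up assumptions; the single expression s = (b - a) / max(a, b) is used because it is equivalent to the two-case formula above.

% Per-point silhouette coefficients for a small labelled data set (illustrative values)
X = [1 2; 1.5 1.8; 1 0.6; 5 8; 8 8; 9 11];   % n x 2 data points (assumed)
labels = [1 1 1 2 2 2]';                     % assumed cluster labels
n = size(X, 1);
s = zeros(n, 1);
for i = 1:n
    d = sqrt(sum((X - X(i, :)).^2, 2));      % Euclidean distance from point i to every point
    own = (labels == labels(i)); own(i) = false;
    a = mean(d(own));                        % average distance to the points in its own cluster
    b = inf;
    for c = setdiff(labels, labels(i))'      % nearest other cluster (by average distance)
        b = min(b, mean(d(labels == c)));
    end
    s(i) = (b - a) / max(a, b);              % = 1 - a/b if a < b, = b/a - 1 otherwise
end
mean(s)                                      % average silhouette width of the clustering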

External Measures of Cluster Validity: Entropy and Purity


Final Comment on Cluster Validity

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes

USER'S DILEMMA (a difficult situation)

User's Dilemma (a difficult situation)

• What is a cluster?
• How to define pair-wise similarity?
• Which features and normalization scheme?
• How many clusters?
• Which clustering method?
• Are the discovered clusters and partition valid?
• Does the data have any clustering tendency?

R. Dubes and A. K. Jain, Clustering Techniques: The User's Dilemma, Pattern Recognition, 1976

Cluster Similarity?

• Compact Clusters
– Within-cluster distance < between-cluster distance
• Connected Clusters
– Within-cluster connectivity > between-cluster connectivity
• Ideal cluster: compact and isolated.

Representation (features)?

• There's no universal representation; representations are domain dependent.

Good Representation

• A good representation leads to compact and isolated clusters.


How do we weigh the features?

• Two different meaningful groupings produced by different weighting schemes.


How do we decide the Number of Clusters?

• The samples are generated by 6 independent classes, yet the data can plausibly be grouped into different numbers of clusters.

Cluster Validity

• Clustering algorithms find clusters, even if

there are no natural clusters in the data.


Comparing Clustering Methods

• Which clustering algorithm is the best?


There's no best Clustering Algorithm!

• Each algorithm imposes a structure on data.

• Good fit between model and data ⇒ success.


MORE ON CLUSTERING


Overlapping clustering – Fuzzy C-means algorithm

• Both agglomerative and exclusive clustering allow each data point to belong to only one cluster.
• Fuzzy C-means (FCM) is a method of clustering which allows one piece of data to belong to more than one cluster.
• In other words, each data point is a member of every cluster, but with a certain degree known as the membership value.
• This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition.
• Fuzzy partitioning is carried out through an iterative procedure that updates the memberships uij and the cluster centroids cj by
  cj = Σ(i=1..N) uij^m · xi / Σ(i=1..N) uij^m
  uij = 1 / Σ(k=1..C) ( ||xi − cj|| / ||xi − ck|| )^(2/(m−1))
  where m > 1 represents the degree of fuzziness (typically, m = 2).

Overlapping clusters?

• Using both agglomerative and exclusive clustering methods, data X1 will be a member of cluster 1 only, while X2 will be a member of cluster 2 only.
• However, using FCM, each data point can be a member of both clusters.
• FCM uses a distance measure too, so the further a data point is from a cluster centroid, the smaller its membership value for that cluster will be.
• For example, the membership value of X1 for cluster 1 is u11 = 0.73 and the membership value of X1 for cluster 2 is u12 = 0.27.

• Similarly, the membership value of X2 for cluster 2 is u22 = 0.82 and the membership value of X2 for cluster 1 is u21 = 0.18.
• Note: membership values are in the range 0 to 1, and the membership values of each data point over all the clusters add up to 1.

Fuzzy C-means algorithm

• Choose the number of clusters C and the fuzziness m (typically m = 2).
1. Initialise all membership values uij randomly – matrix U0.
2. At step k: compute the centroids cj using
   cj = Σ(i=1..N) uij^m · xi / Σ(i=1..N) uij^m
3. Compute the new membership values uij using
   uij = 1 / Σ(k=1..C) ( ||xi − cj|| / ||xi − ck|| )^(2/(m−1))
4. Update the membership matrix: Uk → Uk+1.
5. Repeat steps 2–4 until the change in membership values is very small: ||Uk+1 − Uk|| < ε, where ε is some small value, typically 0.01.

Note: ||·|| denotes the Euclidean distance and |·| the Manhattan distance; however, if the data is one-dimensional (as in the examples here), the Euclidean distance equals the Manhattan distance.

Fuzzy C-means algorithm – an example

• X = [3 7 10 17 18 20] and assume C = 2.
• Initially, set U randomly (a 2 × 6 membership matrix whose columns each sum to 1).
• Compute the centroids cj using cj = Σi uij^m · xi / Σi uij^m, assuming m = 2:
  c1 = 13.16; c2 = 11.81
• Compute the new membership values uij using uij = 1 / Σk ( ||xi − cj|| / ||xi − ck|| )^(2/(m−1)), which gives an updated matrix U.
• Repeat the centroid and membership computations until the changes in the membership values are smaller than, say, 0.01.

Fuzzy C-means algorithm – using MATLAB

• Use the 'fcm' function in MATLAB (see the usage sketch below).
• The final membership values U give an indication of how similar each item is to the clusters.
• For example: item 3 (the value 10) is more similar to cluster 1 than to cluster 2, but item 2 (the value 7) is even more similar to cluster 1.
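A usage sketch (assuming the Fuzzy Logic Toolbox is installed); the exact membership values depend on the random initialisation, so the numbers printed will vary from run to run.

% Clustering the slide's 1-D data with the toolbox 'fcm' function
X = [3 7 10 17 18 20]';        % N x 1 data matrix (one feature per row)
[centers, U] = fcm(X, 2);      % 2 clusters; U is 2 x N, U(j, i) = membership of item i in cluster j
U(:, 3)                        % memberships of item 3 (the value 10) in the two clusters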

Fuzzy C-means algorithm – using MATLAB

• The 'fcm' function requires the Fuzzy Logic Toolbox.
• So, using MATLAB but without the 'fcm' function, the two update equations are implemented directly (a sketch follows):
  cj = Σ(i=1..N) uij^m · xi / Σ(i=1..N) uij^m
  uij = 1 / Σ(k=1..C) ( ||xi − cj|| / ||xi − ck|| )^(2/(m−1))
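The code shown on the original slide is not recoverable from this text version, so the following is a re-implementation sketch of the two update equations for the one-dimensional example with m = 2. The iteration cap, the eps guard and the stopping tolerance of 0.01 follow the algorithm described earlier; everything else is an illustrative choice.

% Fuzzy C-means without the toolbox (1-D data, m = 2); a sketch, not the slide's code
% (uses implicit expansion, MATLAB R2016b or newer)
X = [3 7 10 17 18 20];                 % 1 x N data
C = 2; m = 2; N = numel(X);
U = rand(C, N); U = U ./ sum(U, 1);    % random initial memberships, columns sum to 1
for it = 1:100
    W = U .^ m;
    c = (W * X') ./ sum(W, 2);         % centroids: c_j = sum_i u_ij^m x_i / sum_i u_ij^m
    D = abs(c - X) + eps;              % C x N matrix of |x_i - c_j| (eps avoids division by zero)
    Dinv = D .^ (-2 / (m - 1));
    Unew = Dinv ./ sum(Dinv, 1);       % u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    done = max(abs(Unew(:) - U(:))) < 0.01;
    U = Unew;
    if done, break; end
end
disp(c'); disp(U)                      % final centroids and membership matrix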

Clustering validity problem

• Problem 1
– A problem we face in clustering is deciding the optimal number of clusters that fits a data set.
• Problem 2
– The various clustering algorithms behave in different ways depending on
  – the features of the data set (geometry and density distribution of clusters)
  – the input parameter values (e.g. for K-means, the initial cluster choices influence the result)
• So, how do we know which clustering method is better/more suitable?
• We need a clustering quality criterion.

Lecture 7 slides for CC282 Machine Learning, R. Palaniappan, 2008

Clustering validity problem

• In general, good clusters should have
– high intra-cluster similarity, i.e. low variance among intra-cluster members,
  where the variance of x is defined by
  var(x) = (1 / (N − 1)) · Σ(i=1..N) (xi − x̄)²
  with x̄ the mean of x.
• For example: if x = [2 4 6 8], then x̄ = 5, so var(x) = 6.67.
• Computing intra-cluster similarity is simple.
• For example, for the clusters shown, var(cluster1) = 2.33 while var(cluster2) = 12.33.
• So, cluster 1 is better than cluster 2.
• Note: use the 'var' function in MATLAB to compute variance (see the quick check below).

Lecture 7 slides for CC282 Machine Learning, R. Palaniappan, 2008
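A quick check with the base MATLAB 'var' function, which uses the 1/(N − 1) normalisation above. The cluster contents in the last two lines are an assumption, read off the data X = [3 7 10 17 18 20] used in the other examples of these slides.

% Verifying the variance figures
var([2 4 6 8])       % 6.67, the example above (mean 5)
var([17 18 20])      % 2.33, the tighter cluster
var([3 7 10])        % 12.33, the looser cluster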

Clustering Quality Criteria

• But this does not tell us anything about how good the overall clustering is, or about the suitable number of clusters!
• To solve this, we also need to compute the inter-cluster variance.
• Good clusters will also have low inter-cluster similarity, i.e. high variance among inter-cluster members, in addition to high intra-cluster similarity, i.e. low variance among intra-cluster members.
• One good measure of clustering quality is the Davies-Bouldin (DB) index.
• Others are:
– Dunn's Validity Index
– Silhouette method
– C-index
– Goodman–Kruskal index
• So, we compute the DB index for different numbers of clusters K, and the best value of the DB index tells us the appropriate K value, or how good the clustering method is.

Davies-Bouldin index

• It is a function of the ratio of the sum of within-cluster (i.e. intra-cluster) scatter to between-cluster (i.e. inter-cluster) separation.
• Because low scatter and a high distance between clusters both lead to low values of Rij, a minimization of the DB index is desired.
• Let C = {C1, ..., Ck} be a clustering of a set of N objects:
  DB = (1/k) · Σ(i=1..k) Ri
  with Ri = max(j=1..k, j≠i) Rij and
  Rij = ( var(Ci) + var(Cj) ) / ||ci − cj||
  where Ci is the i-th cluster and ci is the centroid of cluster i.
• The numerator of Rij is a measure of intra-cluster similarity, while the denominator is a measure of inter-cluster separation.
• Note: Rij = Rji (a small MATLAB sketch of the index follows).
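The definition translates into a few lines of MATLAB. The helper below is an illustrative sketch, not a toolbox function; the name 'dbindex' and the cell-array input format (one cell per cluster of a one-dimensional data set) are assumptions made here.

function db = dbindex(clusters)
% DBINDEX  Davies-Bouldin index for 1-D clusters supplied as a cell array (sketch).
    k = numel(clusters);
    v = cellfun(@var, clusters);       % within-cluster scatter (variance of each cluster)
    c = cellfun(@mean, clusters);      % cluster centroids
    R = zeros(k);
    for i = 1:k
        for j = 1:k
            if i ~= j
                R(i, j) = (v(i) + v(j)) / abs(c(i) - c(j));   % R_ij
            end
        end
    end
    db = mean(max(R, [], 2));          % DB = (1/k) * sum_i max_{j ~= i} R_ij
end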

Davies-Bouldin index example

• For example, for the clusters shown: C1 = {3}, C2 = {7, 10}, C3 = {17, 18, 20}.
• Compute Rij = ( var(Ci) + var(Cj) ) / ||ci − cj||.
• var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33.
  (Note: the variance of a single element is zero, and its centroid is simply the element itself.)
• The centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33.
• So R12 = (0 + 4.5) / 5.5 ≈ 0.82, R13 = (0 + 2.33) / 15.33 ≈ 0.15, R23 = (4.5 + 2.33) / 9.83 ≈ 0.69.
• Now compute Ri = max(j≠i) Rij:
  R1 ≈ 0.82 (max of R12 and R13); R2 ≈ 0.82 (max of R21 and R23); R3 ≈ 0.69 (max of R31 and R32).
• Finally, compute DB = (1/k) · Σi Ri:
  DB ≈ 0.78.

Lecture 7 slides for CC282 Machine Learning, R. Palaniappan, 2008

Davies-Bouldin index example (ctd)

• For example, for the clusters shown (only 2 clusters here): C1 = {3, 7, 10}, C2 = {17, 18, 20}.
• Compute Rij = ( var(Ci) + var(Cj) ) / ||ci − cj||.
• var(C1) = 12.33 while var(C2) = 2.33; c1 = 6.67 while c2 = 18.33.
• R12 = (12.33 + 2.33) / 11.67 ≈ 1.26.
• Now compute Ri = max(j≠i) Rij: since we have only 2 clusters here, R1 = R12 = 1.26 and R2 = R21 = 1.26.
• Finally, compute DB = (1/k) · Σi Ri: DB = 1.26 (both partitions are checked with the dbindex sketch below).
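Using the dbindex sketch from the definition slide on the two partitions of X = [3 7 10 17 18 20] above (again, only a verification sketch):

dbindex({3, [7 10], [17 18 20]})       % K = 3 partition, about 0.78
dbindex({[3 7 10], [17 18 20]})        % K = 2 partition, about 1.26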

Davies-Bouldin index example (ctd)

• DB with 2 clusters ≈ 1.26; DB with 3 clusters ≈ 0.78.
• So, K = 3 is better than K = 2 (the smaller the DB index, the better the clusters).
• In general, we repeat the DB index computation for all cluster sizes from 2 to N − 1.
• So, if we have 10 data items, we do the clustering with K = 2, ..., 9 and then compute DB for each value of K.
– K = 10 is not done, since then each item is its own cluster.
• Then, we decide that the best clustering size (and the best set of clusters) is the one with the minimum value of the DB index.

Study Guide

At the end of this lecture, you should be able to:
• Define clustering
• Name major clustering approaches and differentiate between them
• State the K-means algorithm and apply it to a given data set
• State the hierarchical clustering algorithm and apply it to a given data set
• Compare exclusive (partitioning) and agglomerative clustering methods
• Identify major problems with clustering techniques
• Define and use cluster validity measures such as the DB index on a given data set

Lecture 7 slides for CC282 Machine Learning, R. Palaniappan, 2008

Q & A