CLUSTERING
Komate Amphawan
1
Introduction
• Until now, we've assumed our training samples are “labeled” by their category membership.
• Methods that use labeled samples are said to be supervised; otherwise, they're said to be unsupervised.
• However:
  – Why would one even be interested in learning with unlabeled samples?
  – Is it even possible in principle to learn anything of value from unlabeled samples?
2
Why Unsupervised Learning? [1]
• Collecting and labeling a large set of sample
patterns can be surprisingly costly.
  – E.g., videos are virtually free, but accurately labeling the video pixels is expensive and time consuming.
3
Why Unsupervised Learning? [2]
• Extend to a larger training set by using semi-supervised learning.
  – Train a classifier on a small set of samples, then tune it up to make it run without supervision on a large, unlabeled set.
  – Or, in the reverse direction, let a large set of unlabeled data group automatically, then label the groupings found.
4
Why Unsupervised Learning? [3]
• To detect the gradual change of patterns over time.
• To find features that will then be useful for categorization.
• To gain insight into the nature or structure of the data during the early stages of an investigation.
5
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
6
Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets
7
What is not Cluster Analysis?
• Supervised classification
  – Has class label information
• Simple segmentation
  – Dividing students into different registration groups alphabetically, by last name
• Results of a query
  – Groupings are a result of an external specification
• Graph partitioning
  – Some mutual relevance and synergy, but the areas are not identical
8
Why data clustering?
• Natural Classification: degree of similarity among forms.
• Data exploration: discover underlying structure, generate hypotheses, detect anomalies.
• Compression: for organizing data.
• Applications: can be used by any scientific field that collects data!
• Google Scholar: 1500 clustering papers in 2007 alone!
9
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988
E.g.: Structure Discovery via Clustering
10
http://clusty.com
E.g.: Topic Discovery
• 800,000 scientific papers clustered into 776 topics based on how often the papers were cited together by authors of other papers
11
Map of Science, Nature, 2006
Clustering for Data Understanding and Applications [1]
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation database
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
12
Clustering for Data Understanding and Applications [2]
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
• Climate: understanding earth climate, find patterns of atmospheric and ocean
• Economic science: market research
13
Clustering as a Preprocessing Tool (Utility)
• Summarization
  – Preprocessing for regression, PCA, classification, and association analysis
• Compression
  – Image processing: vector quantization
• Finding K-nearest neighbors
  – Localizing search to one or a small number of clusters
• Outlier detection
  – Outliers are often viewed as those "far away" from any cluster
14
The Problem of Clustering
• Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.
15
16
Notion of a Cluster can be Ambiguous
17
Similarity/Dissimilarity metric
18
Properties of distance metrics
19
TYPES OF CLUSTERING
20
Types of Clusterings
21
Partitional Clustering
22
Hierarchical Clustering
23
Other Distinctions Between Sets of Clusters [1]
• Exclusive versus non-exclusive
  – In non-exclusive clusterings, points may belong to multiple clusters.
  – Can represent multiple classes or 'border' points
• Fuzzy versus non-fuzzy
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  – Weights must sum to 1
  – Probabilistic clustering has similar characteristics
24
Other Distinctions Between Sets of Clusters [2]
• Partial versus complete
  – In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, and densities
25
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
• Described by an Objective Function
26
Types of Clusters: Well-Separated
• Well-Separated Clusters:
  – A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
27
Types of Clusters: Center-Based
• Center-based
  – A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster.
  – The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.
28
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or Transitive)
  – A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
29
Types of Clusters: Density-Based
• Density-based
  – A cluster is a dense region of points, which is separated from other regions of high density by low-density regions.
  – Used when the clusters are irregular or intertwined, and when noise and outliers are present.
30
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
  – Finds clusters that share some common property or represent a particular concept.
31
Types of Clusters: Objective Function
• Clusters Defined by an Objective Function
  – Finds clusters that minimize or maximize an objective function.
  – Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function. (NP-hard)
32
Types of Clusters: Objective Function …
  – Can have global or local objectives.
    • Hierarchical clustering algorithms typically have local objectives
    • Partitional algorithms typically have global objectives
  – A variation of the global objective function approach is to fit the data to a parameterized model.
    • Parameters for the model are determined from the data.
    • Mixture models assume that the data is a 'mixture' of a number of statistical distributions.
33
Types of Clusters: Objective Function …
• Map the clustering problem to a different domain and solve a related problem in that domain
  – The proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points
  – Clustering is equivalent to breaking the graph into connected components, one for each cluster.
  – Want to minimize the edge weight between clusters and maximize the edge weight within clusters
34
Characteristics of the Input Data Are Important [1]
• Type of proximity or density measure
  – This is a derived measure, but central to clustering
• Sparseness
  – Dictates the type of similarity
  – Adds to efficiency
• Attribute type
  – Dictates the type of similarity
35
Characteristics of the Input Data Are Important [2]
• Type of Data
  – Dictates the type of similarity
  – Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution
36
CLUSTERING ALGORITHMS
37
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering
38
39
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
40
• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
41
K-means Clustering – Details
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to 'Until relatively few points change clusters'
• Complexity is O(n * K * I * d)
  – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
42
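To make the basic algorithm concrete, here is a minimal NumPy sketch of the loop described above: pick random initial centroids, assign each point to its closest centroid, recompute each centroid as the mean of its points, and stop when the centroids no longer move. The function name and defaults are illustrative choices, not part of the original slides.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means on an (n, d) array X with k clusters (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(max_iter):
        # assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return labels, centroids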
EXAMPLE (ASK TEACHER TO EXPLAIN)
43
44
45
46
47
Example: image segmentation
• K-means cluster all pixels using their colour vectors (3D)
• assign pixels to their clusters
• colour pixels by their cluster assignment
48
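As an illustration of the recipe above (not from the slides), the following sketch uses scikit-learn's KMeans to cluster the pixel colour vectors and recolour each pixel with its cluster centroid; the function name and k = 4 are assumed for the example.

import numpy as np
from sklearn.cluster import KMeans

def segment_by_colour(img, k=4, seed=0):
    """img: (H, W, 3) uint8 RGB image; returns the recoloured image and the label map."""
    h, w, _ = img.shape
    pixels = img.reshape(-1, 3).astype(float)                      # one 3-D colour vector per pixel
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    recoloured = km.cluster_centers_[km.labels_].reshape(h, w, 3)  # colour pixels by their cluster
    return recoloured.astype(np.uint8), km.labels_.reshape(h, w)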
Example application 2: face clustering
• Determine the principal cast of a feature film
• Approach: view this as a clustering problem on faces
• Algorithm outline
  – Detect faces for every fifth frame in the movie
  – Describe the face by a vector of intensities
  – Cluster using a K-means algorithm
49
Example – "Groundhog Day", 2000 frames
50
51
Two different K-means Clusterings
52
Importance of Choosing Initial Centroids
53
Importance of Choosing Initial Centroids
54
Evaluating K-means Clusters [1]
• The most common measure is the Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster centroid
  – To get SSE, we square these errors and sum them:
    SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(mi, x)²
  – x is a data point in cluster Ci and mi is the representative point for cluster Ci
    • One can show that mi corresponds to the center (mean) of the cluster
55
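The SSE above translates directly into a few lines of NumPy (a sketch; the function name is illustrative):

import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of each point to the centroid of its assigned cluster."""
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centroids))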
Evaluating K-means Clusters [2]
  – Given two clusterings, we can choose the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
    • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
56
Importance of Choosing Initial Centroids …
57
Importance of Choosing Initial Centroids …
58
Importance of Choosing Initial Centroids …
• For example, if K = 10, then probability = 10!/10^10 ≈ 0.00036
• Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't
• Consider an example of five pairs of clusters
59
Problems with Selecting Initial Points
• If there are K 'real' clusters then the chance of selecting one centroid from each cluster is small.
  – The chance is relatively small when K is large
  – If the clusters are the same size, n, then
    P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids)
      = (K! · n^K) / (Kn)^K = K! / K^K
60
10 Clusters Example
61
10 Clusters Example
62
10 Clusters Example: another initialization
63
10 Clusters Example: another initialization
64
Solutions to Initial Centroids Problem
• Multiple runs
  – Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than K initial centroids and then select among these initial centroids
65
Solutions to Initial Centroids Problem
• Select the most widely separated
• Postprocessing
• Bisecting K-means
  – Not as susceptible to initialization issues
66
Handling Empty Clusters
• The basic K-means algorithm can yield empty clusters
• Several strategies:
  – Choose the point that contributes most to SSE
  – Choose a point from the cluster with the highest SSE
  – If there are several empty clusters, the above can be repeated several times.
67
Updating Centers Incrementally
68
Pre-processing and Post-processing
69
Bisecting K-means
• Extension of the basic K-means algorithm
• Basic idea: Initially split the data into two clusters, then further split one of the clusters, and so on, until there are K clusters
• Side-product: results in hierarchical clusters
70
Bisecting K-Means Algorithm
• Initialization: Set of clusters contains one cluster with all points
• Repeat until the list of clusters contains K clusters:
    Remove a cluster from the list
    For a number of trials do:
        Bisect the cluster with basic K-means
    Select the bisection with the lowest total SSE
    Add both clusters to the list of clusters
71
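A rough sketch of this loop, using scikit-learn's KMeans for the trial bisections and the largest-SSE rule (discussed on a later slide) for picking the cluster to split; the function name and the number of trials are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters remain."""
    clusters = [X]                                     # start with one cluster holding all points
    while len(clusters) < k:
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))    # bisect the cluster with the largest SSE
        best_labels, best_sse = None, np.inf
        for t in range(n_trials):                      # several trial bisections with basic K-means
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(target)
            if km.inertia_ < best_sse:                 # keep the bisection with the lowest total SSE
                best_labels, best_sse = km.labels_, km.inertia_
        clusters.append(target[best_labels == 0])
        clusters.append(target[best_labels == 1])
    return clusters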
Bisecting K-means Example
72
Bisecting K-Means Example
73
Bisecting K-Means
• Which cluster should be selected for bisection?
  – The cluster with the largest SSE
  – The largest cluster (in terms of number of points)
• The 'trials' in the bisecting K-means algorithm try different seed initializations (see basic K-means)
74
Limitations of K-means
75
Limitations of K-means: Differing Sizes
76
Limitations of K-means: Differing Density
77
Limitations of K-means: Non-globular Shapes
78
Overcoming K-means Limitations
79
Overcoming K-means Limitations
80
Overcoming K-means Limitations
81
K-Means: Pros and Cons
• K-means is a simple clustering algorithm
• Runs efficiently
• Susceptible to initialization (but bisecting K-means overcomes this to some extent)
• Limitations with respect to differing sizes and densities and non-globular shapes
• Difficult to predict the best value for K in advance
82
Hierarchical Clustering
83
Dendrogram (ctd)
• While merging clusters one by one, we can draw a tree diagram known as a dendrogram
• Dendrograms are used to represent agglomerative clustering
• From dendrograms, we can get any number of clusters
• E.g.: say we wish to have 2 clusters, then cut the top one link
  – Cluster 1: q, r
  – Cluster 2: x, y, z, p
• Similarly, for 3 clusters, cut the 2 top links
  – Cluster 1: q, r
  – Cluster 2: x, y, z
  – Cluster 3: p
84
A dendrogram example
Strengths of Hierarchical Clustering
85
Hierarchical Clustering
86
Hierarchical clustering - algorithm
• This hierarchical clustering algorithm is a type of agglomerative clustering
• Given a set of N items to be clustered, the hierarchical clustering algorithm:
  1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item
  2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster
  3. Compute pairwise distances between the new cluster and each of the old clusters
  4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N
  5. Draw the dendrogram; with the complete hierarchical tree, if you want K clusters you just have to cut the K-1 top links
Note: any distance measure can be used: Euclidean, Manhattan, etc.
Hierarchical clustering – an example
• Assume X = [3 7 10 17 18 20]
1. There are 6 items, so create 6 clusters initially
2. Compute pairwise distances of clusters (assume Manhattan distance)
   The closest clusters are 17 and 18 (with distance = 1), so merge these two clusters together
3. Repeat step 2 (assume single linkage):
   The closest clusters are cluster{17, 18} and cluster{20} (with distance |18-20| = 2), so merge these two clusters together
Hierarchical clustering – an example (ctd)
• Go on repeating cluster mergers until one big cluster remains
• Draw the dendrogram (draw it in reverse of the cluster mergers) – remember the height of each cluster corresponds to the manner of cluster agglomeration
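The same example can be reproduced with SciPy's hierarchical clustering routines (SciPy is not mentioned in the slides; this is just an illustrative sketch using single linkage and Manhattan distance).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([3, 7, 10, 17, 18, 20], dtype=float).reshape(-1, 1)

# single-linkage agglomerative clustering with Manhattan (cityblock) distance
Z = linkage(X, method='single', metric='cityblock')

# cut the tree into, e.g., 3 clusters (equivalent to cutting the 2 top links)
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)          # cluster label for each of the 6 items

# dendrogram(Z) draws the merge tree (requires matplotlib)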
89
Starting Situation
90
Intermediate Situation
91
Intermediate Situation
92
After Merging
93
How to Define Inter-Cluster Similarity
94
How to Define Inter-Cluster Similarity
95
How to Define Inter-Cluster Similarity
96
How to Define Inter-Cluster Similarity
97
How to Define Inter-Cluster Similarity
98
Cluster Similarity: MIN or Single Link
99
Hierarchical Clustering: MIN
100
Strength of MIN
101
Limitations of MIN
102
Cluster Similarity: MAX or Complete Linkage
103
Hierarchical Clustering: MAX
104
Strength of MAX
105
Limitations of MAX
106
Cluster Similarity: Group Average
107
Hierarchical Clustering: Group Average
108
Hierarchical Clustering: Group Average
• Compromise between single and complete link
• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters
109
Comparing agglomerative vs exclusive (partitioning) clustering
• Agglomerative – advantages
  – Preferable for detailed data analysis
  – Provides more information than exclusive clustering
  – We can decide on any number of clusters without the need to redo the algorithm – in exclusive clustering, K has to be decided first; if a different K is used, then the whole exclusive clustering algorithm needs to be redone
  – One unique answer
• Disadvantages
  – Less efficient than exclusive clustering
  – No backtracking, i.e. can never undo previous steps
Cluster Similarity: Ward’s Method
• Similarity of two clusters is based on the increase in squared error when the two clusters are merged
  – Similar to group average if the distance between points is the squared distance
• Less susceptible to noise and outliers
• Biased towards globular clusters
• Hierarchical analogue of K-means
  – Can be used to initialize K-means
111
Hierarchical Clustering: Comparison
112
Hierarchical Clustering: Time and Space requirements
113
Hierarchical Clustering: Problems and Limitations
114
MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
  – Start with a tree that consists of any point
  – In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not
  – Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters
Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  – Discover clusters of arbitrary shape
  – Handle noise
  – One scan
  – Need density parameters as a termination condition
• Several interesting studies:
  – DBSCAN: Ester, et al. (KDD'96)
  – OPTICS: Ankerst, et al. (SIGMOD'99)
  – DENCLUE: Hinneburg & Keim (KDD'98)
  – CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
117
Density-Based Clustering: Basic Concepts
• Two parameters:
  – Eps: maximum radius of the neighbourhood
  – MinPts: minimum number of points in an Eps-neighbourhood of that point
• N_Eps(q) = {p belongs to D | dist(p, q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  – p belongs to N_Eps(q)
  – core point condition: |N_Eps(q)| ≥ MinPts
(Figure: example with MinPts = 5, Eps = 1 cm)
118
Density-Reachable and Density-Connected
• Density-reachable:
  – A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i
• Density-connected:
  – A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
119
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5)
120
DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps
    • These are points that are in the interior of a cluster
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
• If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, the complexity is O(n²).
123
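A compact NumPy sketch of this procedure, using the full O(n²) distance matrix (the no-spatial-index case); function and variable names are illustrative rather than a reference implementation.

import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN over an (n, d) array X; returns labels, with -1 meaning noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # all pairwise distances
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]     # Eps-neighbourhood of each point
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:        # not a core point: stays noise unless claimed as border
            continue
        labels[i] = cluster_id                 # i is a core point: start a new cluster
        seeds = list(neighbors[i])
        while seeds:                           # expand through all density-reachable points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id         # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    seeds.extend(neighbors[j]) # j is itself a core point: keep expanding from it
        cluster_id += 1
    return labels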
DBSCAN: Sensitive to Parameters
124
DBSCAN: Core, Border and Noise Points
(Figure: original points and the point types – core, border, and noise – with Eps = 10, MinPts = 4)
When DBSCAN Works Well
Original Points Clusters
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
(Figure: original points and the DBSCAN clusterings for MinPts = 4 with Eps = 9.75 and Eps = 9.92)
• Varying densities
• High-dimensional data
DBSCAN: Determining EPS and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor
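A small sketch of that plot (assuming NumPy and matplotlib; k = 4 is just an example value).

import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    """Plot the sorted distance of every point to its k-th nearest neighbour (used to pick Eps)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)                        # column 0 is the distance of each point to itself (0)
    kth = np.sort(D[:, k])                # distance to the k-th nearest neighbour, sorted
    plt.plot(kth)
    plt.xlabel('points sorted by distance')
    plt.ylabel(str(k) + '-th nearest neighbour distance')
    plt.show()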
Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
• But "clusters are in the eye of the beholder"!
• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters
Clusters found in Random Data
(Figure: random points, and the clusters found in them by DBSCAN, K-means, and complete link)
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
   – Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
  – External Index: Used to measure the extent to which cluster labels match externally supplied class labels.
    • Entropy
  – Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
    • Sum of Squared Error (SSE)
  – Relative Index: Used to compare two different clusterings or clusters.
    • Often an external or internal index is used for this function, e.g., SSE or entropy
• Sometimes these are referred to as criteria instead of indices
  – However, sometimes criterion is the general strategy and index is the numerical measure that implements the criterion.
Measuring Cluster Validity Via Correlation
• Two matrices
  – Proximity matrix
  – "Incidence" matrix
    • One row and one column for each data point
    • An entry is 1 if the associated pair of points belongs to the same cluster
    • An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
  – Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated.
• High correlation indicates that points that belong to the same cluster are close to each other.
• Not a good measure for some density- or contiguity-based clusters.
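A sketch of this computation in NumPy (function name illustrative). Note that when the proximity matrix holds distances, a good clustering gives a strongly negative correlation, as in the example values on the next slide.

import numpy as np

def incidence_proximity_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the cluster incidence matrix."""
    prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    incidence = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster, else 0
    iu = np.triu_indices(len(X), k=1)       # only the n(n-1)/2 entries above the diagonal
    return np.corrcoef(prox[iu], incidence[iu])[0, 1]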
Measuring Cluster Validity Via Correlation
• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
(Figure: two scatter plots with Corr = -0.9235 and Corr = -0.5810)
Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and inspect visually.
(Figure: a scatter plot of well-separated clusters and the corresponding similarity matrix ordered by cluster label)
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
(Figure: DBSCAN clusters in random data and the corresponding similarity matrix)
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
(Figure: K-means clusters in random data and the corresponding similarity matrix)
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
(Figure: complete-link clusters in random data and the corresponding similarity matrix)
Using Similarity Matrix for Cluster Validation
(Figure: DBSCAN clusters and the corresponding similarity matrix for a more complicated data set)
• Clusters in more complicated figures aren’t well separated
Internal Measures: SSE
• Internal Index: Used to measure the goodness of a clustering structure without respect to external information
  – SSE
• SSE is good for comparing two clusterings or two clusters (average SSE).
• Can also be used to estimate the number of clusters
(Figure: a data set and the curve of SSE versus the number of clusters K)
Internal Measures: SSE
• SSE curve for a more complicated data set
(Figure: a more complicated data set and the SSE of the clusters found using K-means)
Framework for Cluster Validity
• Need a framework to interpret any measure.
  – For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity
  – The more "atypical" a clustering result is, the more likely it represents valid structure in the data
  – Can compare the values of an index that result from random data or clusterings to those of a clustering result.
    • If the value of the index is unlikely, then the cluster results are valid
  – These approaches are more complicated and harder to understand.
• For comparing the results of two different sets of cluster analyses, a framework is less necessary.
  – However, there is the question of whether the difference between two index values is significant
Statistical Framework for SSE
• Example
  – Compare an SSE of 0.005 against three clusters in random data
  – The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2 – 0.8 for x and y values
(Figure: histogram of the SSE values for the random data sets, and the original data set)
Statistical Framework for Correlation
• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
(Figure: two scatter plots with Corr = -0.9235 and Corr = -0.5810)
Internal Measures: Cohesion and Separation
• Cluster Cohesion: Measures how closely related the objects in a cluster are
  – Example: SSE
• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
  – Cohesion is measured by the within-cluster sum of squares (SSE):
    WSS = Σ_i Σ_{x ∈ Ci} (x - mi)²
  – Separation is measured by the between-cluster sum of squares:
    BSS = Σ_i |Ci| (m - mi)²
  – where |Ci| is the size of cluster i, mi is the centroid of cluster i, and m is the overall mean
Internal Measures: Cohesion and Separation
• Example: SSE
  – BSS + WSS = constant
• Data: the points 1, 2, 4, 5 on a line (overall mean m = 3)
• K = 2 clusters ({1, 2} with centroid m1 = 1.5 and {4, 5} with centroid m2 = 4.5):
  WSS = (1-1.5)² + (2-1.5)² + (4-4.5)² + (5-4.5)² = 1
  BSS = 2 × (3-1.5)² + 2 × (4.5-3)² = 9
  Total = 1 + 9 = 10
• K = 1 cluster (centroid m = 3):
  WSS = (1-3)² + (2-3)² + (4-3)² + (5-3)² = 10
  BSS = 4 × (3-3)² = 0
  Total = 10 + 0 = 10
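A short NumPy sketch that reproduces the worked example above (function and variable names are illustrative).

import numpy as np

def wss_bss(X, labels):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squares."""
    m = X.mean(axis=0)                               # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        mi = pts.mean(axis=0)                        # cluster centroid
        wss += ((pts - mi) ** 2).sum()
        bss += len(pts) * ((m - mi) ** 2).sum()
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])
print(wss_bss(X, np.array([0, 0, 1, 1])))    # K = 2: (1.0, 9.0)
print(wss_bss(X, np.array([0, 0, 0, 0])))    # K = 1: (10.0, 0.0)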
Internal Measures: Cohesion and Separation
• A proximity-graph-based approach can also be used for cohesion and separation.
  – Cluster cohesion is the sum of the weights of all links within a cluster.
  – Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
(Figure: cohesion and separation illustrated on a proximity graph)
Internal Measures: Silhouette Coefficient
• The silhouette coefficient combines the ideas of cohesion and separation, but for individual points, as well as clusters and clusterings
• For an individual point i:
  – Calculate a = average distance of i to the points in its cluster
  – Calculate b = min (average distance of i to points in another cluster)
  – The silhouette coefficient for the point is then given by
    s = 1 - a/b if a < b (or s = b/a - 1 if a ≥ b, not the usual case)
  – Typically between 0 and 1.
  – The closer to 1, the better.
• Can calculate the average silhouette width for a cluster or a clustering
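A per-point sketch of this definition in NumPy (illustrative only; it assumes the point's own cluster has at least two members).

import numpy as np

def silhouette(X, labels, i):
    """Silhouette coefficient of point i: s = 1 - a/b if a < b, else b/a - 1."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = (labels == labels[i])
    own[i] = False
    a = d[own].mean()                               # average distance within the point's own cluster
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return 1 - a / b if a < b else b / a - 1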
External Measures of Cluster Validity: Entropy and Purity
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
Algorithms for Clustering Data, Jain and Dubes
USER'S DILEMMA
151
User's Dilemma
• What is a cluster?
• How to define pair-wise similarity?
• Which features and normalization scheme?
• How many clusters?
• Which clustering method?
• Are the discovered clusters and partition valid?
• Does the data have any clustering tendency?
152
R. Dubes and A. K. Jain, Clustering Techniques: The User's Dilemma, Pattern Recognition, 1976
Cluster Similarity?
• Compact Clusters
  – Within-cluster distance < between-cluster distance
• Connected Clusters
  – Within-cluster connectivity > between-cluster connectivity
• Ideal cluster: compact and isolated.
153
Representation (features)?
• There is no universal representation; representations are domain dependent.
154
Good Representation
• A good representation leads to compact and isolated clusters.
155
How do we weigh the features?
• Two different meaningful groupings produced by different weighting schemes.
156
How do we decide the Number of Clusters?
• The samples are generated by 6 independent classes, yet:
157
Cluster Validity
• Clustering algorithms find clusters, even if
there are no natural clusters in the data.
158
Comparing Clustering Methods
• Which clustering algorithm is the best?
159
There's no best Clustering Algorithm!
• Each algorithm imposes a structure on data.
• Good fit between model and data ⇒ success.
160
MORE ON CLUSTERING
161
Overlapping clustering – Fuzzy C-means algorithm
• Both agglomerative and exclusive clustering allow one data point to be in one cluster only
• Fuzzy C-means (FCM) is a method of clustering which allows one piece of data to belong to more than one cluster
• In other words, each data point is a member of every cluster, but with a certain degree known as its membership value
• This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition
• Fuzzy partitioning is carried out through an iterative procedure that updates the memberships uij and the cluster centroids cj using the formulas below, where m > 1 represents the degree of fuzziness (typically, m = 2)
162
c_j = ( Σ_{i=1}^{N} u_ij^m · x_i ) / ( Σ_{i=1}^{N} u_ij^m )

u_ij = 1 / Σ_{k=1}^{C} ( ||x_i - c_j|| / ||x_i - c_k|| )^( 2/(m-1) )
Overlapping clusters?
• Using both agglomerative and exclusive clustering methods, data X1 will be a member of cluster 1 only, while X2 will be a member of cluster 2 only
• However, using FCM, a data point X can be a member of both clusters
• FCM uses a distance measure too, so the further a data point is from a cluster centroid, the smaller its membership value will be
• For example, the membership value of X1 for cluster 1 is u11 = 0.73 and the membership value of X1 for cluster 2 is u12 = 0.27
• Similarly, the membership value of X2 for cluster 2 is u22 = 0.82 and the membership value of X2 for cluster 1 is u21 = 0.18
• Note: membership values are in the range 0 to 1, and the membership values of each data point across all the clusters add to 1
163
Fuzzy C-means algorithm
• Choose the number of clusters C, and m (typically 2)
1. Initialise all membership values uij randomly – matrix U(0)
2. At step k: compute the centroids cj using
   c_j = ( Σ_{i=1}^{N} u_ij^m · x_i ) / ( Σ_{i=1}^{N} u_ij^m )
3. Compute new membership values uij using
   u_ij = 1 / Σ_{k=1}^{C} ( ||x_i - c_j|| / ||x_i - c_k|| )^( 2/(m-1) )
4. Update the membership matrix: U(k) → U(k+1)
5. Repeat steps 2-4 until the change in membership values is very small, ||U(k+1) - U(k)|| < ε, where ε is some small value, typically 0.01
Note: || · || means Euclidean distance, | · | means Manhattan distance.
However, if the data is one dimensional (like the examples here), Euclidean distance = Manhattan distance.
164
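A NumPy sketch of this iteration (the function name, the random initialisation, and the small constant added to the distances to avoid division by zero are illustrative choices, not part of the slides).

import numpy as np

def fuzzy_c_means(X, C=2, m=2.0, eps=0.01, max_iter=100, seed=0):
    """Fuzzy C-means sketch: X is (N, d); returns centroids (C, d) and memberships U (C, N)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    U = rng.random((C, N))
    U = U / U.sum(axis=0)                            # step 1: random memberships, columns sum to 1
    centroids = np.zeros((C, X.shape[1]))
    for _ in range(max_iter):
        W = U ** m
        centroids = (W @ X) / W.sum(axis=1, keepdims=True)               # step 2: centroids
        dist = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=-1) + 1e-12
        U_new = dist ** (-2.0 / (m - 1.0))
        U_new = U_new / U_new.sum(axis=0)                                # step 3: memberships
        if np.abs(U_new - U).max() < eps:                                # step 5: stop on small change
            return centroids, U_new
        U = U_new                                                        # step 4: update U
    return centroids, U

# one-dimensional data like the example on the next slide
X = np.array([3.0, 7.0, 10.0, 17.0, 18.0, 20.0]).reshape(-1, 1)
centroids, U = fuzzy_c_means(X, C=2, m=2.0)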
Fuzzy C-means algorithm – an example
• X = [3 7 10 17 18 20] and assume C = 2
• Initially, set U randomly, e.g.
  U = [ 0.1 0.2 0.6 0.3 0.1 0.5
        0.9 0.8 0.4 0.7 0.9 0.5 ]
• Compute the centroids cj using the centroid formula above, with m = 2:
  c1 = 13.16; c2 = 11.81
• Compute the new membership values uij using the membership formula above, giving a new U
• Repeat the centroid and membership computations until the changes in membership values are smaller than, say, 0.01
165
Fuzzy C-means algorithm – using MATLAB
• Using the 'fcm' function in MATLAB
• The final membership values U give an indication of the similarity of each item to the clusters
• For e.g.: item 3 (value 10) is more similar to cluster 1 than to cluster 2, but item 2 (value 7) is even more similar to cluster 1
166
Fuzzy C-means algorithm – using MATLAB
• The 'fcm' function requires the Fuzzy Logic toolbox
• So, using MATLAB but without the 'fcm' function, implement the two updates directly:
  c_j = ( Σ_{i=1}^{N} u_ij^m · x_i ) / ( Σ_{i=1}^{N} u_ij^m )
  u_ij = 1 / Σ_{k=1}^{C} ( ||x_i - c_j|| / ||x_i - c_k|| )^( 2/(m-1) )
167
Clustering validity problem
• Problem 1
  – A problem we face in clustering is to decide the optimal number of clusters that fits a data set
• Problem 2
  – The various clustering algorithms behave in different ways depending on
    • the features of the data set (geometry and density distribution of clusters)
    • the input parameter values (e.g.: for K-means, the initial cluster choices influence the result)
• So, how do we know which clustering method is better/suitable?
• We need a clustering quality criterion
Clustering validity problem
• In general, good clusters should have
  – High intra-cluster similarity, i.e. low variance among intra-cluster members, where the variance of x is defined by
    var(x) = ( 1 / (N - 1) ) Σ_{i=1}^{N} (x_i - x̄)²
    with x̄ as the mean of x
• For e.g.: if x = [2 4 6 8], then x̄ = 5, so var(x) = 6.67
• Computing intra-cluster similarity is simple
• For e.g.: for the clusters shown,
  var(cluster1) = 2.33 while var(cluster2) = 12.33
• So, cluster 1 is better than cluster 2
• Note: use the 'var' function in MATLAB to compute variance
Clustering Quality Criteria
• But this does not tell us anything about how good the overall clustering is, or about the suitable number of clusters needed!
• To solve this, we also need to compute inter-cluster variance
• Good clusters will also have low inter-cluster similarity, i.e. high variance among inter-cluster members, in addition to high intra-cluster similarity, i.e. low variance among intra-cluster members
• One good measure of clustering quality is the Davies-Bouldin index
• The others are
  – Dunn's Validity Index
  – Silhouette method
  – C-index
  – Goodman-Kruskal index
• So, we compute the DB index for different numbers of clusters K, and the best value of the DB index tells us the appropriate K value, or how good the clustering method is
170
Davies-Bouldin index
• It is a function of the ratio of the sum of within-cluster (i.e. intra-cluster) scatter to between-cluster (i.e. inter-cluster) separation
• Because low scatter and a high distance between clusters lead to low values of Rij, a minimization of the DB index is desired
• Let C = {C1, ..., Ck} be a clustering of a set of N objects:
  DB = (1/k) Σ_{i=1}^{k} R_i
  with R_i = max_{j=1..k, j≠i} R_ij
  and R_ij = ( var(C_i) + var(C_j) ) / ||c_i - c_j||
  where Ci is the ith cluster and ci is the centroid of cluster i
• The numerator of Rij is a measure of intra-cluster similarity, while the denominator is a measure of inter-cluster separation
• Note: Rij = Rji
171
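A sketch of this computation for one-dimensional clusters, using the sample variance as the scatter term (function name illustrative). Applied to the two clusters whose variances and centroids appear in the continued example below, it gives DB ≈ 1.26.

import numpy as np

def davies_bouldin(clusters):
    """DB index for a list of 1-D clusters (NumPy arrays), using sample variance as the scatter."""
    k = len(clusters)
    var = [np.var(c, ddof=1) if len(c) > 1 else 0.0 for c in clusters]  # variance of one element is 0
    cen = [np.mean(c) for c in clusters]
    R = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                R[i, j] = (var[i] + var[j]) / abs(cen[i] - cen[j])
    return float(np.mean([R[i].max() for i in range(k)]))

print(davies_bouldin([np.array([3.0, 7.0, 10.0]), np.array([17.0, 18.0, 20.0])]))   # about 1.26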
Davies-Bouldin index example
• For e.g.: for the clusters shown
• Compute R_ij = ( var(C_i) + var(C_j) ) / ||c_i - c_j||
  – var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33
  – The centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33
  – Note: the variance of one element is zero, and its centroid is simply the element itself
  – So, R12 = 1, R13 = 0.152, R23 = 0.797
• Now, compute R_i = max_{j≠i} R_ij
  – R1 = 1 (max of R12 and R13); R2 = 1 (max of R21 and R23); R3 = 0.797 (max of R31 and R32)
• Finally, compute DB = (1/k) Σ_i R_i
  – DB = 0.932
172
Davies-Bouldin index example (ctd)
• For e.g.: for the clusters shown
• Compute R_ij = ( var(C_i) + var(C_j) ) / ||c_i - c_j||
  – Only 2 clusters here
  – var(C1) = 12.33 while var(C2) = 2.33; c1 = 6.67 while c2 = 18.33
  – R12 = 1.26
• Now compute R_i = max_{j≠i} R_ij
  – Since we have only 2 clusters here, R1 = R12 = 1.26; R2 = R21 = 1.26
• Finally, compute DB = (1/k) Σ_i R_i
  – DB = 1.26
173
Davies-Bouldin index example (ctd)
• DB with 2 clusters = 1.26; with 3 clusters = 0.932
• So, K = 3 is better than K = 2 (since a smaller DB means better clusters)
• In general, we will repeat the DB index computation for all cluster sizes from 2 to N-1
• So, if we have 10 data items, we will do clustering with K = 2, ..., 9 and then compute DB for each value of K
  – K = 10 is not done since each item would be its own cluster
• Then, we decide the best clustering size (and the best set of clusters) would be the one with the minimum value of the DB index
174
Study Guide
At the end of this lecture, you should be able to
• Define clustering
• Name major clustering approaches and differentiate between them
• State the K-means algorithm and apply it on a given data set
• State the hierarchical algorithm and apply it on a given data set
• Compare exclusive (partitioning) and agglomerative clustering methods
• Identify major problems with clustering techniques
• Define and use cluster validity measures such as the DB index on a given data set
Q & A
176