Unsupervised Learning I: K-Means Clustering

Reading:

Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp. 487-515, 532-541, 546-552

(http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)

Unsupervised learning = No labels on training examples!

Main approach: Clustering

Example: Optdigits data set

Optdigits features f1 f2 f3 f4 f5 f6 f7 f8

f9

x = (f1, f2, ..., f64) = (0, 2, 13, 16, 16, 16, 2, 0, 0, ...) Etc. ..

Partitional Clustering of Optdigits

[Figures: partitional clusterings of Optdigits instances, plotted along Feature 1, Feature 2, and Feature 3 of the 64-dimensional space]

Hierarchical Clustering of Optdigits

[Figures: hierarchical (nested) clusterings of Optdigits instances, plotted along Feature 1, Feature 2, and Feature 3 of the 64-dimensional space]

Issues for clustering algorithms

•  How to measure distance between pairs of instances?

•  How many clusters to create?

•  Should clusters be hierarchical? (I.e., clusters of clusters)

•  Should clustering be “soft”? (I.e., an instance can belong to different clusters, with “weighted belonging”)

Most commonly used (and simplest) clustering algorithm:

K-Means Clustering

Adapted from Andrew Moore, http://www.cs.cmu.edu/~awm/tutorials

K-means clustering algorithm

Typically, the mean of the points in a cluster is used as the centroid.

Distance metric: chosen by the user. For numerical attributes, the L2 (Euclidean) distance is often used:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
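The slides present the algorithm itself as a sequence of figures; as a rough illustration, here is a minimal NumPy sketch of the standard K-means loop (assign each point to its nearest centroid, then recompute each centroid as the mean of its points), assuming numeric data in a 2-D array X. Function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Cluster the rows of X into k clusters using Euclidean (L2) distance."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct instances from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on random 2-D data:
# X = np.random.default_rng(1).normal(size=(300, 2))
# labels, centroids = kmeans(X, k=3)
```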

Example: Image segmentation by K-means clustering by color
From http://vitroz.com/Documents/Image%20Segmentation.pdf

[Figures: segmentation results with K=5 and K=10 clusters in RGB space]
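A rough sketch of this kind of color-based segmentation, assuming scikit-learn and Pillow are available; the file name photo.jpg is a placeholder, and the linked paper's exact procedure is not reproduced here.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load the image and flatten it into one RGB vector per pixel.
img = np.asarray(Image.open("photo.jpg").convert("RGB"))
pixels = img.reshape(-1, 3).astype(float)

# Cluster the pixels in RGB space; try K=5 and K=10 as in the example.
for k in (5, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    # Replace each pixel with the color of its cluster centroid.
    segmented = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
    Image.fromarray(segmented).save(f"segmented_k{k}.png")
```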

Clustering text documents

•  A text document is represented as a feature vector of word frequencies.

•  The distance between two documents is based on the cosine of the angle between their corresponding feature vectors (cosine similarity; see the sketch below).

Figure 4. Two-dimensional map of the PMRA cluster solution, representing nearly 29,000 clusters and over two million articles.

Boyack KW, Newman D, Duhon RJ, Klavans R, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6(3): e18029. doi:10.1371/journal.pone.0018029 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018029
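Standard K-means uses Euclidean distance, but if the word-frequency vectors are normalized to unit length, the Euclidean distance between two documents is a monotone function of the cosine of the angle between them, so plain K-means then behaves like cosine-based clustering. A small sketch with scikit-learn on made-up documents; this is an assumption about how one might implement the idea, not the procedure used in the cited study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

docs = [
    "k-means clustering of image pixels",
    "hierarchical clustering of documents",
    "biomedical publications and text mining",
    "clustering millions of biomedical articles",
]

# Word-frequency vectors, L2-normalized so that Euclidean distance
# between two rows is monotone in the cosine of the angle between them.
X = normalize(CountVectorizer().fit_transform(docs), norm="l2")

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster assignment of each document
```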

Exercise 1

How to evaluate clusters produced by K-means?

•  Unsupervised evaluation

•  Supervised evaluation

Unsupervised Cluster Evaluation

We don’t know the classes of the data instances.

Let C denote a clustering (i.e., the set of K clusters that is the result of a clustering algorithm), let ci denote a cluster in C, and let |ci| denote the number of elements in ci.

How can we quantify how good a clustering is?

[Figure: an example clustering C with three clusters c1, c2, c3, where |c1| = 9, |c2| = 6, |c3| = 6]

We want to maximize the coherence of each cluster c, i.e., minimize the distance between the elements of c and the centroid μc. That is, minimize the Mean Square Error (mse):

mse(c) = \frac{\sum_{x \in c} d(x, \mu_c)^2}{|c|}

average\ mse(C) = \frac{\sum_{c \in C} mse(c)}{K}

Note: The assigned reading uses sum square error rather than mean square error.

We also want to maximize the separation between clusters. That is, maximize the Mean Square Separation (mss), the average squared distance between all distinct pairs of cluster centroids:

mss(C) = \frac{\sum_{\text{distinct pairs } i, j \in C,\ i \neq j} d(\mu_i, \mu_j)^2}{K(K-1)/2}
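A small sketch of these two unsupervised measures, assuming each cluster is given as a NumPy array of points and the centroid μc is the mean of the cluster (names are illustrative):

```python
import numpy as np

def mse(cluster):
    """Mean squared distance from each point in the cluster to its centroid."""
    centroid = cluster.mean(axis=0)
    return np.mean(np.sum((cluster - centroid) ** 2, axis=1))

def average_mse(clusters):
    """Average mse over the K clusters of the clustering C."""
    return np.mean([mse(c) for c in clusters])

def mss(clusters):
    """Mean square separation: average squared distance between all
    distinct pairs of cluster centroids (K(K-1)/2 pairs)."""
    centroids = [c.mean(axis=0) for c in clusters]
    pairs = [np.sum((centroids[i] - centroids[j]) ** 2)
             for i in range(len(centroids))
             for j in range(i + 1, len(centroids))]
    return np.mean(pairs)

# Example: three small clusters in 2-D.
# C = [np.array([[0, 0], [1, 1]]), np.array([[5, 5], [6, 5]]), np.array([[0, 6]])]
# print(average_mse(C), mss(C))
```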

Exercises 2-3

Supervised Cluster Evaluation

Suppose we know the classes of the data instances: black, blue, green.

Entropy of a cluster: the degree to which a cluster consists of objects of a single class.

entropy(c_i) = -\sum_{j=1}^{|Classes|} p_{i,j} \log_2 p_{i,j}

where p_{i,j} is the probability that a member of cluster i belongs to class j:

p_{i,j} = \frac{m_{i,j}}{m_i}

where m_{i,j} is the number of instances in cluster i with class j, and m_i is the number of instances in cluster i.

entropy(c_1) = -\left(\tfrac{3}{9}\log_2\tfrac{3}{9} + \tfrac{4}{9}\log_2\tfrac{4}{9} + \tfrac{2}{9}\log_2\tfrac{2}{9}\right) \approx 1.53

entropy(c_2) = -\left(\tfrac{1}{6}\log_2\tfrac{1}{6} + \tfrac{4}{6}\log_2\tfrac{4}{6} + \tfrac{1}{6}\log_2\tfrac{1}{6}\right) \approx 1.25

entropy(c_3) = -\left(\tfrac{4}{6}\log_2\tfrac{4}{6} + \tfrac{2}{6}\log_2\tfrac{2}{6}\right) \approx 0.92

Mean entropy of a clustering: the average entropy over all clusters in the clustering, weighted by the number of elements in each cluster:

mean\ entropy(C) = \sum_{i=1}^{K} \frac{m_i}{m}\, entropy(c_i)

where m_i is the number of instances in cluster c_i and m is the total number of instances in the clustering.


mean\ entropy(C) = \tfrac{9}{21}(1.53) + \tfrac{6}{21}(1.25) + \tfrac{6}{21}(0.92) \approx 1.28
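A small sketch of these supervised measures, representing each cluster as a list of class labels. The class counts below mirror the worked example; which color gets which count is an assumption, since only the counts matter for the entropy.

```python
import numpy as np
from collections import Counter

def entropy(cluster_labels):
    """Entropy of one cluster, given the class label of each of its members."""
    counts = np.array(list(Counter(cluster_labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mean_entropy(clusters):
    """Per-cluster entropies, weighted by cluster size, averaged over the clustering."""
    m = sum(len(c) for c in clusters)
    return sum(len(c) / m * entropy(c) for c in clusters)

# Class counts matching the worked example (color-to-count mapping is assumed).
c1 = ["black"] * 3 + ["blue"] * 4 + ["green"] * 2
c2 = ["black"] * 1 + ["blue"] * 4 + ["green"] * 1
c3 = ["black"] * 4 + ["blue"] * 2
print(entropy(c1), entropy(c2), entropy(c3))   # ≈ 1.53, 1.25, 0.92
print(mean_entropy([c1, c2, c3]))              # ≈ 1.28
```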

Entropy Exercise

What is the mean entropy of this clustering?

[Figure: a clustering of the labeled instances into clusters c1, c2, c3]

Issues for K-means

•  The algorithm is only applicable if the mean is defined.
   –  For categorical data, use K-modes: the centroid is represented by the most frequent values.

•  The user needs to specify K.

•  The algorithm is sensitive to outliers.
   –  Outliers are data points that are very far away from other data points.
   –  Outliers could be errors in the data recording or some special data points with very different values.

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

Issues for K-means: Problems with outliers

Dealing with outliers

•  One method is to remove, during the clustering process, data points that are much farther from the centroids than other data points.
   –  Expensive
   –  Not always a good idea!

•  Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
   –  Assign the remaining data points to clusters by distance or similarity comparison, or by classification.

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

Issues for K-means (cont …)

•  The algorithm is sensitive to initial seeds.

[Figure: one choice of initial seeds, marked with +]

•  If we use different seeds: good results.

[Figure: a different choice of initial seeds, marked with +]

•  K-means results can often be improved by doing several random restarts (see the sketch after this list).

•  It is often useful to select instances from the data as initial seeds.

•  The K-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt
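A minimal sketch of random restarts using scikit-learn: run K-means several times from different seeds and keep the run with the lowest sum of squared errors (scikit-learn's own n_init parameter does the same thing internally).

```python
from sklearn.cluster import KMeans

def kmeans_with_restarts(X, k, n_restarts=10, seed=0):
    """Run K-means n_restarts times; keep the run with the lowest SSE."""
    best = None
    for r in range(n_restarts):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed + r).fit(X)
        # inertia_ is the sum of squared distances of points to their centroids.
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best

# In practice, equivalent to KMeans(n_clusters=k, n_init=n_restarts).fit(X).
```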

Other Issues

•  What if a cluster is empty?
   –  Choose a replacement centroid: at random, or from the cluster that has the highest mean square error.

•  How to choose K?

•  The assigned reading discusses several methods for improving a clustering with “postprocessing”.

Choosing the K in K-Means

•  Hard problem! Often there is no “correct” answer for unlabeled data.

•  Many methods have been proposed. Here are a few:

•  Try several values of K and see which is best, via cross-validation.
   –  Metrics: mean square error, mean square separation, a penalty for too many clusters [why?]

•  Start with K = 2, then try splitting each cluster.
   –  The two new means are one sigma away from the cluster center in the direction of greatest variation.
   –  Use metrics similar to the above.

•  “Elbow” method: plot average MSE (or another metric) vs. K and choose the K at which the metric stops decreasing abruptly (see the sketch below).
   –  However, sometimes there is no clear “elbow”.

[Figure: plot of average MSE vs. K, with the “elbow” marked at the recommended K]
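A minimal sketch of the elbow method on synthetic data, assuming scikit-learn and matplotlib; the data, K range, and the way average MSE is computed (per-cluster MSE averaged over clusters, as defined earlier) are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs in 2-D.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

ks = range(1, 11)
avg_mse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Average MSE as defined earlier: mean over clusters of the
    # mean squared distance of each point to its cluster centroid.
    per_cluster = [np.mean(np.sum((X[km.labels_ == j] - km.cluster_centers_[j]) ** 2, axis=1))
                   for j in range(k)]
    avg_mse.append(np.mean(per_cluster))

plt.plot(list(ks), avg_mse, marker="o")
plt.xlabel("K")
plt.ylabel("average MSE")
plt.show()   # choose K near the "elbow", where the curve stops dropping sharply
```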