Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Department of Computer ScienceCSCI 5622: Machine Learning
Chenhao TanLecture 18: Clustering
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen
1
Learning objectives
• Learn about general clustering
• Learn about the K-Means algorithm
• Learn about Gaussian Mixture Models
2
Supervised learning
3
Unsupervised learning
Data: X Labels: Y Data: X
Latent structure: Z
Clustering
• One important unsupervised method is clustering
• Goal: Organize data in classes
4
Clustering applications – Microarray Gene Expression data
5
From: “Skin layer-specific transcriptional profiles in normal and recessive yellow (Mc1re/Mc1re) mice'' by April and Barsh in Pigment Cell Research (2006)
Clustering applications – Medical Imaging
6
Clustering applications – Community detection
7
News Media
8
Clustering
• One important unsupervised method is clustering
• Goal: Organize data in classes
• Classes are hard to define
• Different data representation may lead to different clusterings
9
Clustering
• One important unsupervised method is clustering
• Goal: Organize data in classes
• Data have high in-class similarity
• Data have low out-of-class similarity
10
Clustering - Similarity
11
Clustering - Similarity
12
K-Means
• Simplest clustering method• Iterative in nature• Reasonably fast• Very popular in practice (though with more bells and whistles)• Requires real-valued data
13
K-Means
14
K-Means
15
16
17
18
19
20
21
22
More K-means
• Animations:http://shabal.in/visuals/kmeans/4.html
23
K-Means in numbers
24
K-Means in numbers
25
K-Means in numbers
26
K-Means in numbers
27
K-Means in numbers
28
K-Means in numbers
29
K-Means in numbers
30
K-Means in numbers
31
K-Means in numbers
32
K-Means in numbers
33
K-Means in numbers
34
K-Means in numbers
35
K-Means
36
K-Means
37
K-Means
• Weaknesses• Doesn't really work with categorical data• Usually only converges to local minimum• Have to determine number of clusters• Can be sensitive to outliers• Only generates convex clusters
38
K-means - Weaknesses
• Doesn't really work with categorical data
39
K-means - Weaknesses
• Doesn't really work with categorical data • Fix: Do K-Modes instead
40
K-means - Weaknesses
• Usually only converges to local minimum
41
K-means - Weaknesses
• Usually only converges to local minimum • Fix: Do several runs with random inits. and choose best
42
K-means - Weaknesses
• Have to determine number of clusters
43
K-means - Weaknesses
• Have to determine number of clusters• Fix: Use the elbow methodRun K-Means for different values of k and look at loss function
44
45
46
47
48
49
50
Gaussian Mixture Models
51
Gaussian Mixture Models
52
Gaussian Mixture Models
53
Gaussian Mixture Models
54
Gaussian Mixture Models
55
Gaussian Mixture Models
56
Gaussian Mixture Models
57
Gaussian Mixture Models
58
Gaussian Mixture Models
59
Recap
• K-means is the most commonly used clustering algorithm• We learned the Gaussian Mixture Model’s generative story
• We will learn EM-algorithm next week
60