lec18 · 2017. 11. 9. · Title: lec18 Created Date: 11/9/2017 2:38:35 AM

Department of Computer Science
CSCI 5622: Machine Learning

Chenhao Tan
Lecture 18: Clustering

Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

Learning objectives

• Learn about general clustering

• Learn about the K-Means algorithm

• Learn about Gaussian Mixture Models

Supervised learning

• Data: X

• Labels: Y

Unsupervised learning

• Data: X

• Latent structure: Z

Clustering

• One important unsupervised method is clustering

• Goal: Organize data into classes

Clustering applications – Microarray Gene Expression data

From: “Skin layer-specific transcriptional profiles in normal and recessive yellow (Mc1re/Mc1re) mice” by April and Barsh in Pigment Cell Research (2006)

Clustering applications – Medical Imaging

Clustering applications – Community detection

News Media

Clustering



• Classes are hard to define

• Different data representations may lead to different clusterings

Clustering



• Data have high in-class similarity

• Data have low out-of-class similarity
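
As an illustration of these two properties, here is a minimal sketch that compares the average distance within clusters to the average distance between clusters. It assumes Euclidean distance as the (dis)similarity measure; the toy points, the labels, and the function names are hypothetical and not taken from the lecture.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_within_and_between(points, labels):
    """Return (mean distance within clusters, mean distance between clusters)."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # A pair counts as "within" when both points share a cluster label.
        (within if lp == lq else between).append(euclidean(p, q))
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical toy data: two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

within, between = mean_within_and_between(points, labels)
print(f"mean within-cluster distance:  {within:.2f}")
print(f"mean between-cluster distance: {between:.2f}")
```

A good clustering under this measure has a small within-cluster distance (high in-class similarity) and a large between-cluster distance (low out-of-class similarity).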

Clustering - Similarity

K-Means

• Simplest clustering method

• Iterative in nature

• Reasonably fast

• Very popular in practice (though with more bells and whistles)

• Requires real-valued data
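
The iterative nature of K-Means can be sketched in a few lines of plain Python. This is a minimal sketch of the standard iteration (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat until the assignments stop changing. The function name `kmeans`, its parameters (including the optional `init` argument), the choice to initialise from randomly chosen data points, and the toy data are assumptions made for illustration, not the lecture's own code.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, init=None, max_iters=100, seed=0):
    """Return (centroids, labels) once the assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: without an explicit init, start from k randomly chosen data points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels


# Hypothetical toy data: two groups of real-valued 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centroids, labels = kmeans(points, k=2, init=[(0.0, 0.0), (5.0, 5.0)])
print(labels)
print(centroids)
```

The update step takes a mean of coordinates, which is why the method needs real-valued data.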

More K-means

• Animations: http://shabal.in/visuals/kmeans/4.html

K-Means in numbers
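
The numbers shown on the original slides are not available here, so the following is a hypothetical stand-in: a small sketch that traces K-Means on six one-dimensional points and records the labels, centroids, and loss (sum of squared distances to the assigned centroid) after every iteration. The data, the starting centroids, and the function name `kmeans_trace_1d` are all assumptions for illustration.

```python
def kmeans_trace_1d(xs, centroids, max_iters=100):
    """Run K-Means on 1-D data; return a list of (labels, centroids, loss) per iteration."""
    centroids = list(centroids)
    history = []
    labels = None
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            for x in xs
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the centroids would not move either
        labels = new_labels
        # Update step: each centroid becomes the mean of its points.
        for j in range(len(centroids)):
            members = [x for x, l in zip(xs, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
        loss = sum((x - centroids[l]) ** 2 for x, l in zip(xs, labels))
        history.append((labels, list(centroids), loss))
    return history


# Hypothetical data and a deliberately poor start (both centroids in the left group).
xs = [1, 2, 3, 8, 9, 10]
history = kmeans_trace_1d(xs, centroids=[1, 2])
for i, (labels, cents, loss) in enumerate(history, start=1):
    print(f"iteration {i}: labels={labels} centroids={cents} loss={loss:.2f}")
```

With this start, the first iteration leaves the point 1 alone in its cluster and pulls the second centroid to 6.4; the second iteration then separates the two groups, with centroids 2.0 and 9.0, and the loss drops from about 53.2 to 4.0.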

K-Means

• Weaknesses
  • Doesn't really work with categorical data
  • Usually only converges to a local minimum
  • Have to determine the number of clusters
  • Can be sensitive to outliers
  • Only generates convex clusters

K-means - Weaknesses

• Doesn't really work with categorical data
  • Fix: Do K-Modes instead

• Usually only converges to a local minimum
  • Fix: Do several runs with random initializations and choose the best

• Have to determine the number of clusters
  • Fix: Use the elbow method: run K-Means for different values of k and look at the loss function
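
The last two fixes can be sketched together: run K-Means several times from random initialisations, keep the run with the lowest loss, and repeat this for several values of k to see where the loss stops dropping quickly. This is a minimal sketch, assuming the loss is the sum of squared distances from each point to its assigned centroid; the function names `kmeans`, `best_of_restarts`, and `elbow_losses`, their parameters, and the toy data are hypothetical.

```python
import random


def kmeans(points, k, rng, max_iters=100):
    """One K-Means run from a random initialisation; returns (centroids, labels, loss)."""
    centroids = rng.sample(points, k)  # assumption: initialise from random data points
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    loss = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[l])) for p, l in zip(points, labels)
    )
    return centroids, labels, loss


def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run K-Means n_restarts times and keep the run with the lowest loss."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])


def elbow_losses(points, ks, n_restarts=10):
    """Map each candidate k to the best loss found over the restarts."""
    return {k: best_of_restarts(points, k, n_restarts)[2] for k in ks}


# Hypothetical toy data: three groups of three points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (5.0, 5.0), (5.0, 6.0), (6.0, 5.0),
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0),
]
losses = elbow_losses(points, ks=range(1, 6))
for k, loss in losses.items():
    print(f"k={k}: loss={loss:.2f}")
```

Reading the printed losses against k, the "elbow" is the value of k after which adding more clusters gives only a small further reduction. The loss alone cannot pick k, because it reaches zero once every point has its own cluster.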

Gaussian Mixture Models
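
The recap of this lecture refers to the Gaussian Mixture Model's generative story. In its usual form, each data point is generated by first drawing a latent component z from a categorical distribution over the mixture weights, then drawing the observation x from the Gaussian that belongs to that component. The sketch below samples from a one-dimensional mixture under that assumption; the function name `sample_gmm_1d`, its parameters, and the example weights, means, and standard deviations are hypothetical.

```python
import random


def sample_gmm_1d(n, weights, means, stds, seed=0):
    """Draw n (z, x) pairs from a 1-D Gaussian mixture.

    weights: mixture weights, one per component
    means, stds: mean and standard deviation of each component's Gaussian
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: pick the latent component z according to the mixture weights.
        z = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: draw the observation x from that component's Gaussian.
        x = rng.gauss(means[z], stds[z])
        samples.append((z, x))
    return samples


# Hypothetical two-component mixture.
for z, x in sample_gmm_1d(5, weights=[0.3, 0.7], means=[-2.0, 3.0], stds=[0.5, 1.0]):
    print(f"z={z}  x={x:.3f}")
```

In clustering, only x is observed; the component z plays the role of the latent structure Z that has to be inferred.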

Recap

• K-means is the most commonly used clustering algorithm

• We learned the Gaussian Mixture Model’s generative story

• We will learn the EM algorithm next week
