
  • Department of Computer Science
    CSCI 5622: Machine Learning

    Lecture 18: Clustering
    Chenhao Tan

    Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

    1

  • Learning objectives

    • Learn about general clustering

    • Learn about the K-Means algorithm

    • Learn about Gaussian Mixture Models

    2

  • Supervised learning vs. unsupervised learning

    Supervised learning: data X, labels Y
    Unsupervised learning: data X, latent structure Z

    3

  • Clustering

    • One important unsupervised method is clustering

    • Goal: Organize data in classes

    4

  • Clustering applications – Microarray Gene Expression data

    5

    From: "Skin layer-specific transcriptional profiles in normal and recessive yellow (Mc1re/Mc1re) mice" by April and Barsh in Pigment Cell Research (2006)

  • Clustering applications – Medical Imaging

    6

  • Clustering applications – Community detection

    7

  • News Media

    8

  • Clustering

    • One important unsupervised method is clustering

    • Goal: Organize data in classes

    • Classes are hard to define

    • Different data representations may lead to different clusterings

    9

  • Clustering

    • One important unsupervised method is clustering

    • Goal: Organize data in classes

    • Data have high in-class similarity

    • Data have low out-of-class similarity

    10

  • Clustering - Similarity (slides 11-12; the similarity measures shown on these slides are not captured in this copy, see the sketch below)
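The measures on the original similarity slides are not recoverable from this copy; as a rough sketch, two common choices for real-valued data are Euclidean distance and cosine similarity (picking these two is an assumption, not taken from the slides):

```python
import numpy as np

def euclidean_distance(x, z):
    """Euclidean distance between two feature vectors (smaller = more similar)."""
    return np.sqrt(np.sum((x - z) ** 2))

def cosine_similarity(x, z):
    """Cosine of the angle between two feature vectors (1 = same direction)."""
    return np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z))

x = np.array([1.0, 2.0, 0.0])
z = np.array([2.0, 1.0, 1.0])
print(euclidean_distance(x, z))  # ~1.73
print(cosine_similarity(x, z))   # ~0.73
```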

  • K-Means

    • Simplest clustering method
    • Iterative in nature
    • Reasonably fast
    • Very popular in practice (though with more bells and whistles)
    • Requires real-valued data

    13

  • K-Means (slides 14-22: the algorithm walkthrough and iteration figures are not captured in this copy; see the sketch below)
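Since slides 14-22 are not available in this copy, here is a minimal NumPy sketch of the standard K-Means (Lloyd's) iteration; the function name, initialization scheme, and parameters are illustrative assumptions rather than details taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means: alternate between assigning points to the nearest
    centroid and recomputing each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments will no longer change
        centroids = new_centroids
    return centroids, labels

# Tiny usage example with two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, labels = kmeans(X, k=2)
```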

  • More K-means

    • Animations: http://shabal.in/visuals/kmeans/4.html

    23

  • K-Means in numbers (slides 24-35; the worked numbers on these slides are not captured in this copy, so the standard objective is given below for reference)
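For reference (this formula is standard and not recovered from the slides), K-Means minimizes the within-cluster sum of squared distances:

```latex
J(r, \mu) \;=\; \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik}\, \lVert x_i - \mu_k \rVert^2,
\qquad r_{ik} \in \{0, 1\}, \quad \sum_{k=1}^{K} r_{ik} = 1.
```

The assignment step sets r_{ik} = 1 for the centroid \mu_k nearest to x_i, and the update step sets \mu_k to the mean of the points currently assigned to cluster k; each step can only decrease J, which is why the iteration converges, though only to a local minimum.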

  • K-Means (slides 36-37: content not captured in this copy)

  • K-Means

    • Weaknesses
      • Doesn't really work with categorical data
      • Usually only converges to local minimum
      • Have to determine number of clusters
      • Can be sensitive to outliers
      • Only generates convex clusters

    38

  • K-means - Weaknesses

    • Doesn't really work with categorical data
    • Fix: Do K-Modes instead (a brief sketch follows)

    39-40
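K-Modes is not spelled out on the recovered slides; the sketch below shows the two ingredients it swaps in, assuming the usual formulation: a categorical mismatch count in place of Euclidean distance, and a per-attribute mode in place of the cluster mean (names and data are illustrative).

```python
import numpy as np
from collections import Counter

def hamming_dissimilarity(x, z):
    """Number of categorical attributes on which two records disagree."""
    return int(np.sum(x != z))

def cluster_mode(rows):
    """Per-attribute mode: the most frequent category in each column,
    used as the cluster 'centroid' in K-Modes."""
    return np.array([Counter(col).most_common(1)[0][0] for col in rows.T])

# Toy categorical data: each row is (color, shape, size).
X = np.array([
    ["red",  "round",  "small"],
    ["red",  "square", "small"],
    ["blue", "round",  "large"],
])
print(hamming_dissimilarity(X[0], X[1]))  # 1 (they differ only in shape)
print(cluster_mode(X))                    # ['red' 'round' 'small']
```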

  • K-means - Weaknesses

    • Usually only converges to local minimum
    • Fix: Do several runs with random inits. and choose best (see the sketch below)

    41-42
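A sketch of the several-runs fix, reusing the kmeans() function from the earlier sketch and scoring each run by its within-cluster sum of squares (both choices are illustrative assumptions, not from the slides):

```python
import numpy as np

def kmeans_best_of(X, k, n_restarts=10):
    """Run K-Means from several random initializations and keep the run
    with the lowest within-cluster sum of squared distances."""
    best = None
    for seed in range(n_restarts):
        centroids, labels = kmeans(X, k, seed=seed)  # kmeans() from the sketch above
        inertia = np.sum((X - centroids[labels]) ** 2)
        if best is None or inertia < best[0]:
            best = (inertia, centroids, labels)
    return best[1], best[2]
```

In practice, scikit-learn's KMeans does this automatically via its n_init parameter.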

  • K-means - Weaknesses

    • Have to determine number of clusters
    • Fix: Use the elbow method: run K-Means for different values of k and look at the loss function (see the sketch below)

    43-44
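A sketch of the elbow method with scikit-learn, where the loss is the within-cluster sum of squares exposed as inertia_ (the synthetic data and the range of k are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

losses = []
ks = range(1, 11)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    losses.append(model.inertia_)  # within-cluster sum of squared distances

plt.plot(list(ks), losses, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("K-Means loss (inertia)")
plt.title("Elbow method: pick k where the loss stops dropping sharply")
plt.show()
```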

  • (Slides 45-50: figure-only slides, not captured in this copy)

  • Gaussian Mixture Models (slides 51-59: not captured in this copy; per the recap, these cover the GMM generative story, sketched below)
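The GMM slides are not available in this copy; the generative story the recap refers to is the standard one (pick a component, then draw from its Gaussian), sampled below with illustrative parameter values that are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture parameters (illustrative values, not from the slides).
weights = np.array([0.5, 0.3, 0.2])           # pi_k: probability of component k
means = np.array([[0.0, 0.0],
                  [5.0, 5.0],
                  [0.0, 5.0]])                # mu_k
covs = np.array([np.eye(2) * s for s in (1.0, 0.5, 2.0)])  # Sigma_k

def sample_gmm(n):
    """Generative story: for each point, draw a component z ~ Categorical(pi),
    then draw x ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(weights), size=n, p=weights)              # latent assignments
    x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z

X, Z = sample_gmm(500)  # Z is the latent structure that clustering tries to recover
```

Fitting these parameters back from data alone is what the EM algorithm (next week) does; scikit-learn implements that fit as sklearn.mixture.GaussianMixture.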

  • Recap

    • K-means is the most commonly used clustering algorithm
    • We learned the Gaussian Mixture Model's generative story
    • We will learn the EM algorithm next week

    60