Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 11: Clustering Martin Russell


EE3J2 Data Mining

Lecture 11: Clustering

Martin Russell


Objectives

To explain the motivation for clustering

To introduce the ideas of distance and distortion

To describe agglomerative and divisive clustering

To explain the relationships between clustering and decision trees


Example from speech processing

[Figure: plot of high-frequency energy vs low-frequency energy for 25 ms speech segments, sampled every 10 ms; both axes run from 6 to 14]


Structure of data

Typical real data is not uniformly distributed

It has structure

Variables might be correlated

The data might be grouped into natural ‘clusters’

The purpose of cluster analysis is to find this underlying structure automatically


Clusters and centroids

If we assume that the clusters are spherical, then they are determined by their centres

The cluster centres are called centroids

How many centroids do we need?

Where should we put them?


Distance

A function d(x,y) defined on pairs of points x and y is called a distance or metric if it satisfies:

– d(x,x) = 0 for every point x

– d(x,y) = d(y,x) for all points x and y (d is symmetric)

– d(x,z) ≤ d(x,y) + d(y,z) for all points x, y and z (this is called the triangle inequality)
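The three conditions can be spot-checked numerically on a finite sample of points. The sketch below is illustrative only: the helper name `satisfies_metric_axioms`, the tolerance `tol` and the sample values are assumptions, not part of the lecture. Passing on a sample does not prove that a function is a metric, but a failure does show that it is not.

```python
from itertools import product

def satisfies_metric_axioms(points, d, tol=1e-12):
    """Spot-check the three conditions above on a finite sample of points."""
    # d(x, x) = 0 for every point x
    for x in points:
        if abs(d(x, x)) > tol:
            return False
    # d(x, y) = d(y, x) for all pairs (symmetry)
    for x, y in product(points, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:
            return False
    # d(x, z) <= d(x, y) + d(y, z) for all triples (triangle inequality)
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True

sample = [0.0, 1.0, 2.0]  # made-up 1-D points
print(satisfies_metric_axioms(sample, lambda x, y: abs(x - y)))    # True
print(satisfies_metric_axioms(sample, lambda x, y: (x - y) ** 2))  # False
```

The squared difference fails because d(0,2) = 4 exceeds d(0,1) + d(1,2) = 2, so it violates the triangle inequality.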


Example metrics

The most common metric is the Euclidean metric

In this case, if $x = (x_1, x_2, \dots, x_N)$ and $y = (y_1, y_2, \dots, y_N)$ then:

$$d(x,y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_N - y_N)^2}$$

This corresponds to the standard notion of distance in Euclidean space

There are lots of others, but focus on this one
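As a minimal sketch, the Euclidean metric can be written directly from its definition. The function name `euclidean` and the example points are illustrative choices, not taken from the lecture.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5.
print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
```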


Distortion

Distortion is a measure of how well a set of centroids models a set of data

Suppose we have:

– data points $y_1, y_2, \dots, y_T$

– centroids $c_1, \dots, c_M$

For each data point $y_t$ let $c_{i(t)}$ be the closest centroid

In other words: $d(y_t, c_{i(t)}) = \min_m d(y_t, c_m)$


Distortion

The distortion for the centroid set $C = \{c_1, \dots, c_M\}$ is defined by:

$$\mathrm{Dist}(C) = \sum_{t=1}^{T} d(y_t, c_{i(t)})$$

In other words, the distortion is the sum of distances between each data point and its nearest centroid

The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
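The definition of distortion can be sketched in a few lines, assuming the Euclidean metric as the distance d. The names `nearest_index` and `distortion` and the toy data are hypothetical, chosen only to illustrate the sum over nearest centroids.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def nearest_index(y, centroids, d=euclidean):
    """Index i(t) of the centroid closest to data point y."""
    return min(range(len(centroids)), key=lambda m: d(y, centroids[m]))

def distortion(data, centroids, d=euclidean):
    """Sum of distances between each data point and its nearest centroid."""
    return sum(d(y, centroids[nearest_index(y, centroids, d)]) for y in data)

# Made-up toy data: two points near (0, 1) and one point on a centroid.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centroids = [(0.0, 1.0), (10.0, 0.0)]
print(distortion(data, centroids))  # 1 + 1 + 0 = 2.0
```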


Types of Clustering

Initially we will look at two types of cluster analysis:

– Agglomerative clustering, or ‘bottom-up’ clustering

– Divisive clustering, or ‘top-down’ clustering


Agglomerative clustering

Agglomerative clustering begins by assuming that each data point belongs to its own, unique, one-point cluster

Clusters are then combined until the required number of clusters is obtained

The simplest agglomerative clustering algorithm is one which, at each stage, combines the two closest centroids into a single centroid
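The simplest algorithm described above can be sketched as follows. The lecture does not say how two centroids are combined into one; this sketch assumes the merged centroid is the mean of the two, weighted by the number of data points each represents. The function name `agglomerate` and the toy data are also illustrative. The closest pair is found by exhaustive search, which is slow for large data sets but keeps the idea visible.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def agglomerate(data, n_clusters):
    """Merge the two closest centroids until n_clusters centroids remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Each entry is (centroid, number of data points it represents).
    clusters = [(tuple(p), 1) for p in data]
    while len(clusters) > n_clusters:
        # Exhaustive search for the closest pair of centroids (i < j).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: euclidean(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Assumed merge rule: size-weighted mean of the two centroids.
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        del clusters[j]  # delete the larger index first so i stays valid
        del clusters[i]
        clusters.append((merged, ni + nj))
    return [c for c, _ in clusters]

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]  # made-up points
print(sorted(agglomerate(data, 2)))  # [(0.0, 0.5), (10.0, 10.5)]
```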


[Figure: original data (302 points); both axes run from 6 to 14]


[Figure: 252 centroids; both axes run from 6 to 14]


[Figure: 152 centroids; both axes run from 6 to 14]


[Figure: 52 centroids; both axes run from 6 to 14]


[Figure: 12 centroids; both axes run from 6 to 14]


Divisive Clustering

Divisive clustering begins by assuming that there is just one centroid – typically in the centre of the set of data points

That point is replaced with 2 new centroids

Then each of these is replaced with 2 new centroids …
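One way to make the splitting step concrete is sketched below. The lecture does not specify how a centroid is replaced by two new ones, so everything in this sketch is an assumption: each centroid is replaced by two copies offset by a small amount `eps` in opposite directions, every data point is assigned to its nearest copy, and each copy is then moved to the mean of the points assigned to it. The names `split_once` and `divisive` and the toy data are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def split_once(data, centroids, eps=0.01):
    """Replace each centroid with two new centroids (assumed splitting rule)."""
    candidates = []
    for c in centroids:
        candidates.append(tuple(v + eps for v in c))
        candidates.append(tuple(v - eps for v in c))
    # Assign every data point to its nearest candidate.
    groups = [[] for _ in candidates]
    for y in data:
        m = min(range(len(candidates)), key=lambda k: euclidean(y, candidates[k]))
        groups[m].append(y)
    # Move each candidate to the mean of its points; drop empty candidates.
    return [mean(g) for g in groups if g]

def divisive(data, levels):
    """Start from one centroid at the centre of the data and split `levels` times."""
    centroids = [mean(data)]
    for _ in range(levels):
        centroids = split_once(data, centroids)
    return centroids

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]  # made-up points
print(sorted(divisive(data, 1)))  # [(0.0, 1.0), (10.0, 11.0)]
```

The offset direction (the same `eps` added to every coordinate) is arbitrary; a different direction can lead to a different split.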





Decision tree interpretation

[Figure: tree diagram. Top: single centroid, whole set. Bottom: multiple centroids, one per data point. Moving from top to bottom is top-down (divisive) clustering; moving from bottom to top is bottom-up (agglomerative) clustering]


Note on optimality

An ‘optimal’ set of centroids is one which minimises the distortion

None of these methods is guaranteed to give an optimal set of centroids

Instead they give locally optimal sets of centroids

Why?


Summary

Distance metrics and distortion

Agglomerative clustering

Divisive clustering

Decision tree interpretation
