View
217
Download
2
Embed Size (px)
Citation preview
EE3J2 Data MiningSlide 1
EE3J2 Data Mining
Lecture 11: Clustering
Martin Russell
EE3J2 Data MiningSlide 2
Objectives
To explain the motivation for clustering To introduce the ideas of distance and distortion To describe agglomerative and divisive clustering To explain the relationships between clustering and
decision trees
EE3J2 Data MiningSlide 3
Example from speech processing
6
7
8
9
10
11
12
13
14
6 7 8 9 10 11 12 13 14
Plot of high-frequency energy vs low-frequency energy, for 25 ms speech
segments, sampled every 10ms
EE3J2 Data MiningSlide 4
Structure of data
Typical real data is not uniformly distrubuted It has structure Variables might be correlated The data might be grouped into natural ‘clusters’ The purpose of cluster analysis is to find this
underlying structure automatically
EE3J2 Data MiningSlide 5
Clusters and centroids
If we assume that the clusters are spherical, then they are determined by their centres
The cluster centres are called centroids
How many centroids do we need?
Where should we put them? centroids
EE3J2 Data MiningSlide 6
Distance
A function d(x,y) defined on pairs of points x and y is called a distance or metric if it satisfies:– d(x,x) = 0 for every point x
– d(x,y) = d(y,x) for all points x and y (d is symmetric)
– d(x,z) d(x,y) + d(y,z) for all points x, y and z (this is called the triangle inequality)
EE3J2 Data MiningSlide 7
Example metrics
The most common metric is the Euclidean metric In this case, if x = (x1, x2,…,xN) and y = (y1,y2,…,yN)
then:
This corresponds to the standard notion of distance in Euclidean space
There are lots of others, but focus on this one
2222
211 ..., NN yxyxyxyxd
EE3J2 Data MiningSlide 8
Distortion
Distortion is a measure of how well a set of centroids models a set of data
Suppose we have:– data points y1, y2,…,yT
– centroids c1,…,cM
For each data point yt let ci(t) be the closest centroid
In other words: d(yt, ci(t)) = minmd(yt,cm)
EE3J2 Data MiningSlide 9
Distortion
The distortion for the centroid set C = c1,…,cM is defined by:
In other words, the distortion is the sum of distances between each data point and its nearest centroid
The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
T
ttit cydCDist
1
,
EE3J2 Data MiningSlide 10
Types of Clustering
Initially we will look at two types of cluster analysis:– Agglomerative clustering, or ‘bottom-up’ clustering
– Divisive clustering, or ‘top-down’ clustering
EE3J2 Data MiningSlide 11
Agglomerative clustering
Agglomerative clustering begins by assuming that each data point belongs to its own, unique, 1 point cluster
Clusters are then combined until the required number of clusters is obtained
The simplest agglomerative clustering algorithm is one which, at each stage, combines the two closest centroids into a single centroid
EE3J2 Data MiningSlide 12
6
7
8
9
10
11
12
13
14
6 7 8 9 10 11 12 13 14
Original data (302 points)
EE3J2 Data MiningSlide 13
6
6
6
7
8
9
10
11
12
13
14
6 7 8 9 10 11 12 13 14
252 centroids
EE3J2 Data MiningSlide 14
6
6
6
7
8
9
10
11
12
13
14
6 7 8 9 10 11 12 13 14
152 centroids
EE3J2 Data MiningSlide 15
52 centroids
6
7
8
9
10
11
12
13
14
6 7 8 9 10 11 12 13 146
6
EE3J2 Data MiningSlide 16
6
6
6
7
8
9
10
11
12
13
14
6 7 8 9 10 11 12 13 14
12 centroids
EE3J2 Data MiningSlide 17
Divisive Clustering
Divisive clustering begins by assuming that there is just one centroid – typically in the centre of the set of data points
That point is replaced with 2 new centroids Then each of these is replaced with 2 new centroids …
EE3J2 Data MiningSlide 18
6
7
8
9
10
11
12
13
14
6 7 8 9 10 11 12 13 14
Original data (302 points)
EE3J2 Data MiningSlide 19
6
7
8
9
10
11
12
13
14
6 7 8 9 10 11 12 13 14
Original data (302 points)
EE3J2 Data MiningSlide 20
Decision tree interpretation
.
.
.
.
Single centroid - whole set
Multiple centroids – one per data point
Top down clustering -
divisive
Bottom up clustering -
agglomerative
EE3J2 Data MiningSlide 21
Note on optimality
An ‘optimal’ set of centroids is one which minimises the distortion
None of these methods necessarily give optimal sets of centroids
Instead they give locally optimal sets of centroids Why?
EE3J2 Data MiningSlide 22
Summary
Distance metrics and distortion Agglomerative clustering Divisive clustering Decision tree interpretation