


MS1b Statistical Machine Learning and Data Mining

Yee Whye Teh
Department of Statistics

Oxford

http://www.stats.ox.ac.uk/~teh/smldm.html


Course Information

- Course webpage: http://www.stats.ox.ac.uk/~teh/smldm.html
- Lecturer: Yee Whye Teh
- TA for Part C: Thibaut Lienant
- TA for MSc: Balaji Lakshminarayanan and Maria Lomeli
- Please subscribe to the Google Group:
  https://groups.google.com/forum/?hl=en-GB#!forum/smldm
- Sign up for the course using the sign-up sheets.


Course Structure

Lectures

- 1400-1500 Mondays in Math Institute L4.
- 1000-1100 Wednesdays in Math Institute L3.

Part C:
- 6 problem sheets.
- Classes: 1600-1700 Tuesdays (Weeks 3-8) in 1 SPR Seminar Room.
- Due Fridays the week before classes, at noon, in 1 SPR.

MSc:
- 4 problem sheets.
- Classes: Tuesdays (Weeks 3, 5, 7, 9) in 2 SPR Seminar Room.
- Group A: 1400-1500, Group B: 1500-1600.
- Due Fridays the week before classes, at noon, in 1 SPR.
- Practical: Weeks 5 and 7 (assessed) in 1 SPR Computing Lab.
- Group A: 1400-1600, Group B: 1600-1800.


Course Aims

1. Have the ability to use the relevant R packages to analyse data, interpret results, and evaluate methods.

2. Have the ability to identify and use appropriate methods and models for given data and task.

3. Understand the statistical theory framing machine learning and data mining.

4. Be able to construct appropriate models and derive learning algorithms for given data and task.


What is Machine Learning?

[Diagram: an agent receives sensory data from the world and asks: What's out there? How does the world work? What's going to happen? What should I do?]

What is Machine Learning?

[Diagram: data is turned into information, structure, predictions, decisions and actions.]

http://gureckislab.org

What is Machine Learning?

[Diagram: machine learning at the intersection of statistics, computer science, cognitive science, psychology, mathematics, engineering, operations research, physics, biology, genetics, business and finance.]

What is the Difference?

Traditional Problems in Applied Statistics
Well formulated question that we would like to answer.
Expensive to gather data and/or expensive to do computation.
Create specially designed experiments to collect high quality data.

Current Situation
Information Revolution

- Improvements in computers and data storage devices.
- Powerful data capturing devices.
- Lots of data with potentially valuable information available.


What is the Difference?

Data characteristics
- Size
- Dimensionality
- Complexity
- Messy
- Secondary sources

Focus on generalization performance
- Prediction on new data
- Action in new circumstances
- Complex models needed for good generalization.

Computational considerations
- Large scale and complex systems


Applications of Machine Learning

- Pattern Recognition
  - Sorting Cheques
  - Reading License Plates
  - Sorting Envelopes
  - Eye/Face/Fingerprint Recognition


Applications of Machine Learning

- Business applications
  - Help companies intelligently find information
  - Credit scoring
  - Predict which products people are going to buy
  - Recommender systems
  - Autonomous trading
- Scientific applications
  - Predict cancer occurrence/type and health of patients/personalized health
  - Make sense of complex physical, biological, ecological and sociological models


Further Readings, News and Applications

Links are clickable in the pdf. More recent news is posted on the course webpage.

- Leo Breiman: Statistical Modeling: The Two Cultures
- NY Times: R
- NY Times: Career in Statistics
- NY Times: Data Mining in Walmart
- NY Times: Big Data's Impact in the World
- Economist: Data, Data Everywhere
- McKinsey: Big Data: The Next Frontier for Competition
- NY Times: Scientists See Promise in Deep-Learning Programs
- New Yorker: Is "Deep Learning" a Revolution in Artificial Intelligence?


Types of Machine Learning

Unsupervised Learning
Uncover structure hidden in 'unlabelled' data.

- Given network of social interactions, find communities.
- Given shopping habits for people using loyalty cards: find groups of 'similar' shoppers.
- Given expression measurements of 1000s of genes for 1000s of patients, find groups of functionally similar genes.

Goal: Hypothesis generation, visualization.


Types of Machine Learning

Supervised Learning
A database of examples along with "labels" (task-specific).

- Given a network of social interactions along with their browsing habits, predict what news users might find interesting.
- Given expression measurements of 1000s of genes for 1000s of patients along with an indicator of absence or presence of a specific cancer, predict if the cancer is present for a new patient.
- Given expression measurements of 1000s of genes for 1000s of patients along with survival length, predict survival time.

Goal: Prediction on new examples.


Types of Machine Learning

Semi-supervised Learning
A database of examples, only a small subset of which are labelled.

Multi-task Learning
A database of examples, each of which has multiple labels corresponding to different prediction tasks.

Reinforcement Learning
An agent acting in an environment, given rewards for performing appropriate actions, learns to maximize its reward.


OxWaSP

Oxford-Warwick Centre for Doctoral Training in Statistics

- The programme aims to produce Europe's future research leaders in statistical methodology and computational statistics for modern applications.
- 10 fully-funded (UK, EU) students a year (1 international).
- Website for prospective students.
- Deadline: January 24, 2014.


Exploratory Data Analysis

Notation

- Data consist of p measurements (variables/attributes) on n examples (observations/cases).
- X is an n × p matrix with Xij := the j-th measurement for the i-th example:

        [ x11 x12 ... x1j ... x1p ]
        [ x21 x22 ... x2j ... x2p ]
        [  .   .        .      .  ]
    X = [ xi1 xi2 ... xij ... xip ]
        [  .   .        .      .  ]
        [ xn1 xn2 ... xnj ... xnp ]

- Denote the ith data item by xi ∈ Rp. (This is the transpose of the ith row of X.)
- Assume x1, ..., xn are independently and identically distributed samples of a random vector X over Rp.


Crabs Data (n = 200, p = 5)

Campbell (1974) studied rock crabs of the genus Leptograpsus. One species, L. variegatus, had been split into two new species, previously grouped by colour: orange and blue. Preserved specimens lose their colour, so it was hoped that morphological differences would enable museum material to be classified.

Data are available on 50 specimens of each sex of each species, collected on site at Fremantle, Western Australia. Each specimen has measurements on:

- the width of the frontal lobe FL,
- the rear width RW,
- the length along the carapace midline CL,
- the maximum width CW of the carapace, and
- the body depth BD in mm,

in addition to colour (species) and sex.


Crabs Data I

## load package MASS containing the data
library(MASS)
## look at data
crabs

## assign predictor and class variables
Crabs <- crabs[,4:8]
Crabs.class <- factor(paste(crabs[,1],crabs[,2],sep=""))

## various plots
boxplot(Crabs)
hist(Crabs$FL,col='red',breaks=20,xlab='Frontal Lobe Size (mm)')
hist(Crabs$RW,col='red',breaks=20,xlab='Rear Width (mm)')
hist(Crabs$CL,col='red',breaks=20,xlab='Carapace Length (mm)')
hist(Crabs$CW,col='red',breaks=20,xlab='Carapace Width (mm)')
hist(Crabs$BD,col='red',breaks=20,xlab='Body Depth (mm)')
plot(Crabs,col=unclass(Crabs.class))
parcoord(Crabs)


Crabs data

   sp sex index   FL   RW   CL   CW   BD
1   B   M     1  8.1  6.7 16.1 19.0  7.0
2   B   M     2  8.8  7.7 18.1 20.8  7.4
3   B   M     3  9.2  7.8 19.0 22.4  7.7
4   B   M     4  9.6  7.9 20.1 23.1  8.2
5   B   M     5  9.8  8.0 20.3 23.0  8.2
6   B   M     6 10.8  9.0 23.0 26.5  9.8
7   B   M     7 11.1  9.9 23.8 27.1  9.8
8   B   M     8 11.6  9.1 24.5 28.4 10.4
9   B   M     9 11.8  9.6 24.2 27.8  9.7
10  B   M    10 11.8 10.5 25.2 29.3 10.3
11  B   M    11 12.2 10.8 27.3 31.6 10.9
12  B   M    12 12.3 11.0 26.8 31.5 11.4
13  B   M    13 12.6 10.0 27.7 31.7 11.4
14  B   M    14 12.8 10.2 27.2 31.8 10.9
15  B   M    15 12.8 10.9 27.4 31.5 11.0
16  B   M    16 12.9 11.0 26.8 30.9 11.4
17  B   M    17 13.1 10.6 28.2 32.3 11.0
18  B   M    18 13.1 10.9 28.3 32.4 11.2
19  B   M    19 13.3 11.1 27.8 32.3 11.3
20  B   M    20 13.9 11.1 29.2 33.3 12.1


Univariate Boxplots

[Boxplots of FL, RW, CL, CW and BD; values range roughly from 10 to 50 mm.]

Univariate Histograms

[Histograms of Frontal Lobe Size, Rear Width, Carapace Length, Carapace Width and Body Depth (mm).]

Simple Pairwise Scatterplots

[Pairwise scatterplots of FL, RW, CL, CW and BD.]

Parallel Coordinate Plots

[Parallel coordinate plot of FL, RW, CL, CW and BD.]

Visualization and Dimensionality Reduction

These summary plots are helpful, but do not really help very much if the dimensionality of the data is high (a few dozen or thousands).

Visualizing higher-dimensional problems:

- We are constrained to view data in 2 or 3 dimensions.
- Look for 'interesting' projections of X into lower dimensions.
- Hope that for large p, considering only k ≪ p dimensions is just as informative.

Dimensionality reduction

- For each data item xi ∈ Rp, find a lower dimensional representation zi ∈ Rk with k ≪ p.
- Preserve as much as possible the interesting statistical properties/relationships of the data items.


Principal Components Analysis (PCA)

- PCA considers interesting directions to be those with greatest variance.
- A linear dimensionality reduction technique: it finds an orthogonal basis v1, v2, ..., vp for the data space such that:
  - The first principal component (PC) v1 is the direction of greatest variance of the data.
  - The second PC v2 is the direction orthogonal to v1 of greatest variance, etc.
  - The subspace spanned by the first k PCs represents the 'best' k-dimensional representation of the data.
  - The k-dimensional representation of xi is the vector of projection coefficients

        zi = Vᵀxi = (v1ᵀxi, ..., vkᵀxi)ᵀ,

    where V = [v1, ..., vk] ∈ Rp×k (a short R sketch follows).
- For simplicity, we will assume from now on that our dataset is centred, i.e. we subtract the average x̄ from each xi.
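As a concrete illustration, here is a minimal R sketch of this projection, using the crabs data from MASS (which appears later in these slides); princomp carries out the same computation.

library(MASS)
X <- as.matrix(crabs[,4:8])
Xc <- scale(X, center=TRUE, scale=FALSE)  ## centre the data
S <- cov(Xc)                              ## sample covariance matrix
V <- eigen(S)$vectors                     ## columns are the PCs v_1,...,v_p
k <- 2
Z <- Xc %*% V[,1:k]                       ## rows are z_i = V^T x_i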


Principal Components Analysis (PCA)

- Our data set is an iid sample of a random vector X = [X1 ... Xp]ᵀ.
- For the 1st PC, we seek a derived variable of the form

    Z1 = v11X1 + v12X2 + ... + v1pXp = v1ᵀX,

  where the weights v1 = [v11, ..., v1p]ᵀ ∈ Rp are chosen to maximise Var(Z1). To get a well defined problem, we fix v1ᵀv1 = 1.
- The 2nd PC is chosen to be orthogonal to the 1st and is computed in a similar way. It will have the largest variance in the remaining p − 1 dimensions, etc.


Principal Components Analysis (PCA)

[Scatterplots: the same point cloud shown in the original (X1, X2) coordinates and in the rotated (PC_1, PC_2) coordinates.]

Deriving the First Principal Component

- Maximise, subject to v1ᵀv1 = 1:

    Var(Z1) = Var(v1ᵀX) = v1ᵀ Cov(X) v1 ≈ v1ᵀSv1,

  where S ∈ Rp×p is the sample covariance matrix, i.e.

    S = (1/(n−1)) ∑i=1..n (xi − x̄)(xi − x̄)ᵀ = (1/(n−1)) ∑i=1..n xixiᵀ = (1/(n−1)) XᵀX

  (the last two equalities use the centring assumption x̄ = 0).
- Rewriting this as a constrained maximisation problem, the Lagrangian is

    L(v1, λ1) = v1ᵀSv1 − λ1(v1ᵀv1 − 1).

- The corresponding vector of partial derivatives yields

    ∂L(v1, λ1)/∂v1 = 2Sv1 − 2λ1v1.

- Setting this to zero reveals the eigenvector equation: v1 must be an eigenvector of S and λ1 the corresponding eigenvalue.
- Since v1ᵀSv1 = λ1v1ᵀv1 = λ1, the 1st PC must be the eigenvector associated with the largest eigenvalue of S.
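A quick numerical check of this result, as a sketch in R (again using the crabs measurements purely for illustration): the variance of the data projected onto the top eigenvector equals the largest eigenvalue of S.

library(MASS)
Xc <- scale(as.matrix(crabs[,4:8]), center=TRUE, scale=FALSE)
S <- cov(Xc)
e <- eigen(S)
v1 <- e$vectors[,1]  ## eigenvector with the largest eigenvalue
var(Xc %*% v1)       ## Var(Z_1)...
e$values[1]          ## ...equals lambda_1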


Deriving Subsequent Principal Components

- Proceed as before, but include the additional constraint that the 2nd PC must be orthogonal to the 1st PC:

    L(v2, λ2, μ) = v2ᵀSv2 − λ2(v2ᵀv2 − 1) − μ(v1ᵀv2).

- Solving this shows that v2 must be the eigenvector of S associated with the 2nd largest eigenvalue, and so on.
- The eigenvalue decomposition of S is given by

    S = VΛVᵀ,

  where Λ is a diagonal matrix with eigenvalues

    λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0,

  and V is a p × p orthogonal matrix whose columns are the p eigenvectors of S, i.e. the principal components v1, ..., vp.


Properties of the Principal Components

- PCs are uncorrelated:

    Cov(Xᵀvi, Xᵀvj) ≈ viᵀSvj = 0 for i ≠ j.

- The total sample variance is given by

    ∑i=1..p Sii = λ1 + ... + λp,

  so the proportion of total variance explained by the kth PC is

    λk / (λ1 + λ2 + ... + λp),  k = 1, 2, ..., p.

- S is a real symmetric matrix, so its eigenvectors (principal components) are orthogonal.
- The derived variables Z1, ..., Zp have variances λ1, ..., λp.
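In R the proportions of total variance explained can be read off from the eigenvalues, or from summary(); a minimal sketch (crabs data assumed):

library(MASS)
lambda <- eigen(cov(scale(crabs[,4:8], scale=FALSE)))$values
lambda / sum(lambda)            ## proportion of variance explained by each PC
cumsum(lambda) / sum(lambda)    ## cumulative proportions
summary(princomp(crabs[,4:8]))  ## reports the same proportions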


R code

This is what we have had before:

library(MASS)
Crabs <- crabs[,4:8]
Crabs.class <- factor(paste(crabs[,1],crabs[,2],sep=""))
plot(Crabs,col=unclass(Crabs.class))

Now perform PCA with the function princomp. (Alternatively, solve for the PCs yourself using eigen or svd.)

Crabs.pca <- princomp(Crabs,cor=FALSE)
plot(Crabs.pca)
pairs(predict(Crabs.pca),col=unclass(Crabs.class))


Original Crabs Data

[Pairwise scatterplots of the original crabs variables FL, RW, CL, CW and BD, coloured by group.]

PCA of Crabs Data

[Pairwise scatterplots of the PC scores Comp.1-Comp.5 of the crabs data, coloured by group.]

PC 2 vs PC 3

[Scatterplot of Comp.2 against Comp.3 for the crabs data.]

PCA on Face Images

http://vismod.media.mit.edu/vismod/demos/facerec/basic.html

PCA on European Genetic Variation

http://www.nature.com/nature/journal/v456/n7218/full/nature07331.html

Comments on the use of PCA

- PCA is commonly used to project data X onto the first k PCs, giving the k-dimensional view of the data that best preserves the first two moments.
- Although PCs are uncorrelated, scatterplots sometimes reveal structures in the data other than linear correlation.
- PCA is commonly used for lossy compression of high dimensional data.
- The emphasis on variance is where the weaknesses of PCA stem from:
  - The PCs depend heavily on the units of measurement. Where the data matrix contains measurements of vastly differing orders of magnitude, the PCs will be greatly biased in the direction of the larger measurements. It is therefore recommended to calculate PCs from Corr(X) instead of Cov(X).
  - Robustness to outliers is also an issue. Variance is affected by outliers, therefore so are PCs.


Eigenvalue Decomposition (EVD)

Eigenvalue decomposition plays a significant role in PCA: the PCs are eigenvectors of S = (1/(n−1))XᵀX, and the properties of PCA are derived from those of eigenvectors and eigenvalues.

- For any p × p symmetric matrix S, there exist p eigenvectors v1, ..., vp that are pairwise orthogonal and p associated eigenvalues λ1, ..., λp which satisfy the eigenvalue equation Svi = λivi for all i.
- S can be written as S = VΛVᵀ where
  - V = [v1, ..., vp] is a p × p orthogonal matrix,
  - Λ = diag{λ1, ..., λp},
  - if S is a real-valued matrix, then the eigenvalues are real-valued as well, λi ∈ R for all i.
- To compute the PCA of a dataset X, we can:
  - First estimate the covariance matrix using the sample covariance S.
  - Compute the EVD of S using the R command eigen (see the sketch below).
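A minimal sketch of this recipe, checked against princomp (crabs data assumed; the loadings agree up to sign):

library(MASS)
Xc <- scale(as.matrix(crabs[,4:8]), center=TRUE, scale=FALSE)
S <- cov(Xc)                        ## estimate the covariance matrix
evd <- eigen(S)                     ## EVD: S = V Lambda V^T
evd$values                          ## lambda_1 >= ... >= lambda_p
evd$vectors[,1]                     ## first PC
princomp(crabs[,4:8])$loadings[,1]  ## same direction, possibly flipped sign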


Singular Value Decomposition (SVD)

Though the EVD does not always exist, the singular value decomposition is another matrix factorization technique that always exists, even for non-square matrices.

- X can be written as X = UDVᵀ where:
  - U is an n × n matrix with orthogonal columns,
  - D is an n × p matrix with decreasing non-negative elements on the diagonal (the singular values) and zero off-diagonal elements,
  - V is a p × p matrix with orthogonal columns.
- The SVD can be computed using very fast and numerically stable algorithms. The relevant R command is svd (see the sketch below).
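A small usage sketch (the centred crabs matrix assumed). Note that R's svd returns the compact form, with U of size n × min(n, p):

library(MASS)
Xc <- scale(as.matrix(crabs[,4:8]), center=TRUE, scale=FALSE)
s <- svd(Xc)                                 ## X = U D V^T
dim(s$u); length(s$d); dim(s$v)
max(abs(Xc - s$u %*% diag(s$d) %*% t(s$v)))  ## ~ 0: factorization is exact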


Some Properties of the SVD

- Let X = UDVᵀ be the SVD of the n × p data matrix X.
- Note that

    (n−1)S = XᵀX = (UDVᵀ)ᵀ(UDVᵀ) = VDᵀUᵀUDVᵀ = VDᵀDVᵀ,

  using the orthogonality (UᵀU = In) of U.
- The eigenvalues of S are thus the diagonal entries of (1/(n−1))D², and the columns of the orthogonal matrix V are the eigenvectors of S.
- We also have

    XXᵀ = (UDVᵀ)(UDVᵀ)ᵀ = UDVᵀVDᵀUᵀ = UDDᵀUᵀ,

  using the orthogonality (VᵀV = Ip) of V.
- The SVD also gives the optimal low-rank approximations of X:

    min over X̃ of ‖X̃ − X‖₂  s.t. X̃ has rank at most r < min(n, p).

  This problem can be solved by keeping only the r largest singular values of X, zeroing out the smaller singular values in the SVD.


Biplots

- PCA plots show the data items (the rows of X) in the PC space.
- Biplots allow us to visualize the original variables (the columns of X) in the same plot.
- As for PCA, we would like the geometry of the plot to preserve as much of the covariance structure as possible.


Biplots

Recall that X = [X1, ..., Xp]ᵀ and X = UDVᵀ is the SVD of the data matrix.

- The PC projection of xi is:

    zi = Vᵀxi = DUᵢᵀ = [D11Ui1, ..., DkkUik]ᵀ.

- The jth unit vector ej ∈ Rp points in the direction of Xj. Its PC projection is Vⱼᵀ = Vᵀej, the jth row of V.
- The projection of a variable indicates the weighting each PC gives to the original variables.
- Dot products between the projections give entries of the data matrix:

    xij = ∑k=1..p UikDkkVjk = ⟨DUᵢᵀ, Vⱼᵀ⟩.

- The distance of projected points from projected variables gives the original location.
- These relationships can be plotted in 2D by focussing on the first two PCs.
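A numerical check of the dot-product identity above, as a sketch (the centred matrix Xc and its SVD from the earlier sketches assumed):

s <- svd(Xc)
i <- 3; j <- 2
sum(s$u[i,] * s$d * s$v[j,])  ## <D U_i^T, V_j^T>
Xc[i,j]                       ## equals the entry x_ij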


Biplots

[Biplot of simulated data: points together with arrows for the variables X1 and X2.]

Biplots

- There are other projections we can consider for biplots:

    xij = ∑k=1..p UikDkkVjk = ⟨DUᵢᵀ, Vⱼᵀ⟩ = ⟨D^(1−α)Uᵢᵀ, D^αVⱼᵀ⟩,

  where 0 ≤ α ≤ 1. The α = 1 case has some nice properties.
- The covariance of the projected points is:

    (1/(n−1)) ∑i=1..n UᵢᵀUᵢ = (1/(n−1)) I.

  Projected points are uncorrelated and the dimensions have equal variance.
- The covariance between Xj and Xℓ is:

    Cov(Xj, Xℓ) = (1/(n−1)) ⟨DVⱼᵀ, DVℓᵀ⟩,

  so the angle between the projected variables gives the correlation.
- When using k < p PCs, the quality depends on the proportion of variance explained by the PCs.


Biplots

[Biplots of 100 simulated data points on the first two PCs, with arrows for the variables X1 and X2, drawn with scale=0 and scale=1.]

pc <- princomp(x)
biplot(pc,scale=0)
biplot(pc,scale=1)


Iris Data

50 samples from each of 3 species of iris: Iris setosa, versicolor, and virginica.

Each has measurements of the lengths and widths of both sepals and petals.

Collected by E. Anderson (1935) and analysed by R.A. Fisher (1936).

Using again the functions princomp and biplot:

iris1 <- iris
iris1 <- iris1[,-5]
biplot(princomp(iris1,cor=T))


Iris Data

[Biplot of the 150 iris flowers on the first two PCs, with arrows for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.]

US Arrests Data

This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

pairs(USArrests)
usarrests.pca <- princomp(USArrests,cor=T)
plot(usarrests.pca)

pairs(predict(usarrests.pca))
biplot(usarrests.pca)


US Arrests Data Pairs Plot

[Pairs plot of Murder, Assault, UrbanPop and Rape for the 50 states.]

US Arrests Data Biplot

[Biplot of the 50 states on the first two PCs, with arrows for Murder, Assault, UrbanPop and Rape.]

Multidimensional Scaling

Suppose there are n points X in Rp, but we are only given the n × n matrix D of inter-point distances.

Can we reconstruct X?


Multidimensional Scaling

Rigid transformations (translations, rotations and reflections) do not change inter-point distances, so we cannot recover X exactly. However, X can be recovered up to these transformations!

- Let dij = ‖xi − xj‖₂ be the distance between points xi and xj. Then

    d²ij = ‖xi − xj‖₂² = (xi − xj)ᵀ(xi − xj) = xiᵀxi + xjᵀxj − 2xiᵀxj.

- Let B = XXᵀ be the n × n matrix of dot-products, bij = xiᵀxj. The above shows that D can be computed from B.
- Some algebraic exercise shows that B can be recovered from D if we assume ∑i=1..n xi = 0 (see the sketch below).
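The "algebraic exercise" is the classical double-centring step; a minimal R sketch, with a small simulated point set x introduced purely for illustration:

set.seed(1)
x <- scale(matrix(rnorm(20), 10, 2), center=TRUE, scale=FALSE)  ## 10 centred points
D2 <- as.matrix(dist(x))^2            ## squared inter-point distances
J <- diag(10) - matrix(1/10, 10, 10)  ## centring matrix
B <- -0.5 * J %*% D2 %*% J            ## recovers B = X X^T
max(abs(B - x %*% t(x)))              ## ~ 0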


Multidimensional Scaling

- If we knew X, then the SVD gives X = UDVᵀ. As X has rank at most k = min(n, p), there are at most k non-zero singular values in D, and we can assume U ∈ Rn×k, D ∈ Rk×p and V ∈ Rp×p.
- The eigendecomposition of B is then:

    B = XXᵀ = UDDᵀUᵀ = UΛUᵀ.

- This eigendecomposition can be obtained from B without knowledge of X!
- Let x̃iᵀ = UiΛ^(1/2) be the ith row of UΛ^(1/2). Pad x̃i with 0s so that it has length p. Then

    x̃iᵀx̃j = UiΛUjᵀ = bij = xiᵀxj,

  and we have found a set of vectors with dot-products given by B.
- The vectors x̃i differ from the xi only via the orthogonal matrix V, so they are equivalent up to rotations and reflections.
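Putting the pieces together, a sketch of classical MDS from scratch, compared with cmdscale (the matrices x and B from the previous sketch assumed):

e <- eigen(B)  ## B = U Lambda U^T
k <- 2
Xt <- e$vectors[,1:k] %*% diag(sqrt(e$values[1:k]))  ## rows are U_i Lambda^(1/2)
max(abs(as.matrix(dist(Xt)) - as.matrix(dist(x))))   ## distances preserved (~ 0)
cmdscale(dist(x), k=2)  ## same configuration, up to rotation/reflection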


US City Flight Distances

We present a table of flying mileages between 10 American cities, distances calculated from our 2-dimensional world. Using D as the starting point, metric MDS finds a configuration with the same distance matrix.

      ATLA  CHIG  DENV  HOUS    LA  MIAM    NY    SF  SEAT    DC
ATLA     0   587  1212   701  1936   604   748  2139  2182   543
CHIG   587     0   920   940  1745  1188   713  1858  1737   597
DENV  1212   920     0   879   831  1726  1631   949  1021  1494
HOUS   701   940   879     0  1374   968  1420  1645  1891  1220
LA    1936  1745   831  1374     0  2339  2451   347   959  2300
MIAM   604  1188  1726   968  2339     0  1092  2594  2734   923
NY     748   713  1631  1420  2451  1092     0  2571  2408   205
SF    2139  1858   949  1645   347  2594  2571     0   678  2442
SEAT  2182  1737  1021  1891   959  2734  2408   678     0  2329
DC     543   597  1494  1220  2300   923   205  2442  2329     0


US City Flight Distances

library(MASS)

us <- read.csv("http://www.stats.ox.ac.uk/~teh/teaching/smldm/data/uscities.csv")

## use classical MDS to find lower dimensional views of the data
## recover X in 2 dimensions
us.classical <- cmdscale(d=us,k=2)

plot(us.classical)
text(us.classical,labels=names(us))


US City Flight Distances

[2D configuration recovered by classical MDS: ATLANTA, CHICAGO, DENVER, HOUSTON, LA, MIAMI, NY, SF, SEATTLE and DC laid out to match the flight distances.]

Lower-dimensional Reconstructions

In the classical MDS derivation, we used all eigenvalues in the eigendecomposition of B to reconstruct

    x̃i = UiΛ^(1/2).

We can use only the largest k < min(n, p) eigenvalues and eigenvectors in the reconstruction, giving the 'best' k-dimensional view of the data.

This is analogous to PCA, where only the largest eigenvalues of XᵀX are used, and the smallest ones are effectively suppressed.

Indeed, PCA and classical MDS are duals and yield effectively the same result.


Crabs Data

library(MASS)
Crabs <- crabs[,4:8]
Crabs.class <- factor(paste(crabs[,1],crabs[,2],sep=""))

crabsmds <- cmdscale(d=dist(Crabs),k=2)
plot(crabsmds, pch=20, cex=2, col=unclass(Crabs.class))

[Scatterplot of the 2D MDS configuration of the crabs data (MDS 1 vs MDS 2), points coloured by group.]

Crabs Data

Compare with the previous PCA analysis. The classical MDS solution corresponds to the first 2 PCs.

[Pairs plot of Comp.1-Comp.5 from the PCA of the crabs data, for comparison.]

Example: Language data

Presence or absence of 2867 homologous traits in 87 Indo-European languages.

> X[1:15,1:16]
                V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
Irish_A          0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Irish_B          0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Welsh_N          0  0  0  1  0  0  0  0  0   0   0   0   0   0   0   0
Welsh_C          0  0  0  1  0  0  0  0  0   0   0   0   0   0   0   0
Breton_List      0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Breton_SE        0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Breton_ST        0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Romanian_List    0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Vlach            0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Italian          0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Ladin            0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Provencal        0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
French           0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Walloon          0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
French_Creole_C  0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0


Example: Language data

Using MDS with non-metric scaling. Compared with classical MDS (i.e. cmdscale), which minimizes ∑i≠j (d²ij − d̃²ij)², Sammon mapping puts more weight on reproducing the separation of points which are close, by forcing them apart.

[2D Sammon/MDS configuration of the 87 languages, labelled by language and coloured by k-means cluster.]

Projection by MDS (Jaccard/Sammon) with cluster discovery by k-means (Jaccard): there is an obvious east-to-west (top-left to bottom-right) separation of the languages in the MDS, and the clusters in the MDS grouping agree with the clusters discovered by agglomerative clustering and k-means. The two clustering methods group the languages slightly differently, with k-means splitting the Germanic languages.

## (alternative/MDS) make a field to display the clusters
## use MDS - sammon does this nicely
di.sam <- sammon(D,magic=0.20000002,niter=1000,tol=1e-8)
eqscplot(di.sam$points,pch=km$cluster,col=km$cluster)
text(di.sam$points,labels=row.names(X),pos=4,col=km$cluster)


Varieties of MDS

Generally, MDS is a class of dimensionality reduction techniques which represents data points x1, ..., xn ∈ Rp in a lower-dimensional space z1, ..., zn ∈ Rk in a way that tries to preserve inter-point (dis)similarities.

- It requires only the matrix D of pairwise dissimilarities

    dij = d(xi, xj).

  For example we can use the Euclidean distance dij = ‖xi − xj‖₂. Other dissimilarities are possible. Conversely, it can use a matrix of similarities.
- MDS finds representations z1, ..., zn ∈ Rk such that

    d(xi, xj) ≈ d̃ij = d̃(zi, zj),

  where d̃ represents dissimilarity in the reduced k-dimensional space, and differences in dissimilarities are measured by a stress function S(dij, d̃ij).


Varieties of MDS

Choices of (dis)similarities and stress functions lead to different objective functions and different algorithms (see the usage sketch below).

- Classical - preserves similarities instead:

    S(Z) = ∑i≠j (sij − ⟨zi − z̄, zj − z̄⟩)²

- Metric Shepard-Kruskal:

    S(Z) = ∑i≠j (dij − ‖zi − zj‖₂)²

- Sammon - preserves shorter distances more:

    S(Z) = ∑i≠j (dij − ‖zi − zj‖₂)² / dij

- Non-metric Shepard-Kruskal - ignores actual distance values, using only ranks:

    S(Z) = min over increasing g of ∑i≠j (g(dij) − ‖zi − zj‖₂)²
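In R these variants map onto standard functions; a usage sketch on the USArrests data (chosen here only because its points have no zero inter-point distances, which sammon and isoMDS reject):

library(MASS)
d <- dist(scale(USArrests))
z.classical <- cmdscale(d, k=2)       ## classical MDS
z.sammon <- sammon(d, k=2)$points     ## Sammon mapping
z.nonmetric <- isoMDS(d, k=2)$points  ## non-metric Shepard-Kruskal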


Nonlinear Dimensionality Reduction

Two aims of the different varieties of MDS:

- To visualize the (dis)similarities among items in a dataset, where these (dis)similarities may not have Euclidean geometric interpretations.
- To perform nonlinear dimensionality reduction.

Many high-dimensional datasets exhibit low-dimensional structure ("live on a low-dimensional manifold").


Isomap

Isomap is a non-linear dimensionality reduction technique based on classical MDS. It differs from other MDS variants in its estimate of the distances dij (a sketch follows the steps below).

1. Calculate distances dij for i, j = 1, ..., n between all data points, using the Euclidean distance.

2. Form a graph G with the n samples as nodes, and edges between the respective K nearest neighbours.

3. Replace the distances dij by the shortest-path distances dGij on the graph, and perform classical MDS using these distances.
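The three steps translate into a short R sketch; this is not a packaged Isomap implementation, and it assumes the igraph package for the shortest-path computation (and a K large enough that the neighbourhood graph is connected):

library(igraph)
isomap.sketch <- function(X, K=7, k=2) {
  D <- as.matrix(dist(X))          ## step 1: Euclidean distances
  n <- nrow(D)
  A <- matrix(0, n, n)
  for (i in 1:n) {                 ## step 2: K-nearest-neighbour graph
    nbrs <- order(D[i,])[2:(K+1)]  ## skip the point itself
    A[i, nbrs] <- D[i, nbrs]
    A[nbrs, i] <- D[i, nbrs]
  }
  g <- graph_from_adjacency_matrix(A, mode="undirected", weighted=TRUE)
  DG <- distances(g)               ## shortest-path distances d^G_ij
  cmdscale(DG, k=k)                ## step 3: classical MDS
}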

[Figures: the "Swiss roll" data set, the neighbourhood graph built by Isomap, the recovered two-dimensional embedding, and residual-variance curves comparing PCA, MDS and Isomap on four data sets.]

Examples from Tenenbaum et al. (2000).


Handwritten Characters

[Figure: Isomap applied to 1000 handwritten "2"s from the MNIST database; the two most significant embedding dimensions articulate the bottom loop (x axis) and top arch (y axis) of the digits (Tenenbaum et al., 2000).]

Faces

[Figure: Isomap applied to 698 face images (64 × 64 pixels, varying pose and lighting); each embedding coordinate correlates highly with one degree of freedom: left-right pose, up-down pose and lighting direction (Tenenbaum et al., 2000).]

Other Nonlinear Dimensionality Reduction Techniques

- Locally Linear Embedding.
- Laplacian Eigenmaps.
- Maximum Variance Unfolding.


Neural Electroencephalography (EEG)

http://newton.umsl.edu/tsytsarev_files/Lecture02.htm

Neural Spike Waveforms


Pairs Plot of Principal Components

[Pairs plot of the principal component scores of the spike waveform data.]

Clustering

I Many datasets consist of multiple heterogeneous subsets. Cluster analysis is a range of methods that reveal this heterogeneity by discovering clusters of similar points.
I Model-based clustering:
I Each cluster is described using a probability model.
I Model-free clustering:
I Defined by similarity among points within clusters (dissimilarity among points between clusters).
I Partition-based clustering methods:
I Allocate points into K clusters (see the K-means sketch after this list).
I The number of clusters is usually fixed beforehand or investigated for various values of K as part of the analysis.
I Hierarchy-based clustering methods:
I Allocate points into clusters and clusters into super-clusters, forming a hierarchy.
I Typically the hierarchy forms a binary tree (a dendrogram) where each cluster has two “children”.
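As a concrete partition-based method, K-means is available in base R; a minimal sketch (illustrative; X is a data matrix and K = 3 an assumed choice):

km <- kmeans(X, centers = 3, nstart = 20)   # K = 3 clusters, 20 random restarts
km$cluster                                  # cluster allocation of each point
plot(X, col = km$cluster)                   # colour points by allocated cluster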

73


Hierarchical Clustering

I Hierarchically structured data can be found everywhere (e.g. measurements of different species, and of different individuals within species); hierarchical methods attempt to understand such data by looking for clusters.
I There are two general strategies for generating hierarchical clusters. Both proceed by seeking to minimize some measure of dissimilarity.
I Agglomerative / Bottom-Up / Merging
I Divisive / Top-Down / Splitting
In the agglomerative case, hierarchical clusters are generated so that, at each level, clusters are created by merging clusters at lower levels. This process is easily viewed as a dendrogram (tree).
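The agglomerative strategy is implemented by base R's hclust; a minimal sketch (my example, not course code; X is a data matrix):

hc <- hclust(dist(X), method = "single")   # agglomerative clustering, single linkage
plot(hc)                                   # draw the dendrogram
labs <- cutree(hc, k = 3)                  # cut the dendrogram into 3 clusters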

74


Measuring Dissimilarity
To find hierarchical clusters, we need some way to measure the dissimilarity between clusters.

I Given two points xi and xj, it is straightforward to measure their dissimilarity, say d(xi, xj) = ‖xi − xj‖2.
I It is unclear, however, how to extend this to measure dissimilarity between clusters, D(Ci, Cj) for clusters Ci and Cj.

There are many such proposals, though no consensus as to which is best.
(a) Single Linkage: D(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
(b) Complete Linkage: D(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }
(c) Average Linkage: D(Ci, Cj) = avg { d(x, y) : x ∈ Ci, y ∈ Cj }
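The three linkages differ only in how they summarise the set of cross-cluster pairwise distances; a minimal sketch (illustrative; Ci and Cj are matrices holding the points of the two clusters):

cluster_dissim <- function(Ci, Cj, linkage = "single") {
  n1 <- nrow(Ci)
  # all cross-cluster distances d(x, y) with x in Ci and y in Cj
  d <- as.matrix(dist(rbind(Ci, Cj)))[1:n1, n1 + 1:nrow(Cj)]
  switch(linkage,
         single   = min(d),     # (a) closest pair
         complete = max(d),     # (b) farthest pair
         average  = mean(d))    # (c) average over all pairs
}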

75


Measuring Dissimilarity

76


Hierarchical Clustering on Artificial Dataset

[Figure: scatter plot of the artificial dataset (3000 points, axes V1 and V2), each point drawn as its row number; the individual point labels have been omitted.]

77


Hierarchical Clustering on Artificial Dataset

# start afresh
library(cluster)          # provides agnes() and the xclara dataset
dat <- xclara             # 3000 x 2

# plot the data
plot(dat, type = "n")
text(dat, labels = row.names(dat))

plot(agnes(dat, method = "single"))
plot(agnes(dat, method = "complete"))
plot(agnes(dat, method = "average"))
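To turn any of these fits into actual cluster labels, convert to an hclust object and cut the tree; a sketch (my addition, assuming the code above has been run; k = 3 since xclara contains three well-separated groups):

ag <- agnes(dat, method = "average")
labs <- cutree(as.hclust(ag), k = 3)   # cut the dendrogram into 3 clusters
plot(dat, col = labs)                  # colour the scatter plot by cluster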

78


Hierarchical Clustering on Artificial Dataset

[Figure: dendrogram of agnes(x = dat, method = "single"), Agglomerative Coefficient = 0.93; x axis: dat, y axis: Height. Leaf labels omitted.]
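The agglomerative coefficient quoted with each dendrogram is agnes's one-number summary of clustering structure: roughly, for each point take the dissimilarity at which it first merges, divide by the dissimilarity of the final merge, subtract from 1, and average over points, so values near 1 suggest strong structure. It is stored on the fitted object:

agnes(dat, method = "single")$ac   # approximately 0.93 for this dataset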

79


Hierarchical Clustering on Artificial Dataset

[Figure: dendrogram of agnes(x = dat, method = "complete"), Agglomerative Coefficient = 0.99; x axis: dat, y axis: Height. Leaf labels omitted.]

80


Hierarchical Clustering on Artificial Dataset

[Figure: dendrogram of the third agnes fit (method = "average" in the code above); leaf labels omitted.]

7010

4716

9412

4114

1314

1911

8612

5516

1218

3712

8212

3713

8113

7912

9514

99 920

1833

1389

1126

1057

2035

1384 99

713

4299

617

8414

6516

5019

5719

4110

8813

41 933

983

2001

1669

1897

1884

936

1827

1006

1791

1798

1172

1621

1367

976

1711

1226

1468

1247

1512

1782

1864

1940

1303

1330

1700

1530

1573 982

1114

1597

1211

1432

1654

1028

1084

1118

1411

1113

1454

1704

1136

1938

1956

1484

1103

1866

1467

1528

1222

1681

1642

1138

1415

1256

1314

1501

1885

1558

1024

1661

1074

1263

1374

1302

1751

1572

1109

1311

1267

1858

1272

913

1162

1260

1707

1778

1515

1804

1878

967

2011

1087

1583

1870 99

510

3719

81 919

1390

1767

1931

1078

1094

1489

1151

1123

1991

1400

1877

2037

1592

1999

1494

1825

963

1725

1949 975

1651

1098

1935

1095

1561

1382

1315

1831

1907

1144

1298

1522

1170

1946

1183

1606

1509

1581

1050

1925

1984

1329

1586

2050

1840

1061

1262

1230

1562

1580

1632

1809 97

411

5710

5818

2619

0814

6498

419

1914

2210

0812

9312

2316

1514

6314

7420

2315

7711

4319

0918

7415

5016

5219

7211

0515

8712

0818

1612

3919

6316

0317

8017

7712

8118

3516

6416

6318

0618

6019

3712

1917

7319

1316

2620

3293

294

910

1594

114

3817

7917

6519

9811

4912

05 1787

1012

1916

1634

1066

1372

1904

1847

1102

1811

1757

1234

1505

1541 90

295

713

1916

6720

1990

516

4512

7317

7518

20 1481

923

1735

2018

954

1124

1593 11

4514

2619

6420

1419

1013

7396

816

7120

1714

7515

3615

7417

33 1387

908

1349

1246

1307

1584

1895

1914

1013

1261

1628

1939

950

1822

1174

1533

1575

1624

2007

1221

1502

1274

1758

1322

1918

934

1769

1923

1055

1578

1529

1967 1746

1059

1268

1890

1630

1855

1340

1585

1383

1378

2024

1455

1506

1395

2043

1854

1054

1848

1839

1339

1567

1405

1965

1548

1104

1974

1493

1882

1280

1868

1682

1453

1740

947

1752 16

7014

2318

1017

3218

2912

6914

2119

34 1009

1161

1692

951

1790

1689 97

916

0915

2616

98 1571

1719

906

1001

1576

1386

1477

1404

1450

1517

910

1083

1849

1053

1801

1844 1

675

927

1192

1841

1090

1966

1891

969

1721

1763

1701

1786

1883

2036

966

1276

1770 9

4519

2819

8820

2711

9613

3714

9115

8218

9417

14 918

1363

1950

1980

1063

1792

1799 97

315

5610

5217

9720

4212

6515

4216

2716

4020

0519

3320

26 980

1277

1392

1119

1622

1880

1080

1356

1812

1418

1977 1244

1771

1469

1507

1492

1838

1695

1570

1657

1220

1668

1774

1325

1899

942

1190

1278

1229

1495

1023

1445

1471

1524

1290

1794

1817

2044

1110

1184

1452

1347

1148

1431

1992

1749

1019

1520

1554

1900

1158

1164

1447

1709

1091

1236

1297

1568

1410

2006 9

5515

0020

28 1224

1201

1723

993

1051

1566

2038

1238

1056

1480 91

417

6820

3410

6715

99 1048

1843

1744

1823 18

2110

4517

55 1116

921

1915

1199

1595

1678 1

327

958

1887

1428

1409

1594

1429

1440

1962

1032

1371 12

2792

516

0199

816

8511

3413

3811

7515

6511

9413

6415

1616

07 1249

1728

1320

1324 15

5711

2716

5856

226

62 2380 2

467

2792

2125

2453

2690 29

3321

2922

4125

9522

4429

1525

5426

9822

9924

5220

5224

3122

9627

1127

13 2077

2105

2833

2886

2454

2674

2606

2768

2912

2167

2333

2193

2776

2231

2107

2186

2385

2975

2218

2939

2715

2326

2395

2618

2668

2291

2526

2909

2386

2972

2678

2748

2779 21

4725

3821

7724

7225

6323

8927

2323

1829

3429

5725

4426

5929

0724

0226

9924

2629

0420

5928

1922

2123

1021

3421

6621

5026

2829

50 2571

2106

2670

2895 25

7822

1522

8827

44 2443

2356

2396

2508

2979

2275

2921

2518

2731

2687

2859

2850

2708 23

2724

9725

6027

3922

8028

8729

6324

8326

8528

04 2448

2926

2932 2

524

2586

2610

2772

2991 21

8328

5326

3129

8325

5128

1720

5125

0729

4824

2024

6125

1620

7921

9927

4523

8722

9826

4628

4126

3027

50 2773

2074

2981

2585

2846

2375

2437

2936 23

0721

8227

7722

2722

9228

8025

7428

0821

2729

1821

6321

8027

4925

6528

3928

0722

5027

5427

71 2332

2367

2688

2871

2056

2784

2601

2845

2949 25

1221

5526

03 2345

2102

2655

2783

2482

2283

2355

2136

2736

3000

2195

2248

2140

2716

2734

2198

2917

2523

2962

2679

2893

2412

2862

2477

2253

2724

2446

2344

2710 29

1920

6829

7421

7922

4723

8424

1721

6029

5522

5928

9924

7025

3323

1424

7327

2623

2429

7828

4026

4829

9526

6126

9129

8727

4721

0026

1925

3726

1221

7229

7025

4728

7729

8928

00 215

727

9123

1328

5121

1621

8924

8027

9621

9628

8923

4726

71 2447

2830

2903

2803

2945

2058

2990

2316

2086

2108

2625

2272

2946

2910

2191

2391

2504

2364

2459

2633 27

4123

1723

6025

5824

2429

1426

1126

5128

2024

3928

7223

9325

8424

3625

1425

7620

6027

3821

4626

0725

6425

4329

1127

3028

0520

6222

9723

4122

0721

2128

1328

8128

8429

6120

6424

7922

5222

3021

8826

7320

7521

6823

3725

5927

9528

3529

0021

2822

7023

5223

9024

7420

8529

8823

7629

9329

5922

5628

4925

0624

7826

4724

4421

1323

8125

0523

0929

8523

6821

5128

2727

1923

7329

6824

23 2958

2063

2255

2545

2103

2814

2829

2392

2575

2569

2865

2953

2112

2229

2143

2378

2511

2823

2174

2810

2548

2675

2816

2131

2652

2692

2908

2088

2657

2282

2604

2682

2091

2162

2450

2956

2109

2269

2411

2271

2728

2225

2758

2302

2434

2371

2496

2587 27

6220

9229

4224

3026

3222

5122

6028

3829

6022

5727

1723

0628

2825

0025

5328

6128

9129

8620

7123

8326

1729

6529

9823

2823

4825

1324

7628

0924

1329

7628

3629

2820

8420

9825

4628

7022

4524

7121

5824

2922

7629

2924

2820

9525

4922

6524

1026

7728

9821

4227

8125

9728

1520

9023

2924

8523

1123

5120

9623

0028

1128

5622

0123

3424

0822

8624

2126

0924

1627

0026

9528

7522

1923

4923

7428

5528

64 2388

2235

2277

2487

2753

2566

2080

2137

2562

2531

2130

2322

2425

2173

2780

2232

2290

2683

2760

2293

2458

2984

2362

2789

2522

2684

2913 26

0821

8122

1422

3729

8023

0322

9427

3327

7020

5325

9624

6629

9626

4228

5727

9829

6720

7623

3125

5726

9420

9928

6721

8723

9922

0320

8722

2622

2024

5525

8125

9120

8224

6929

0129

7127

0128

6625

1927

2229

4427

8528

8220

9427

5628

3124

2225

5227

6724

5728

9028

9424

9426

3429

0526

4326

6420

5421

65 2339

2104

2858

2702

2897

2081

2111

2279

2152

2490

2433

2735 26

2622

0826

6727

9024

6825

2828

79 2906

2354

2693

2463

2718

2999

2379

2637

2550

2509

2600

2787

2892

2541 2

154

2847

2197

2764

2377

2055

2120

2938

2825

2860

2169

2925

2149

2407

2213

2489

2672

2706

2806

2115

2697

2649

2954

2620

2888

2837

2869

2066

2703

2281

2629

2078

2284

2139

2602

2465

2521

2935

2409

2414

2964

2788

2832

2614

2737

2418

2947

2530

2097

2658

2669

2709

2878

2973

2122

2440

2720

2778

2818

2751

2592

2885

2615

2721

2123

2769

2133

2267

2400

2570

2786

2529

2305

2308

2930

2093

2641

2126

2397

2456

2752 21

6422

0624

8122

7422

3324

3524

4925

7227

4023

0426

6025

3923

6321

3828

3421

8522

3423

38 2583

2501

2763

2542

2937 29

2421

0121

19 2759

2264

2145

2240

2775 21

6123

5029

2226

5029

9427

1226

6520

5723

4626

2421

4824

9920

8921

7522

6827

0722

1121

9024

4528

4321

1822

2825

2725

7324

0522

3622

3823

7027

4223

9820

6727

6522

2428

4825

6827

4623

9426

8026

4422

0925

3628

9624

6423

6127

9321

7826

2129

0224

1924

0426

5624

3825

1723

1929

41 2488

2205

2475

2460

2532

2704 25

8222

7824

0322

4627

6129

51 2222

2114

2135

2312

2645

2822

2863

2156

2353

2170

2868

2273

2239

2502

2931

2223

2357

2266

2335

2824

2556

2689

2627

2714

2117

2323

2797

2124

2515

2249

2561

2492

2132

2358

2705

2217

2844

2263

2451

2802

2406

2525

2638

2262

2876

2873

2462

2639

2491

2594

2315

2812

2382

2401

2498

2755

2415

2681

2992

2343

2635

2365

2622

2686

2369

2535

2653

2427

2579

2852

2883

2493

2969

2061

2359

2654

2997

2336

2766

2441

2484

2577

2725

2927

2301

2486

2442

2640

2065

2593

2192

2159

2204

2567

2743

2842

2110

2616

2295

2254

2782

2826

2258

2613 23

3020

8322

4221

4129

20 279

422

8925

1023

2525

8829

8223

4025

4026

2320

6925

8927

9929

4321

7127

2929

16 2534

2202

2321

2874 22

6124

9522

1222

8729

5229

7720

7326

96 2676

2194

2636

2216

2144

2366

770

2940

2072

2285

2598 23

7221

5323

4225

0325

20 2432

2599

2176

2580

2732

2555

2854

2184

2727

2757 22

0022

4323

2027

74 2666

2590

2801

2605

2966

2210

2923

2663

020

4060

Dendrogram of agnes(x = dat, method = "average")

Agglomerative Coefficient = 0.99dat

Hei

ght

81
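For reference, a minimal sketch of how such a figure is produced with the cluster package (not the slides' code; it assumes the data matrix dat named in the caption above):

library(cluster)
ag <- agnes(dat, method = "average")   # agglomerative clustering, average linkage
ag$ac                                  # agglomerative coefficient, as in the caption
pltree(ag)                             # plot the dendrogram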


Using Dendrograms

I Different ways of measuring dissimilarity result in different trees.

I Dendrograms are useful for getting a feel for the structure of high-dimensional data, though they do not represent distances between observations well.

I Dendrograms show hierarchical clusters with respect to increasing values of dissimilarity between clusters. Cutting a dendrogram horizontally at a particular height partitions the data into disjoint clusters, which are represented by the vertical lines it intersects. Cutting horizontally effectively reveals the state of the clustering algorithm when the dissimilarity value between clusters is no more than the value cut at (see the R sketch after this slide).

I Despite the simplicity of this idea and the above drawbacks, hierarchical clustering methods provide users with interpretable dendrograms that allow clusters in high-dimensional data to be better understood.

82
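A minimal sketch of horizontal cutting in R (not the slides' code), using stats::hclust and cutree in place of agnes, and assuming a numeric data matrix dat:

hc  <- hclust(dist(dat), method = "average")  # average-linkage dendrogram
cl4 <- cutree(hc, k = 4)    # cut into exactly 4 disjoint clusters
clh <- cutree(hc, h = 0.5)  # or cut at dissimilarity height 0.5
table(cl4)                  # cluster sizes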


Hierarchical Clustering on Indo-European Languages

[Figure: dendrogram over the Indo-European languages, with a dissimilarity axis running from 0.0 to 1.0. The leaves group the Albanian dialects; the Indo-Iranian languages (Gypsy, Singhalese, Kashmiri, Nepali, Hindi, Panjabi, Lahnda, Bengali, Marathi, Gujarati, Afghan, Waziri, Ossetic, Wakhi, Baluchi, Persian, Tadzik); the Greek and Armenian varieties; Hittite and Tocharian A/B; the Celtic languages (Irish, Welsh, Breton); the Baltic (Latvian, Lithuanian) and Slavic languages; the Germanic languages (Icelandic, Faroese, Swedish, Danish, English, German, Penn. Dutch, Frisian, Flemish, Dutch, Afrikaans); and the Romance languages (Romanian, Vlach, Sardinian, French Creole, Provencal, French, Walloon, Spanish, Portuguese, Brazilian, Catalan, Italian, Ladin).]

83


K-means

Partition-based methods seek to divide data points into a pre-assigned number of clusters C1, . . . , CK where for all k, k′ ∈ {1, . . . , K},

Ck ⊂ {1, . . . , n},   Ck ∩ Ck′ = ∅ for all k ≠ k′,   C1 ∪ · · · ∪ CK = {1, . . . , n}.

For each cluster, represent it using a prototype or cluster centre µk. We can measure the quality of a cluster with its within-cluster deviance

W(Ck, µk) = ∑_{i∈Ck} ‖xi − µk‖₂².

The overall quality of the clustering is given by the total within-cluster deviance:

W = ∑_{k=1}^{K} W(Ck, µk).

The overall objective is to choose both the cluster centres and the allocation of points to minimize this objective function (a small R sketch of computing W follows below).

84
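A minimal R sketch (not from the slides) of the total within-cluster deviance, taking the prototypes µk to be the cluster means; X is assumed to be an n × p numeric matrix and c a vector of cluster labels in 1, . . . , K:

within.dev <- function(X, c) {
  W <- 0
  for (k in unique(c)) {
    Xk <- X[c == k, , drop = FALSE]
    mu <- colMeans(Xk)                 # cluster centre mu_k
    W  <- W + sum(sweep(Xk, 2, mu)^2)  # sum over i in C_k of ||x_i - mu_k||^2
  }
  W
}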


K-means

W = ∑_{k=1}^{K} ∑_{i∈Ck} ‖xi − µk‖₂² = ∑_{i=1}^{n} ‖xi − µci‖₂²

where ci = k if and only if i ∈ Ck.

I Given the partition {Ck}, we can find the optimal prototypes easily by differentiating W with respect to µk:

∂W/∂µk = −2 ∑_{i∈Ck} (xi − µk) = 0   ⇒   µk = (1/|Ck|) ∑_{i∈Ck} xi

I Given the prototypes, we can easily find the optimal partition by assigning each data point to the closest cluster prototype:

ci = argmin_k ‖xi − µk‖₂²

But joint minimization over both is computationally difficult.

85


K-means

The K-means algorithm is a well-known method that locally optimizes the objective function W by iterative and alternating minimization (an R sketch follows after this slide).

1. Randomly fix K cluster centres µ1, . . . , µK.
2. For each i = 1, . . . , n, assign xi to the cluster with the nearest centre:
   ci := argmin_k ‖xi − µk‖₂²
3. Set Ck := {i : ci = k} for each k.
4. Move the cluster centres µ1, . . . , µK to the averages of the new clusters:
   µk := (1/|Ck|) ∑_{i∈Ck} xi
5. Repeat steps 2 to 4 until there are no more changes.
6. Return the partition {C1, . . . , CK} and means µ1, . . . , µK.

86
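The steps above translate directly into a few lines of R. This is an illustrative sketch (not the slides' code), for an n × p matrix X, and it assumes no cluster ever becomes empty:

simple.kmeans <- function(X, K, max.iter = 100) {
  mu <- X[sample(nrow(X), K), , drop = FALSE]           # step 1: random centres
  c  <- rep(0, nrow(X))
  for (iter in 1:max.iter) {
    d    <- as.matrix(dist(rbind(mu, X)))[-(1:K), 1:K]  # point-to-centre distances
    cnew <- apply(d, 1, which.min)                      # step 2: nearest centre
    if (all(cnew == c)) break                           # step 5: stop when unchanged
    c <- cnew                                           # step 3: new partition
    for (k in 1:K)                                      # step 4: move centres
      mu[k, ] <- colMeans(X[c == k, , drop = FALSE])
  }
  list(cluster = c, centers = mu)                       # step 6
}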


K-means

Some notes about the K-means algorithm.

I The algorithm stops in a finite number of iterations. Between steps 2 and 3, W either stays constant or decreases; this implies that we never revisit the same partition. As there are only finitely many partitions, the number of iterations cannot exceed this.

I The K-means algorithm need not converge to the global optimum. K-means is a heuristic search algorithm, so it can get stuck at suboptimal configurations. The result depends on the starting configuration. Typically one performs a number of runs from different starting configurations and picks the best clustering (see the sketch below).

87
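In R, kmeans() supports exactly this via its nstart argument, keeping the run with the smallest total within-cluster deviance. A small sketch (not the slides' code), using the Crabs data from the following slides:

library(MASS); data(crabs)
set.seed(1)
one  <- kmeans(log(crabs[,4:8]), 2, nstart = 1)   # single random start
many <- kmeans(log(crabs[,4:8]), 2, nstart = 25)  # best of 25 restarts
c(one$tot.withinss, many$tot.withinss)            # restarts can only match or improve W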


K-means on Crabs

Looking at the Crabs data again.

library(MASS)
library(lattice)
data(crabs)

splom(~log(crabs[,4:8]),
      col=as.numeric(crabs[,1]),
      pch=as.numeric(crabs[,2]),
      main="circle/triangle is gender, black/red is species")

88


K-means on Crabs

[Figure: scatter plot matrix of log(crabs[,4:8]) with panels FL, RW, CL, CW and BD; circle/triangle is gender, black/red is species.]

89


K-means on Crabs

Apply K-means with 2 clusters and plot results.

cl <- kmeans( log(crabs[,4:8]), 2, nstart=1, iter.max=10)

splom(~log(crabs[,4:8]), col=cl$cluster+2,
      main="blue/green is cluster finds big/small")

90


K-means on Crabs

[Figure: scatter plot matrix of log(crabs[,4:8]) with panels FL, RW, CL, CW and BD, coloured blue/green by cluster; the two clusters separate big crabs from small ones.]

91


K-means on Crabs

‘Whiten’ or ‘sphere’1 the data using PCA.

pcp <- princomp( log(crabs[,4:8]) )
spc <- pcp$scores %*% diag(1/pcp$sdev)
splom( ~spc[,1:3],
       col=as.numeric(crabs[,1]), pch=as.numeric(crabs[,2]),
       main="circle/triangle is gender, black/red is species")

And apply K-means again.

cl <- kmeans(spc, 2, nstart=1, iter.max=20)
splom( ~spc[,1:3],
       col=cl$cluster+2, main="blue/green is cluster")

1 Apply a linear transformation so that the covariance matrix is the identity.

92
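A quick check (not on the slides) that the sphering worked, assuming spc from above:

round(cov(spc), 2)  # close to the 5 x 5 identity matrix
                    # (exactly I up to the n/(n-1) variance divisor of cov)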


K-means on Crabs

[Figure: two scatter plot matrices of the first three sphered principal components V1–V3. Left panel: circle/triangle is gender, black/red is species. Right panel: blue/green is cluster.]

Discovers the gender difference... The result depends crucially on sphering the data first.

93


K-means on Crabs

Using 4 cluster centres.

[Figure: two scatter plot matrices of the sphered components V1–V3. Left panel: circle/triangle is gender, black/red is species. Right panel: colours are clusters.]

94


K-means on Spike Waveforms

library(MASS)
library(lattice)
spikes <- read.table("spikes.txt")
cl <- kmeans(spikes, 6, nstart=20)
splom(spikes, col=cl$cluster)

95


K-means on Spike Waveforms

[Figure: scatter plot matrix of the spike waveform principal components V1–V5, coloured by the six K-means clusters.]

96

Page 97: Oxford Statisticsteh/teaching/smldmHT2014/lectures-01… · Course Structure Lectures I 1400-1500 Mondays in Math Institute L4. I 1000-1100 Wednesdays in Math Institute L3. Part C:

Stochastic Optimization

I Each iteration of K-means requires a pass through the whole dataset. On extremely large datasets, this can be computationally prohibitive.
I Stochastic optimization: update cluster means after assigning each data point to the closest cluster.
I Repeat for t = 1, 2, . . . until satisfactory convergence:
  1. Pick a data item xi either randomly or in order.
  2. Assign xi to the cluster with the nearest centre:
     ci := argmin_k ‖xi − µk‖₂²
  3. Update the cluster centre:
     µk := µk + αt(xi − µk)
     where αt > 0 are step sizes.
I The algorithm stochastically minimizes the objective function. Convergence requires slowly decreasing step sizes (a sketch follows below):
  ∑_{t=1}^∞ αt = ∞,   ∑_{t=1}^∞ αt² < ∞
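A minimal R sketch of these updates (X is an n × p data matrix with rows as observations; the function name, the number of steps and the 1/(t+1) schedule are illustrative choices, not part of the slides):

stochastic_kmeans <- function(X, K, n_steps = 10000) {
  n <- nrow(X)
  mu <- X[sample(n, K), , drop = FALSE]      # initialise centres at random data points
  for (t in 1:n_steps) {
    i  <- sample(n, 1)                       # 1. pick a data item at random
    d2 <- rowSums((mu - matrix(X[i, ], K, ncol(X), byrow = TRUE))^2)
    k  <- which.min(d2)                      # 2. assign to the nearest centre
    alpha <- 1 / (t + 1)                     # step sizes: sum diverges, sum of squares finite
    mu[k, ] <- mu[k, ] + alpha * (X[i, ] - mu[k, ])   # 3. update the centre
  }
  mu
}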

97

Vector Quantization

I A related algorithm developed in the signal processing literature for lossy data compression.
I If K ≪ n, we can store the codebook of codewords µ1, . . . , µK, and each vector xi is encoded using ci, which only requires ⌈log₂ K⌉ bits (see the encoding sketch below).
I As with K-means, K must be specified. Increasing K improves the quality of the compressed image but worsens the data compression rate, so there is a clear tradeoff.
I Some audio and video codecs use this method.
I The stochastic optimization algorithm for K-means was originally developed for VQ.
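A minimal R sketch of encoding and decoding against a fixed codebook (function and argument names are illustrative):

vq_encode <- function(X, codebook) {
  # index of the nearest codeword for each row of X
  apply(X, 1, function(x) which.min(colSums((t(codebook) - x)^2)))
}
vq_decode <- function(codes, codebook) {
  codebook[codes, , drop = FALSE]            # reconstruct each vector by its codeword
}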

98

VQ Image Compression

3 × 3 block VQ: view each block of 3 × 3 pixels as a single observation.
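As a quick check of the rates on the following slides (assuming each 3 × 3 block is coded by a single codebook index): a codebook of size 1024 costs ⌈log₂ 1024⌉ = 10 bits per block, i.e. 10/9 ≈ 1.11 bits/pixel; size 128 costs 7/9 ≈ 0.78 bits/pixel; and size 16 costs 4/9 ≈ 0.44 bits/pixel.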

99


VQ Image Compression

Original image (24 bits/pixel, uncompressed size 1,402 kB)

100

VQ Image Compression

Codebook length 1024 (1.11 bits/pixel, total size 88 kB)

101

VQ Image Compression

Codebook length 128 (0.78 bits/pixel, total size 50 kB)

102

VQ Image Compression

Codebook length 16 (0.44 bits/pixel, total size 27 kB)

103

K-means Additional Comments

I Sensitivity to distance measure. Euclidean distance can be greatly affected by measurement units and by strong correlations. Can use the Mahalanobis distance,
  ‖x − y‖M = √((x − y)ᵀ M⁻¹ (x − y))
  where M is a positive semi-definite matrix, e.g. the sample covariance (a small sketch follows below).
I Other partition based methods. There are many other partition based methods that employ related ideas, e.g. K-medoids, which differs from K-means in requiring the cluster centres µi to be observations xi²; K-medians (use the median in each dimension); and K-modes (use the mode).
I Determination of K. The K-means objective will always improve with a larger number of clusters K, so determining K requires an additional regularization criterion. E.g., in DP-means³, use
  W = ∑_{k=1}^K ∑_{i∈Ck} ‖xi − µk‖₂² + λK

² See also Affinity propagation.  ³ DP-means paper.
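A minimal R sketch of the Mahalanobis distance with the sample covariance as M (base R's stats::mahalanobis returns the squared distance; the data here are purely illustrative):

X <- matrix(rnorm(200), 100, 2)              # illustrative data, n = 100, p = 2
d <- sqrt(mahalanobis(X, center = colMeans(X), cov = cov(X)))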

104

Probabilistic Methods

I Algorithmic approach:
  Data → Algorithm → Analysis/Interpretation
I Probabilistic modelling approach:
  Unobserved process → Generative Model → Data → Analysis/Interpretation

105

Mixture Models

I Mixture models suppose that our dataset was created by sampling iid from K distinct populations (called mixture components).
I Typical samples in population k can be modelled using a distribution F(φk) with density f(x|φk). For a concrete example, consider a Gaussian with unknown mean φk and known spherical covariance σ²I,
  f(x|φk) = (2πσ²)^{−p/2} exp(−(1/(2σ²)) ‖x − φk‖₂²).
I Generative process: for i = 1, 2, . . . , n:
  I First determine which population item i came from (independently):
    Zi ∼ Discrete(π1, . . . , πK),   i.e. P(Zi = k) = πk,
    where the mixing proportions satisfy πk ≥ 0 for each k and ∑_{k=1}^K πk = 1.
  I If Zi = k, then Xi = (Xi1, . . . , Xip)ᵀ is sampled (independently) from the corresponding population distribution:
    Xi|Zi = k ∼ F(φk)
I We observe Xi = xi for each i, and would like to learn about the unknown parameters of the process. (A sampler sketch follows below.)
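A minimal R sketch of this generative process for the spherical Gaussian example (phi is a K × p matrix of component means; all names are illustrative):

rmix <- function(n, pi_k, phi, sigma) {
  K <- nrow(phi); p <- ncol(phi)
  z <- sample(K, n, replace = TRUE, prob = pi_k)   # Zi ~ Discrete(pi_1, ..., pi_K)
  x <- phi[z, , drop = FALSE] +
       matrix(rnorm(n * p, sd = sigma), n, p)      # Xi | Zi = k ~ N(phi_k, sigma^2 I)
  list(x = x, z = z)
}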

106

Mixture Models

I The unknowns to learn given data are:
  I Parameters: π1, . . . , πK, φ1, . . . , φK, as well as
  I Latent variables: z1, . . . , zn.
I The joint probability over all cluster indicator variables {Zi} is:
  pZ((zi)_{i=1}^n) = ∏_{i=1}^n π_{zi} = ∏_{i=1}^n ∏_{k=1}^K πk^{1(zi=k)}
I The joint density at the observations Xi = xi given Zi = zi is:
  pX((xi)_{i=1}^n | (Zi = zi)_{i=1}^n) = ∏_{i=1}^n ∏_{k=1}^K f(xi|φk)^{1(zi=k)}
I So the joint probability/density⁴ is:
  pX,Z((xi, zi)_{i=1}^n) = ∏_{i=1}^n ∏_{k=1}^K (πk f(xi|φk))^{1(zi=k)}

⁴ In this course we will treat probabilities and densities equivalently for notational simplicity. In general, the quantity is a density with respect to the product base measure, where the base measure is the counting measure for discrete variables and Lebesgue for continuous variables.

107

Mixture Models - Posterior Distribution

I Suppose we know the parameters (πk, φk)_{k=1}^K.
I Zi is a random variable, so the posterior distribution given the data tells us what we know about it:
  Qik := p(Zi = k|xi) = p(Zi = k, xi) / p(xi) = πk f(xi|φk) / ∑_{j=1}^K πj f(xi|φj)
  where the marginal probability is:
  p(xi) = ∑_{j=1}^K πj f(xi|φj)
I The posterior probability Qik of Zi = k is called the responsibility of mixture component k for data point xi (a small sketch computing these follows below).
I The posterior distribution softly partitions the dataset among the K components.
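A minimal R sketch computing the responsibilities for one data point x under the spherical Gaussian model (names are illustrative; the Gaussian normalising constant is common to all k and cancels, and the maximum is subtracted before exponentiating for numerical stability):

responsibilities <- function(x, pi_k, phi, sigma) {
  K <- nrow(phi)
  logw <- sapply(1:K, function(k)
    log(pi_k[k]) - sum((x - phi[k, ])^2) / (2 * sigma^2))
  w <- exp(logw - max(logw))   # unnormalised weights, up to a constant common to all k
  w / sum(w)                   # Q_{ik} for k = 1, ..., K
}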

108

Mixture Models - Maximum Likelihood

I How can we learn about the parameters θ = (πk, φk)_{k=1}^K from data?
I Standard statistical methodology asks for the maximum likelihood estimator (MLE).
I The log likelihood is the log marginal probability of the data:
  ℓ((πk, φk)_{k=1}^K) := log p((xi)_{i=1}^n | (πk, φk)_{k=1}^K) = ∑_{i=1}^n log ∑_{j=1}^K πj f(xi|φj)
  with gradient
  ∇_{φk} ℓ((πk, φk)_{k=1}^K) = ∑_{i=1}^n [πk f(xi|φk) / ∑_{j=1}^K πj f(xi|φj)] ∇_{φk} log f(xi|φk)
                             = ∑_{i=1}^n Qik ∇_{φk} log f(xi|φk)
I A difficult equation to solve, as Qik depends implicitly on φk...

109

Mixture Models - Maximum Likelihood

∑_{i=1}^n Qik ∇_{φk} log f(xi|φk) = 0

I What if we ignore the dependence of Qik on the parameters?
I Taking the mixture of Gaussians with covariance σ²I as an example,
  ∑_{i=1}^n Qik ∇_{φk} (−(p/2) log(2πσ²) − (1/(2σ²)) ‖xi − φk‖₂²)
    = (1/σ²) ∑_{i=1}^n Qik (xi − φk)
    = (1/σ²) ((∑_{i=1}^n Qik xi) − φk (∑_{i=1}^n Qik)) = 0
  φk^{MLE?} = ∑_{i=1}^n Qik xi / ∑_{i=1}^n Qik

110

Mixture Models - Maximum Likelihood

I The estimate is a weighted average of data points, where the estimated mean of cluster k uses its responsibilities to data points as weights:
  φk^{MLE?} = ∑_{i=1}^n Qik xi / ∑_{i=1}^n Qik
I This makes sense: suppose we knew that data point xi came from population zi. Then Qi,zi = 1 and Qik = 0 for k ≠ zi, and the estimate reduces to the within-cluster mean:
  φk^{MLE} = ∑_{i: zi=k} xi / ∑_{i: zi=k} 1
I Our best guess of the originating population is given by Qik.

111

Mixture Models - Maximum Likelihood

I For the mixing proportions, we can similarly derive an estimator.
I Include a Lagrange multiplier λ to enforce the constraint ∑_k πk = 1:
  ∇_{log πk} (ℓ((πk, φk)_{k=1}^K) − λ(∑_{k=1}^K πk − 1))
    = ∑_{i=1}^n πk f(xi|φk) / ∑_{j=1}^K πj f(xi|φj) − λπk
    = ∑_{i=1}^n Qik − λπk = 0
  Summing over k shows that λ = n, so
  πk^{MLE?} = ∑_{i=1}^n Qik / n
I Again this makes sense: the estimate is simply (our best guess of) the proportion of data points coming from population k.

112

Mixture Models - The EM Algorithm

I Putting all the derivations together, we get an iterative algorithm for learning about the unknowns in the mixture model.
I Start with some initial parameters (πk^{(0)}, φk^{(0)})_{k=1}^K.
I Iterate for t = 1, 2, . . .:
  I Expectation Step:
    Qik^{(t)} := πk^{(t−1)} f(xi|φk^{(t−1)}) / ∑_{j=1}^K πj^{(t−1)} f(xi|φj^{(t−1)})
  I Maximization Step:
    πk^{(t)} = ∑_{i=1}^n Qik^{(t)} / n,   φk^{(t)} = ∑_{i=1}^n Qik^{(t)} xi / ∑_{i=1}^n Qik^{(t)}
I Will the algorithm converge?
I What does it converge to? (A runnable sketch of these updates follows below.)
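A minimal R sketch of these E and M steps for the spherical Gaussian mixture with known σ (X is an n × p data matrix; names, initialisation and the fixed iteration count are all illustrative):

em_mixture <- function(X, K, sigma, n_iter = 100) {
  n <- nrow(X); p <- ncol(X)
  phi  <- X[sample(n, K), , drop = FALSE]    # initial means: random data points
  pi_k <- rep(1 / K, K)                      # initial mixing proportions
  for (t in 1:n_iter) {
    # E step: responsibilities Q[i, k], computed in log space for stability
    logw <- sapply(1:K, function(k)
      log(pi_k[k]) - rowSums((X - matrix(phi[k, ], n, p, byrow = TRUE))^2) / (2 * sigma^2))
    Q <- exp(logw - apply(logw, 1, max))
    Q <- Q / rowSums(Q)
    # M step: update proportions and means from the responsibilities
    Nk   <- colSums(Q)
    pi_k <- Nk / n
    phi  <- t(Q) %*% X / Nk
  }
  list(pi_k = pi_k, phi = phi, Q = Q)
}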

113

Likelihood Surface for a Simple Example

[Figure (from Murphy, Fig. 11.6): (a) histogram of the data; (b) likelihood surface p(D|µ1, µ2) over (µ1, µ2), showing two symmetric modes that reflect the unidentifiability of the parameters under permutation of the component labels.]

(left) n = 200 data points from a mixture of two 1D Gaussians with π1 = π2 = 0.5, σ = 5 and µ1 = 10, µ2 = −10. (right) Log likelihood surface ℓ(µ1, µ2), all the other parameters being assumed known.

114

Example: Mixture of 3 Gaussians

An example with 3 clusters.

[Scatterplot of the data; axes X1 and X2.]

115

Example: Mixture of 3 Gaussians

After 1st E and M step.

[Scatterplot of the data with the current fit; axes data[,1] and data[,2]. Iteration 1.]

116

Example: Mixture of 3 Gaussians

After 2nd E and M step.

[Scatterplot of the data with the current fit; axes data[,1] and data[,2]. Iteration 2.]

117

Example: Mixture of 3 Gaussians

After 3rd E and M step.

[Scatterplot of the data with the current fit; axes data[,1] and data[,2]. Iteration 3.]

118

Example: Mixture of 3 Gaussians

After 4th E and M step.

[Scatterplot of the data with the current fit; axes data[,1] and data[,2]. Iteration 4.]

119

Example: Mixture of 3 Gaussians

After 5th E and M step.

[Scatterplot of the data with the current fit; axes data[,1] and data[,2]. Iteration 5.]

120

The EM Algorithm

I In a maximum likelihood framework, the objective function is the log likelihood,
  ℓ(θ) = ∑_{i=1}^n log ∑_{j=1}^K πj f(xi|φj)
  Direct maximization is not feasible.
I Consider another objective function F(θ, q) such that:
  F(θ, q) ≤ ℓ(θ) for all θ, q,
  max_q F(θ, q) = ℓ(θ),
  i.e. F(θ, q) is a lower bound on the log likelihood.
I We can construct an alternating maximization algorithm as follows.
  For t = 1, 2, . . . until convergence:
    q^{(t)} := argmax_q F(θ^{(t−1)}, q)
    θ^{(t)} := argmax_θ F(θ, q^{(t)})

121

EM Algorithm

I The lower bound we use is called the variational free energy.
I q is a probability mass function for some distribution over (Zi), and, writing z := (zi)_{i=1}^n to shorten notation,
  F(θ, q) = E_q[log p((xi, zi)_{i=1}^n) − log q((zi)_{i=1}^n)]
          = E_q[(∑_{i=1}^n ∑_{k=1}^K 1(zi = k)(log πk + log f(xi|φk))) − log q(z)]
          = ∑_z q(z)[(∑_{i=1}^n ∑_{k=1}^K 1(zi = k)(log πk + log f(xi|φk))) − log q(z)]

122

EM Algorithm - Solving for q

I Introducing a Lagrange multiplier to enforce ∑_z q(z) = 1, and setting derivatives to 0,
  ∇_{q(z)} F(θ, q) = ∑_{i=1}^n ∑_{k=1}^K 1(zi = k)(log πk + log f(xi|φk)) − log q(z) − 1 − λ
                   = ∑_{i=1}^n (log π_{zi} + log f(xi|φ_{zi})) − log q(z) − 1 − λ = 0
  q*(z) = ∏_{i=1}^n π_{zi} f(xi|φ_{zi}) / ∑_{z′} ∏_{i=1}^n π_{z′i} f(xi|φ_{z′i})
        = ∏_{i=1}^n [π_{zi} f(xi|φ_{zi}) / ∑_k πk f(xi|φk)]
        = ∏_{i=1}^n p(zi|xi, θ)
I The optimal q* is simply the posterior distribution.
I Plugging the optimal q* into the variational free energy,
  F(θ, q*) = ∑_{i=1}^n log ∑_{k=1}^K πk f(xi|φk) = ℓ(θ)

123

EM Algorithm - Solving for θ

I Setting the derivative with respect to φk to 0,
  ∇_{φk} F(θ, q) = ∑_z q(z) ∑_{i=1}^n 1(zi = k) ∇_{φk} log f(xi|φk)
                 = ∑_{i=1}^n q(zi = k) ∇_{φk} log f(xi|φk) = 0
I This equation can be solved quite easily. E.g., for a mixture of Gaussians,
  φk* = ∑_{i=1}^n q(zi = k) xi / ∑_{i=1}^n q(zi = k)
I If it cannot be solved exactly, we can use a gradient ascent algorithm:
  φk* = φk + α ∑_{i=1}^n q(zi = k) ∇_{φk} log f(xi|φk)
I This leads to the generalized EM algorithm. A further extension using stochastic optimization methods leads to the stochastic EM algorithm.
I A similar derivation gives the optimal πk as before.

124

EM Algorithm

I Start with some initial parameters (πk^{(0)}, φk^{(0)})_{k=1}^K.
I Iterate for t = 1, 2, . . .:
  I Expectation Step:
    q^{(t)}(zi = k) := πk^{(t−1)} f(xi|φk^{(t−1)}) / ∑_{j=1}^K πj^{(t−1)} f(xi|φj^{(t−1)}) = E_{p(zi|xi, θ^{(t−1)})}[1(zi = k)]
  I Maximization Step:
    πk^{(t)} = ∑_{i=1}^n q^{(t)}(zi = k) / n,   φk^{(t)} = ∑_{i=1}^n q^{(t)}(zi = k) xi / ∑_{i=1}^n q^{(t)}(zi = k)
I Each step increases the log likelihood (a monitoring sketch follows below):
  ℓ(θ^{(t−1)}) = F(θ^{(t−1)}, q^{(t)}) ≤ F(θ^{(t)}, q^{(t)}) ≤ F(θ^{(t)}, q^{(t+1)}) = ℓ(θ^{(t)}).
I Under the additional assumption that the Hessians ∇²_θ F(θ^{(t)}, q^{(t)}) are negative definite with eigenvalues < −ε < 0, θ^{(t)} → θ*, where θ* is a local MLE.
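A small companion sketch for monitoring convergence: the log likelihood ℓ(θ) under the same spherical Gaussian setup as the EM sketch above, which should never decrease across iterations (computed with the log-sum-exp trick; names are illustrative):

loglik <- function(X, pi_k, phi, sigma) {
  n <- nrow(X); p <- ncol(X); K <- nrow(phi)
  logf <- sapply(1:K, function(k)
    -p / 2 * log(2 * pi * sigma^2) -
      rowSums((X - matrix(phi[k, ], n, p, byrow = TRUE))^2) / (2 * sigma^2))
  lw <- sweep(logf, 2, log(pi_k), "+")       # log(pi_k * f(xi | phi_k))
  m  <- apply(lw, 1, max)
  sum(m + log(rowSums(exp(lw - m))))         # sum_i log sum_k, via log-sum-exp
}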

125

Notes on Probabilistic Approach and EM Algorithm

Some good things:

I Guaranteed convergence to locally optimal parameters.
I Formal reasoning about uncertainties, using both Bayes' Theorem and maximum likelihood theory.
I Rich language of probability theory to express a wide range of generative models, and straightforward derivation of algorithms for ML estimation.

Some bad things:

I Can get stuck in local optima, so multiple restarts are recommended.
I Slower and more expensive than K-means.
I Choice of K is still problematic, but a rich array of methods for model selection comes to the rescue.

126

Flexible Gaussian Mixture Models

I Allowing each cluster to have its own mean and covariance structure gives greater flexibility in the model:
  I Different covariances
  I Identical covariances
  I Different, but diagonal covariances
  I Identical and spherical covariances

127

Probabilistic PCA

I A probabilistic model related to PCA has the following generative model: for i = 1, 2, . . . , n:
  I Let k < n, p be given.
  I Let Yi be a k-dimensional normally distributed random variable with 0 mean and identity covariance:
    Yi ∼ N(0, Ik)
  I We model the distribution of the ith data point given Yi as a p-dimensional normal:
    Xi ∼ N(µ + LYi, σ²I)
  where the parameters are a vector µ ∈ R^p, a matrix L ∈ R^{p×k} and σ² > 0.
I The EM algorithm can be used for ML estimation, but PCA can more directly give an MLE (note this is not unique).
I Let λ1 ≥ · · · ≥ λp be the eigenvalues of the sample covariance and let V ∈ R^{p×k} have columns given by the eigenvectors of the top k eigenvalues. Let R ∈ R^{k×k} be orthogonal. Then an MLE is (see the sketch below):
  µ^{MLE} = x̄,   (σ²)^{MLE} = (1/(p−k)) ∑_{j=k+1}^p λj,
  L^{MLE} = V diag((λ1 − (σ²)^{MLE})^{1/2}, . . . , (λk − (σ²)^{MLE})^{1/2}) R
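A minimal R sketch of this MLE via the eigendecomposition of the sample covariance, taking R to be the identity (function and variable names are illustrative):

ppca_mle <- function(X, k) {
  p   <- ncol(X)
  e   <- eigen(cov(X), symmetric = TRUE)     # eigenvalues lambda_1 >= ... >= lambda_p
  lam <- e$values
  s2  <- mean(lam[(k + 1):p])                # (sigma^2)^MLE: average of the discarded eigenvalues
  L   <- e$vectors[, 1:k, drop = FALSE] %*%
         diag(sqrt(pmax(lam[1:k] - s2, 0)), k)   # pmax guards against rounding below zero
  list(mu = colMeans(X), L = L, sigma2 = s2)
}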

128

Mixture of Probabilistic PCAs

I We have learnt two types of unsupervised learning techniques:
  I Dimensionality reduction, e.g. PCA, MDS, Isomap.
  I Clustering, e.g. K-means, linkage and mixture models.
I Probabilistic models allow us to construct more complex models from simpler pieces.
I A mixture of probabilistic PCAs allows both clustering and dimensionality reduction at the same time:
  Zi ∼ Discrete(π1, . . . , πK)
  Yi ∼ N(0, Id)
  Xi|Zi = k, Yi = yi ∼ N(µk + Lk yi, σ²Ip)
I This allows flexible modelling of covariance structure without using too many parameters.

Ghahramani and Hinton 1996 129

Mixture of Probabilistic PCAs

I PCA can reconstruct x given a low dimensional embedding z, but is linear.
I Isomap is non-linear, but cannot reconstruct x given any z.

[Figure from Tenenbaum et al. (2000), Science 290: Isomap applied to 64 × 64 pixel face images (a three-dimensional embedding capturing pose and lighting) and to 1000 handwritten "2"s from the MNIST database.]

I We can learn a probabilistic mapping between the k-dimensional Isomap embedding space and the p-dimensional data space.
I Demo: [using LLE instead of Isomap, and a mixture of factor analysers instead of a mixture of PPCAs].

Teh and Roweis 2002 130

Further Readings—Unsupervised Learning

I Hastie et al, Chapter 14.
I James et al, Chapter 10.
I Venables and Ripley, Chapter 11.
I Tukey, John W. (1980). We need both exploratory and confirmatory. The American Statistician 34 (1): 23-25.

131