Principal Components: A Conceptual Introduction

Simon Mason

International Research Institute for Climate Prediction

The Earth Institute of Columbia University

Linking Science to Society


Linking Science to Sport!

What makes a good soccer team?

Everybody(?) has their favourite soccer team. But which is the best team, and how can we determine that it is the best?

We usually justify our choice of best team by describing it in rather vague ways such as “good at scoring goals”, “excellent defensive line”, “fair players”.

We need some quantifiable metrics rather than vague descriptions.


Soccer-Playing Metrics

Metrics can be defined for measuring the quality of a soccer team objectively.

Each metric could be measured over a season or a number of seasons.


Soccer-Playing Metrics

1. Frequency of home wins (home wins).

2. Frequency of home losses (home losses).

3. Frequency of home goals scored (home for).

4. Frequency of home goals ceded (home against).

5. Frequency of away wins (away wins).

6. Frequency of away losses (away losses).

7. Frequency of away goals scored (away for).

8. Frequency of away goals ceded (away against).

9. Number of bookings (bookings).

10. Average attendance (attendance).


English Premiership Teams 2003/04

1. Arsenal

2. Aston Villa

3. Birmingham

4. Blackburn Rovers

5. Bolton Wanderers

6. Charlton Athletic

7. Chelsea

8. Everton

9. Fulham

10. Leeds United

11. Leicester City

12. Liverpool

13. Manchester City

14. Manchester United

15. Middlesbrough

16. Newcastle United

17. Portsmouth

18. Southampton

19. Tottenham Hotspur

20. Wolverhampton Wanderers


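| Team | Home wins | Home losses | Home for | Home against | Away wins | Away losses | Away for | Away against | Bookings | Attendance |
|---|---|---|---|---|---|---|---|---|---|---|
| Arsenal | 15 | 0 | 40 | 14 | 11 | 0 | 33 | 12 | 58 | 38079 |
| Aston Villa | 9 | 4 | 24 | 19 | 6 | 8 | 24 | 25 | 58 | 36622 |
| Birmingham | 8 | 6 | 26 | 24 | 4 | 6 | 17 | 24 | 55 | 29074 |
| Blackburn | 5 | 10 | 25 | 31 | 7 | 8 | 26 | 28 | 67 | 24376 |
| Bolton | 6 | 5 | 24 | 21 | 8 | 8 | 24 | 35 | 66 | 26795 |
| Charlton | 7 | 6 | 29 | 29 | 7 | 7 | 22 | 22 | 42 | 26293 |
| Chelsea | 12 | 3 | 34 | 13 | 12 | 4 | 33 | 17 | 51 | 41234 |
| Everton | 8 | 6 | 27 | 20 | 1 | 11 | 18 | 37 | 59 | 38837 |
| Fulham | 9 | 6 | 29 | 21 | 5 | 8 | 23 | 25 | 68 | 16342 |
| Leeds | 5 | 7 | 25 | 31 | 3 | 14 | 15 | 48 | 81 | 36666 |
| Leicester | 3 | 6 | 19 | 28 | 3 | 11 | 29 | 37 | 73 | 30983 |
| Liverpool | 10 | 5 | 29 | 15 | 6 | 5 | 26 | 22 | 49 | 42677 |
| Manchester City | 5 | 5 | 31 | 24 | 4 | 10 | 24 | 30 | 53 | 46834 |
| Manchester United | 12 | 3 | 37 | 15 | 11 | 6 | 27 | 20 | 49 | 67641 |
| Middlesbrough | 8 | 7 | 25 | 23 | 5 | 9 | 19 | 29 | 58 | 30398 |
| Newcastle | 11 | 3 | 33 | 14 | 2 | 5 | 19 | 26 | 53 | 51440 |
| Portsmouth | 10 | 5 | 35 | 19 | 2 | 12 | 12 | 35 | 68 | 20108 |
| Southampton | 8 | 5 | 24 | 17 | 4 | 10 | 20 | 28 | 59 | 31717 |
| Tottenham | 9 | 6 | 33 | 27 | 4 | 13 | 14 | 30 | 63 | 34876 |
| Wolves | 7 | 7 | 23 | 35 | 0 | 12 | 15 | 42 | 70 | 28874 |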


The Premiership Metric

In the Premiership the teams are ranked according to the number of games they win and draw, and then by goal difference if there are ties.

\[
\text{score} = 3.0 \times (\text{home wins} + \text{away wins}) + 1.0 \times (\text{home draws} + \text{away draws}) + c \times (\text{goals for} - \text{goals against}) + 0.0 \times \text{bookings} + 0.0 \times \text{attendance}
\]

where \(0.0 < c \ll 1.0\).

I.e., a weighted sum of the metrics is used to rank the teams.
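As a rough illustration of this weighted sum, the sketch below computes league points for three teams from their win and loss counts in this deck. The function name `premiership_points` is hypothetical, and draws are inferred on the assumption that each team plays 19 home and 19 away games. The goal-difference tie-break term is left out.

```python
# Points under the Premiership rule: 3 per win, 1 per draw, 0 per loss.
# Assumption: 19 home and 19 away games per team (20-team league), so
# draws = games - wins - losses. The tie-break on goal difference is omitted.
GAMES_PER_VENUE = 19


def premiership_points(home_wins, home_losses, away_wins, away_losses):
    """Weighted sum of wins and draws (hypothetical helper for illustration)."""
    home_draws = GAMES_PER_VENUE - home_wins - home_losses
    away_draws = GAMES_PER_VENUE - away_wins - away_losses
    wins = home_wins + away_wins
    draws = home_draws + away_draws
    return 3 * wins + 1 * draws


# (home wins, home losses, away wins, away losses) taken from the deck's tables
teams = {
    "Arsenal": (15, 0, 11, 0),
    "Chelsea": (12, 3, 12, 4),
    "Liverpool": (10, 5, 6, 5),
}

points = {name: premiership_points(*row) for name, row in teams.items()}
for name, value in points.items():
    print(name, value)
```

Bookings and attendance do not appear in the function at all, which is the same as giving them a weight of 0.0.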


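| Team | Home wins | Home losses | Home for | Home against | Away wins | Away losses | Away for | Away against | Bookings | Attendance | Points |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Arsenal | 15 | 0 | 40 | 14 | 11 | 0 | 33 | 12 | 58 | 38079 | 90 |
| Chelsea | 12 | 3 | 34 | 13 | 12 | 4 | 33 | 17 | 51 | 41234 | 79 |
| Manchester United | 12 | 3 | 37 | 15 | 11 | 6 | 27 | 20 | 49 | 67641 | 75 |
| Liverpool | 10 | 5 | 29 | 15 | 6 | 5 | 26 | 22 | 49 | 42677 | 60 |
| Newcastle | 11 | 3 | 33 | 14 | 2 | 5 | 19 | 26 | 53 | 51440 | 56 |
| Aston Villa | 9 | 4 | 24 | 19 | 6 | 8 | 24 | 25 | 58 | 36622 | 56 |
| Charlton | 7 | 6 | 29 | 29 | 7 | 7 | 22 | 22 | 42 | 26293 | 53 |
| Bolton | 6 | 5 | 24 | 21 | 8 | 8 | 24 | 35 | 66 | 26795 | 53 |
| Fulham | 9 | 6 | 29 | 21 | 5 | 8 | 23 | 25 | 68 | 16342 | 52 |
| Birmingham | 8 | 6 | 26 | 24 | 4 | 6 | 17 | 24 | 55 | 29074 | 50 |
| Middlesbrough | 8 | 7 | 25 | 23 | 5 | 9 | 19 | 29 | 58 | 30398 | 48 |
| Southampton | 8 | 5 | 24 | 17 | 4 | 10 | 20 | 28 | 59 | 31717 | 47 |
| Portsmouth | 10 | 5 | 35 | 19 | 2 | 12 | 12 | 35 | 68 | 20108 | 45 |
| Tottenham | 9 | 6 | 33 | 27 | 4 | 13 | 14 | 30 | 63 | 34876 | 45 |
| Blackburn | 5 | 10 | 25 | 31 | 7 | 8 | 26 | 28 | 67 | 24376 | 44 |
| Manchester City | 5 | 5 | 31 | 24 | 4 | 10 | 24 | 30 | 53 | 46834 | 41 |
| Everton | 8 | 6 | 27 | 20 | 1 | 11 | 18 | 37 | 59 | 38837 | 39 |
| Leicester | 3 | 6 | 19 | 28 | 3 | 11 | 29 | 37 | 73 | 30983 | 33 |
| Leeds | 5 | 7 | 25 | 31 | 3 | 14 | 15 | 48 | 81 | 36666 | 33 |
| Wolves | 7 | 7 | 23 | 35 | 0 | 12 | 15 | 42 | 70 | 28874 | 33 |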


A General Metric

A good team should score highly on all the metrics (note that losses, against and bookings can be measured so that high scores indicate good play by multiplying these scores by -1).

If we can combine the original metrics into one new metric that captures as much of the information in the ten metrics as possible, we will have a new general metric that we can use as an overall measure of the quality of a soccer team.


Variance

The differences between the teams on the various metrics provide the information we can use to distinguish good from bad teams.

On some metrics (e.g., attendance) the differences are large, but on others (e.g., home losses) most teams score about the same. The variance of each metric tells us how much information that metric provides for distinguishing the teams.

The total information available to distinguish the teams is the sum of the variances of each metric.
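To make the contrast concrete, here is a minimal sketch that computes the variance of two of the metrics from the team data in this deck. It assumes the sample variance (dividing by n - 1), which is what `statistics.variance` returns; the variable names are illustrative only.

```python
from statistics import variance  # sample variance: divides by n - 1

# Values for the 20 teams, listed in league-table order (Arsenal first).
home_losses = [0, 3, 3, 5, 3, 4, 6, 5, 6, 6,
               7, 5, 5, 6, 10, 5, 6, 6, 7, 7]
attendance = [38079, 41234, 67641, 42677, 51440, 36622, 26293, 26795,
              16342, 29074, 30398, 31717, 20108, 34876, 24376, 46834,
              38837, 30983, 36666, 28874]

var_home_losses = variance(home_losses)
var_attendance = variance(attendance)

# Adding the variances of the individual metrics gives the total information.
total_of_these_two = var_home_losses + var_attendance

print("home losses:", round(var_home_losses, 1))
print("attendance: ", round(var_attendance, 1))
print("attendance share of the two:", var_attendance / total_of_these_two)
```

Because attendance is counted in tens of thousands while home losses range from 0 to 10, the attendance variance is larger by many orders of magnitude.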


Variance

| Metric | Variance | Standardized variance |
|---|---|---|
| Home wins | 8.2 | 1.00 |
| Home losses | 4.2 | 1.00 |
| Away wins | 11.0 | 1.00 |
| Away losses | 11.8 | 1.00 |
| Home for | 29.0 | 1.00 |
| Home against | 42.4 | 1.00 |
| Away for | 36.1 | 1.00 |
| Away against | 74.1 | 1.00 |
| Bookings | 90.3 | 1.00 |
| Attendance | 134200466.4 | 1.00 |
| Total | 134200773.7 | 10.00 |

Since virtually all of the total variance is contributed by attendance, a combined score built from the raw metrics would be decided almost entirely by attendance. Alternatively, the metrics could be standardized to give them equal weight.


Standardize?

If we want to give each metric the same weight, we should standardize the data first. Otherwise, a team that performs poorly on a metric with high variance is likely to score badly overall: it will be difficult to make up the large deficit from metrics on which teams tend to score similarly.

The variance of the standardized metrics is 1.0. Therefore the total standardized variance will be 10.0 (the number of metrics).
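A minimal sketch of standardization, assuming the usual z-score (subtract the mean, divide by the sample standard deviation). The helper name `standardize` and its `higher_is_better` flag are hypothetical; the flag applies the multiplication by -1 described earlier for metrics such as losses, where a low raw value indicates good play.

```python
from statistics import mean, stdev, variance


def standardize(values, higher_is_better=True):
    """Return z-scores; flip the sign when low raw values mean good play."""
    sign = 1.0 if higher_is_better else -1.0
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1)
    return [sign * (v - m) / s for v in values]


# Values for the 20 teams, listed in league-table order (Arsenal first).
home_wins = [15, 12, 12, 10, 11, 9, 7, 6, 9, 8,
             8, 8, 10, 9, 5, 5, 8, 3, 5, 7]
home_losses = [0, 3, 3, 5, 3, 4, 6, 5, 6, 6,
               7, 5, 5, 6, 10, 5, 6, 6, 7, 7]

z_wins = standardize(home_wins)
z_losses = standardize(home_losses, higher_is_better=False)

print("Arsenal home wins:  ", round(z_wins[0], 2))
print("Arsenal home losses:", round(z_losses[0], 2))
print("variance of standardized home wins:", round(variance(z_wins), 2))
```

After standardization every metric has variance 1.0, so no single metric can dominate a combined score simply because of its units.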


| Team | Home wins | Home losses | Home for | Home against | Away wins | Away losses | Away for | Away against | Bookings | Attendance | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Arsenal | 2.32 | 2.56 | 2.12 | 1.23 | 1.73 | 2.43 | 1.83 | 1.93 | 0.21 | 0.27 | 1.66 |
| Chelsea | 1.27 | 1.10 | 1.00 | 1.38 | 2.03 | 1.27 | 1.83 | 1.35 | 0.95 | 0.54 | 1.27 |
| Manchester United | 1.27 | 1.10 | 1.56 | 1.07 | 1.73 | 0.68 | 0.83 | 1.00 | 1.16 | 2.82 | 1.32 |
| Liverpool | 0.57 | 0.12 | 0.07 | 1.07 | 0.23 | 0.97 | 0.67 | 0.77 | 1.16 | 0.66 | 0.63 |
| Newcastle | 0.92 | 1.10 | 0.82 | 1.23 | -0.98 | 0.97 | -0.50 | 0.30 | 0.74 | 1.42 | 0.60 |
| Aston Villa | 0.23 | 0.61 | -0.85 | 0.46 | 0.23 | 0.10 | 0.33 | 0.42 | 0.21 | 0.14 | 0.19 |
| Charlton | -0.47 | -0.37 | 0.07 | -1.07 | 0.53 | 0.39 | 0.00 | 0.77 | 1.89 | -0.75 | 0.10 |
| Bolton | -0.82 | 0.12 | -0.85 | 0.15 | 0.83 | 0.10 | 0.33 | -0.74 | -0.63 | -0.71 | -0.22 |
| Fulham | 0.23 | -0.37 | 0.07 | 0.15 | -0.08 | 0.10 | 0.17 | 0.42 | -0.84 | -1.61 | -0.18 |
| Birmingham | -0.12 | -0.37 | -0.48 | -0.31 | -0.38 | 0.68 | -0.83 | 0.53 | 0.53 | -0.51 | -0.13 |
| Middlesbrough | -0.12 | -0.85 | -0.67 | -0.15 | -0.08 | -0.19 | -0.50 | -0.05 | 0.21 | -0.40 | -0.28 |
| Southampton | -0.12 | 0.12 | -0.85 | 0.77 | -0.38 | -0.48 | -0.33 | 0.07 | 0.11 | -0.28 | -0.14 |
| Portsmouth | 0.57 | 0.12 | 1.19 | 0.46 | -0.98 | -1.06 | -1.66 | -0.74 | -0.84 | -1.28 | -0.42 |
| Tottenham | 0.23 | -0.37 | 0.82 | -0.77 | -0.38 | -1.35 | -1.33 | -0.16 | -0.32 | -0.01 | -0.36 |
| Blackburn | -1.17 | -2.32 | -0.67 | -1.38 | 0.53 | 0.10 | 0.67 | 0.07 | -0.74 | -0.92 | -0.58 |
| Manchester City | -1.17 | 0.12 | 0.45 | -0.31 | -0.38 | -0.48 | 0.33 | -0.16 | 0.74 | 1.02 | 0.02 |
| Everton | -0.12 | -0.37 | -0.30 | 0.31 | -1.28 | -0.77 | -0.67 | -0.98 | 0.11 | 0.33 | -0.37 |
| Leicester | -1.86 | -0.37 | -1.78 | -0.92 | -0.68 | -0.77 | 1.16 | -0.98 | -1.37 | -0.35 | -0.79 |
| Leeds | -1.17 | -0.85 | -0.67 | -1.38 | -0.68 | -1.64 | -1.16 | -2.25 | -2.21 | 0.14 | -1.19 |
| Wolves | -0.47 | -0.85 | -1.04 | -2.00 | -1.58 | -1.06 | -1.16 | -1.56 | -1.05 | -0.53 | -1.13 |


The Average

The simplest combined score is the average of the scores (or standardized scores) on each metric:

\[
\text{average} = 0.1 \times \text{home wins} + \dots + 0.1 \times \text{attendance}
\]

But information is lost: the variance of the average scores is only about 0.59, compared to the total variance of 10.0.
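The loss of information can be checked with a short sketch using the standardized scores in this deck. It assumes equal weights of 0.1 on each metric and the sample variance (n - 1); the variable names are illustrative.

```python
from statistics import mean, variance

# Arsenal's ten standardized scores (home wins ... attendance).
arsenal_z = [2.32, 2.56, 2.12, 1.23, 1.73, 2.43, 1.83, 1.93, 0.21, 0.27]

# An equal-weight average is the same as a weighted sum with weights of 0.1.
arsenal_average = mean(arsenal_z)
arsenal_weighted = sum(0.1 * z for z in arsenal_z)

# Average standardized score of each of the 20 teams, in league-table order.
averages = [1.66, 1.27, 1.32, 0.63, 0.60, 0.19, 0.10, -0.22, -0.18, -0.13,
            -0.28, -0.14, -0.42, -0.36, -0.58, 0.02, -0.37, -0.79, -1.19,
            -1.13]

var_of_average = variance(averages)

print("Arsenal average:", round(arsenal_average, 2))
print("variance of the averages:", round(var_of_average, 2))
print("share of total variance (10.0):", round(var_of_average / 10.0, 3))
```

The averaged score retains only a small fraction of the 10.0 units of variance carried by the ten standardized metrics.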


The Average

Also, the simple average is not very informative: if we ask why a team is good, the only way to answer is to refer to all ten metrics, which is inefficient for two reasons:

1. there are too many metrics to which to refer;

2. some of the metrics are very similar, so if we know that a team scored well on one metric we can assume that it probably scored well on a similar metric …


Correlations Between the Metrics

Some of the metrics seem to measure similar characteristics.

For example, home for and away for both relate to the team’s goal-scoring achievements.

Correlations between the metrics can be used to tell us whether the metrics are measuring similar aspects of the quality of a soccer team.
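As an illustration, the sketch below computes the Pearson correlation between pairs of metrics from the team data in this deck. The helper `pearson` is written out by hand here so that the example depends only on the standard library; it is one common way to compute the coefficient, not necessarily how the deck's figures were produced.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Values for the 20 teams, listed in league-table order (Arsenal first).
home_wins = [15, 12, 12, 10, 11, 9, 7, 6, 9, 8,
             8, 8, 10, 9, 5, 5, 8, 3, 5, 7]
home_for = [40, 34, 37, 29, 33, 24, 29, 24, 29, 26,
            25, 24, 35, 33, 25, 31, 27, 19, 25, 23]
away_for = [33, 33, 27, 26, 19, 24, 22, 24, 23, 17,
            19, 20, 12, 14, 26, 24, 18, 29, 15, 15]

r_wins_goals = pearson(home_wins, home_for)
r_home_away_goals = pearson(home_for, away_for)

print("home wins vs home for:", round(r_wins_goals, 2))
print("home for vs away for: ", round(r_home_away_goals, 2))
```

A coefficient near 1 means two metrics carry largely the same information about the teams, while a coefficient near 0 means they measure different things.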


Correlations Between the Metrics

Sum of diagonals = 10.

| | Home wins | Home losses | Home for | Home against | Away wins | Away losses | Away for | Away against | Bookings | Attendance |
|---|---|---|---|---|---|---|---|---|---|---|
| Home wins | 1.00 | 0.78 | 0.81 | 0.76 | 0.49 | 0.67 | 0.26 | 0.71 | 0.47 | 0.38 |
| Home losses | 0.78 | 1.00 | 0.67 | 0.78 | 0.48 | 0.64 | 0.45 | 0.59 | 0.43 | 0.52 |
| Home for | 0.81 | 0.67 | 1.00 | 0.58 | 0.47 | 0.50 | 0.20 | 0.60 | 0.44 | 0.44 |
| Home against | 0.76 | 0.78 | 0.58 | 1.00 | 0.47 | 0.64 | 0.42 | 0.65 | 0.51 | 0.46 |
| Away wins | 0.49 | 0.48 | 0.47 | 0.47 | 1.00 | 0.71 | 0.78 | 0.74 | 0.44 | 0.30 |
| Away losses | 0.67 | 0.64 | 0.50 | 0.64 | 0.71 | 1.00 | 0.71 | 0.88 | 0.62 | 0.30 |
| Away for | 0.26 | 0.45 | 0.20 | 0.42 | 0.78 | 0.71 | 1.00 | 0.64 | 0.33 | 0.29 |
| Away against | 0.71 | 0.59 | 0.60 | 0.65 | 0.74 | 0.88 | 0.64 | 1.00 | 0.75 | 0.28 |
| Bookings | 0.47 | 0.43 | 0.44 | 0.51 | 0.44 | 0.62 | 0.33 | 0.75 | 1.00 | 0.47 |
| Attendance | 0.38 | 0.52 | 0.44 | 0.46 | 0.30 | 0.30 | 0.29 | 0.28 | 0.47 | 1.00 |


Independent Metrics

Positive correlations between the metrics show that they are measuring similar aspects of the quality of a soccer team.

We would like to combine the metrics somehow so that common aspects are measured on a single metric, and each combination measures a different aspect of the quality of a soccer team (i.e., the correlations between these new metrics are zero). Each new metric must have high variance so that teams can be distinguished effectively.


Independent Metrics

Objectives:

1. the new metrics are uncorrelated;

2. each metric in turn summarizes as much information as possible (its variance is maximized);

3. there is no loss of information.

New metrics that meet these objectives are called principal components.


Principal Components

Principal components are weighted sums of the original metrics. Weighted sums are like weighted averages, except that the weights do not have to add up to 1.0. Instead, with principal components the squares of the weights add up to 1.0 (see note 1). The weights are known as eigenvectors, and are frequently referred to as loadings.

The weighted sums are the scores on the new metrics. The new metrics are called principal components.

Note 1: A few authors draw the following distinction: for EOFs the sum of the squared weights is 1; for principal components the sum is equal to the eigenvalue.
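One way to see where such weights come from is to find the leading eigenvector of the correlation matrix of the metrics. The sketch below uses power iteration, a simple technique chosen here only because it needs no external library; the deck does not say how its components were computed, and the function name `leading_component` is hypothetical. The matrix is the correlation table from this deck, rounded to two decimals, so the results are approximate.

```python
from math import sqrt

# Correlation matrix of the ten metrics, in the order: home wins, home losses,
# home for, home against, away wins, away losses, away for, away against,
# bookings, attendance.
R = [
    [1.00, 0.78, 0.81, 0.76, 0.49, 0.67, 0.26, 0.71, 0.47, 0.38],
    [0.78, 1.00, 0.67, 0.78, 0.48, 0.64, 0.45, 0.59, 0.43, 0.52],
    [0.81, 0.67, 1.00, 0.58, 0.47, 0.50, 0.20, 0.60, 0.44, 0.44],
    [0.76, 0.78, 0.58, 1.00, 0.47, 0.64, 0.42, 0.65, 0.51, 0.46],
    [0.49, 0.48, 0.47, 0.47, 1.00, 0.71, 0.78, 0.74, 0.44, 0.30],
    [0.67, 0.64, 0.50, 0.64, 0.71, 1.00, 0.71, 0.88, 0.62, 0.30],
    [0.26, 0.45, 0.20, 0.42, 0.78, 0.71, 1.00, 0.64, 0.33, 0.29],
    [0.71, 0.59, 0.60, 0.65, 0.74, 0.88, 0.64, 1.00, 0.75, 0.28],
    [0.47, 0.43, 0.44, 0.51, 0.44, 0.62, 0.33, 0.75, 1.00, 0.47],
    [0.38, 0.52, 0.44, 0.46, 0.30, 0.30, 0.29, 0.28, 0.47, 1.00],
]


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def leading_component(matrix, iterations=200):
    """Power iteration: returns (variance, weights) of the first component."""
    n = len(matrix)
    w = [1.0 / sqrt(n)] * n  # start from equal weights of unit length
    for _ in range(iterations):
        v = mat_vec(matrix, w)
        norm = sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]  # rescale so the squared weights sum to 1
    # The variance of the weighted sum is w' R w (the eigenvalue).
    eigenvalue = sum(a * b for a, b in zip(w, mat_vec(matrix, w)))
    return eigenvalue, w


eigenvalue, weights = leading_component(R)
print("variance of first component:", round(eigenvalue, 2))
print("weights:", [round(x, 3) for x in weights])
print("sum of squared weights:", round(sum(x * x for x in weights), 6))
```

The variance printed here should come out close to the 6.01 reported for PC 1 in this deck, with small differences expected from the rounding of the correlations.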


Covariances Between the Principal Components

Sum of diagonals = 10

| | PC 1 | PC 2 | PC 3 | PC 4 | PC 5 | PC 6 | PC 7 | PC 8 | PC 9 | PC 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| PC 1 | 6.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| PC 2 | 0.00 | 1.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| PC 3 | 0.00 | 0.00 | 0.82 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| PC 4 | 0.00 | 0.00 | 0.00 | 0.71 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| PC 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| PC 6 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 |
| PC 7 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.17 | 0.00 | 0.00 | 0.00 |
| PC 8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.13 | 0.00 | 0.00 |
| PC 9 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 | 0.00 |
| PC 10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.05 |


Eigenvalues

The variances of the principal components are called eigenvalues.

The total variance explained by all the principal components is the same as that of the original standardized metrics, and so no information is lost. But most of the total variance is explained by only a few components: the first component alone has a variance of 6.01, compared with only 0.59 for the average of the standardized scores.

Principal components with variances > 1.0 have more information than any of the original standardized metrics.
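The share of the total variance carried by each component follows directly from the eigenvalues. The sketch below uses the ten variances from the covariance table in this deck; the variable names are illustrative.

```python
# Variances (eigenvalues) of PC 1 to PC 10, from the covariance table.
eigenvalues = [6.01, 1.32, 0.82, 0.71, 0.49, 0.22, 0.17, 0.13, 0.08, 0.05]

total = sum(eigenvalues)  # equals the number of standardized metrics
fractions = [ev / total for ev in eigenvalues]

# Running total of the fraction of variance explained.
cumulative = []
running = 0.0
for fraction in fractions:
    running += fraction
    cumulative.append(running)

# Components that carry more information than any single standardized metric.
above_one = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

for i, (f, c) in enumerate(zip(fractions, cumulative), start=1):
    print(f"PC {i}: {f:.1%} of total variance, {c:.1%} cumulative")
print("components with variance > 1.0:", above_one)
```

Only the first two components have variances above 1.0, and together they account for roughly three quarters of the total.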


Soccer Team Principal Component 1

We can obtain a score for each team by calculating the weighted sum of its scores on the 10 original metrics. For example, for Arsenal:

\[
\text{PC 1}_{\text{Arsenal}} = 0.342 \times \text{home wins} + \dots + 0.221 \times \text{attendance}
\]


Soccer Team Principal Component 1

The score tells us whether the team out-performs its opponents, while playing fairly and drawing large crowds.


Soccer Team Principal Component 2

The score tells us whether the team plays better at home or away.
