Lecture 1: Multivariate Data
Mans Thulin
Department of Mathematics, Uppsala University
Multivariate Methods • 22/3 2011
Outline
- Multivariate data and matrices
  - Notation and basic facts
- Descriptive statistics
  - Generalizations of univariate means, variances...
- Graphical methods
  - How to visualize multivariate data sets
- Distances
  - Is there a need to go beyond Euclid?
- Linear algebra
  - What's important in Chapter 2?
Multivariate data and matrices
General situation: p variables are measured on n subjects (patients, items, countries, ...). That is, we have n observations of a p-variate random variable.

Ex: Height, weight and age are measured for 23 people. Then p = 3 and n = 23.

The data is stored in an array (a matrix):

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}

Row j contains the p measurements for subject j.
x_{jk} = measurement k for subject j.
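As a concrete illustration, here is a minimal Python sketch of such a data matrix (the lectures use R; the values here are invented, and the indexing is 0-based rather than the slide's 1-based convention):

```python
# A toy data matrix X with n = 4 subjects and p = 3 variables
# (say height in cm, weight in kg, age in years; values invented).
X = [
    [180.0, 75.0, 34.0],
    [165.0, 58.0, 28.0],
    [172.0, 70.0, 45.0],
    [158.0, 52.0, 23.0],
]

n = len(X)        # number of subjects (rows)
p = len(X[0])     # number of variables (columns)

# x_jk = measurement k for subject j (0-indexed here).
x_23 = X[1][2]    # second subject, third variable
print(n, p, x_23) # prints 4 3 28.0
```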
Descriptive statistics: mean and variance
For univariate data, we usually look at summary statistics, such as the sample mean \bar{x} and the sample variance s^2. We can calculate these as usual for each of the p variables:

\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}

s_k^2 = s_{kk} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p

The marginal sample variances s_k^2 are usually put in a sample covariance matrix together with the sample covariances.

Beware: Notational hazard! When we look at the sample covariance matrix, s_k^2 is usually denoted s_{kk}, without the square! The sample standard deviation for variable k is denoted \sqrt{s_{kk}}.
Descriptive statistics: covariance and correlation
The sample covariance between the variables k and \ell is defined as

s_{k\ell} = s_{\ell k} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)(x_{j\ell} - \bar{x}_\ell), \qquad k, \ell = 1, 2, \ldots, p

It measures the linear association between the variables. It is common to rescale the covariance by dividing by the standard deviations of the variables. The number thus obtained is the sample correlation:

r_{k\ell} = r_{\ell k} = \frac{s_{k\ell}}{\sqrt{s_{kk}} \sqrt{s_{\ell\ell}}}

Note that r_{kk} = 1.
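The two formulas for a single pair of variables translate directly into code. A hedged Python sketch (toy data, values invented):

```python
# Sample covariance s_kl and correlation r_kl between two variables,
# following the formulas above (divisor n - 1).
xs = [2.0, 4.0, 6.0, 8.0]   # variable k
ys = [1.0, 3.0, 2.0, 6.0]   # variable l
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n

s_kl = sum((xs[j] - xbar) * (ys[j] - ybar) for j in range(n)) / (n - 1)
s_kk = sum((xs[j] - xbar) ** 2 for j in range(n)) / (n - 1)
s_ll = sum((ys[j] - ybar) ** 2 for j in range(n)) / (n - 1)

# Rescaling the covariance by the standard deviations gives the correlation.
r_kl = s_kl / (s_kk ** 0.5 * s_ll ** 0.5)
print(s_kl, r_kl)  # r_kl always lies in [-1, 1]
```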
Descriptive statistics: arrays
Sample mean: \bar{\mathbf{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}

Sample covariance matrix: S_n = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{pmatrix}

Sample correlation matrix: R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{pmatrix}
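Both arrays can be assembled from the pairwise formulas. A small Python sketch (toy data matrix, values invented):

```python
# Build the sample covariance matrix S and correlation matrix R
# for a p-variate data matrix, as in the arrays above.
X = [
    [2.0, 1.0, 5.0],
    [4.0, 3.0, 4.0],
    [6.0, 2.0, 7.0],
    [8.0, 6.0, 6.0],
]
n, p = len(X), len(X[0])
xbar = [sum(X[j][k] for j in range(n)) / n for k in range(p)]

def s(k, l):
    """Sample covariance between variables k and l (divisor n - 1)."""
    return sum((X[j][k] - xbar[k]) * (X[j][l] - xbar[l])
               for j in range(n)) / (n - 1)

S = [[s(k, l) for l in range(p)] for k in range(p)]
R = [[S[k][l] / (S[k][k] ** 0.5 * S[l][l] ** 0.5) for l in range(p)]
     for k in range(p)]

# S and R are symmetric, and R has ones on its diagonal.
print(S)
print(R)
```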
Graphical methods
"A picture is worth a thousand words..."

Some useful approaches for visualizing multivariate data are:

- 3D plots
- Scatter plots
- Bubble plots
- Stars
- Chernoff faces
- Andrews' curves

(on the other hand, 1001 words are worth more than a picture...)
Graphical methods: 3D plots

Three-dimensional scatter plots:
[Figure: three-dimensional scatter plot; axes labelled xyz[,1], xyz[,2], xyz[,3]]
Graphical methods: 3D plots
- Often a good choice for three-dimensional data.
- Enables us to see patterns that would disappear if the data was projected to two dimensions.
- Best when interactive – i.e. when the plot can be rotated.
Graphical methods: Scatter plots

Scatter plots of each pair of variables, usually combined with the marginal histograms.
[Figure: scatter-plot matrix of the variables x_1, x_2, x_3, x_4]
Graphical methods: Scatter plots
- Can, unlike 3D plots, be used for p greater than 3.
- Usually a good first choice for plotting.
- Good for detecting dependencies between pairs of variables.
- The higher-dimensional perspective is lost – important dependencies between more than two variables might go unnoticed.
- Useful for assessing multivariate normality.
Graphical methods: Bubble plots
In a regular 2D plot, a third dimension can be illustrated using bubbles of different sizes.
[Figure: bubble plot of Mortality[1:15] against SO2[1:15]]
Graphical methods: Bubble plots
- Simple and easy to interpret.
- Possible to make nice-looking plots.
- Possible extensions to higher dimensions?
  - 3D bubble plots, colours of bubbles, shapes of bubbles, time dimension...
Graphical methods: Stars
The lengths of the rays from the center of the figure represent the values of the variables. Example with p = 7 and n = 9:

[Figure: star plots for nine cars from the Motor Trend data: Mazda RX4, Mazda RX4 Wag, Datsun 710, Hornet 4 Drive, Hornet Sportabout, Valiant, Duster 360, Merc 240D, Merc 230]
Different versions of stars can be found in the literature.
Graphical methods: Stars
- Can be used for dimensions higher than three.
- Relatively easy to interpret.
- Can be useful for finding similar data points.
- If plotted in a time sequence, can illustrate change over time.
- Only useful for relative comparisons.
Graphical methods: Chernoff faces

The human mind is extremely good at facial recognition. Chernoff proposed illustrating data sets with facial features.

Herman Chernoff (1973), The use of faces to represent points in k-dimensional space graphically, Journal of the American Statistical Association, 68, pp. 361-368.
[Figure: Chernoff faces for six cities: akronOH, albanyNY, allenPA, bufaloNY, cantonOH, chatagTN]
Graphical methods: Chernoff faces
- Each variable is represented by a facial feature (e.g. length of nose, size of eyes, width of head...).
- Easy to see groups or clusters in the data.
- Can be used to find outliers.
- If plotted in a time sequence, can illustrate change over time.
- Can be used for p ≤ 18.
- Care must be taken when choosing which variable is represented by which facial feature. We react more strongly to changes in some of the features!
- Useful or just plain stupid?
- Does the implementation in R work properly? See Computer exercise 1.
Graphical methods: Andrews' curves

Andrews proposed that the observations should be projected onto a p-dimensional space of functions, since we are used to comparing functions. He used the observations as Fourier coefficients and plotted the corresponding functions in the interval 0 < t < 2π:

f_{\mathbf{x}}(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \ldots

D.F. Andrews (1973), Plots of High-Dimensional Data, Biometrics, 28, pp. 125-136.
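The Fourier-coefficient construction can be sketched in a few lines of Python (the helper name `andrews_curve` is hypothetical; only the formula itself comes from the slide):

```python
import math

def andrews_curve(x, t):
    """Evaluate f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t
    + x4 sin 2t + x5 cos 2t + ... for one observation x."""
    total = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2           # frequencies: 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            total += xi * math.sin(k * t)
        else:
            total += xi * math.cos(k * t)
    return total

# At t = 0 all sine terms vanish, so f_x(0) = x1/sqrt(2) + x3 + x5 + ...
print(andrews_curve([1.0, 2.0, 3.0, 4.0, 5.0], 0.0))
```

Plotting each observation's curve over 0 < t < 2π gives one line per data point, as in the figure below.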
[Figure: Andrews' curves for the iris data, grouped by species (setosa, versicolor, virginica)]
Graphical methods: Andrews’ curves
- Useful for finding clusters – subgroups of data are characterized by similar curves.
- Can be used to find linear relationships – if a point y lies on a line between x and z, then f_y(t) lies between f_x(t) and f_z(t) for all t.
- Can illustrate data with very high dimensions.
- Can be used for testing (see original article).
- Becomes cluttered when there are too many data points (as in the picture on the previous slide?).
- Perhaps not as intuitive as some of the other methods – more useful to mathematicians than to practitioners?
Geographical data
The social network Facebook stores lots of data about its users and their online activities.

Such large databases – or parts thereof – can be difficult to visualize.

In December 2010 the Facebook infrastructure engineering team used R to create a map of Facebook friendships. Lines between cities represent friendships between the cities' inhabitants.
[Figures: the Facebook friendship map]
Distances

Why statistical distances?

- Account for differences in variation
- Account for presence of correlation
[Figure: scatter plot of a two-dimensional data set in the (x, y) plane]
Distances: definition
Definition. Let P and Q be two points, where these may represent measurements x and y on two objects. A real-valued function d(P,Q) is a distance function if it satisfies the following properties:
(i) d(P,Q) = d(Q,P) (Symmetry)
(ii) d(P,Q) ≥ 0 (Non-negativity)
(iii) d(P,P) = 0 (Identification mark)
For many distance functions the following properties also hold:
(iv) d(P,Q) = 0 iff P = Q (Definiteness)
(v) d(P,Q) ≤ d(P,R) + d(R,Q) (Triangle inequality)
If (i)-(v) hold, d is called a metric.
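As a quick sanity check, properties (i)-(v) can be verified numerically for the familiar Euclidean distance on a few arbitrarily chosen points (a sketch, not a proof):

```python
# Numerical check of the metric properties (i)-(v) for Euclidean distance.
def d(P, Q):
    return sum((p - q) ** 2 for p, q in zip(P, Q)) ** 0.5

P, Q, R = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert d(P, Q) == d(Q, P)            # (i) symmetry
assert d(P, Q) >= 0                  # (ii) non-negativity
assert d(P, P) == 0                  # (iii) d(P, P) = 0
assert d(P, Q) > 0                   # (iv) definiteness, since P != Q
assert d(P, Q) <= d(P, R) + d(R, Q)  # (v) triangle inequality
print(d(P, Q))  # prints 5.0 (the 3-4-5 triangle)
```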
Distances: some examples
Distance between points x = (x_1, \ldots, x_p) and y = (y_1, \ldots, y_p):

- Euclidean distance:
  \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_p - y_p)^2} = \sqrt{(\mathbf{x} - \mathbf{y})^T (\mathbf{x} - \mathbf{y})}

- Statistical distance:
  \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \ldots + \frac{(x_p - y_p)^2}{s_{pp}}}
  takes the variances of the variables into account. Same as Euclidean distance for standardized data.

- Mahalanobis distance:
  \sqrt{(\mathbf{x} - \mathbf{y})^T S_n^{-1} (\mathbf{x} - \mathbf{y})}
  also takes the covariances/correlations into account. If the variables are uncorrelated, this reduces to the statistical distance.
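The three distances can be compared side by side in a small Python sketch for p = 2, with a hand-made covariance matrix (all numbers invented for illustration):

```python
# Euclidean, statistical and Mahalanobis distances between two points.
x = (1.0, 2.0)
y = (3.0, 6.0)

# Sample variances and covariance, assumed given here.
s11, s22, s12 = 4.0, 1.0, 0.0

d = (x[0] - y[0], x[1] - y[1])

euclid = (d[0] ** 2 + d[1] ** 2) ** 0.5
statistical = (d[0] ** 2 / s11 + d[1] ** 2 / s22) ** 0.5

# Mahalanobis needs S_n^{-1}; a 2x2 matrix can be inverted by hand.
det = s11 * s22 - s12 ** 2
inv = ((s22 / det, -s12 / det), (-s12 / det, s11 / det))
maha = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
        + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1])) ** 0.5

# With s12 = 0 (uncorrelated variables) the Mahalanobis distance
# reduces to the statistical distance, as noted above.
print(euclid, statistical, maha)
```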
Linear algebra
Sections 2.1-2.5, 2.7 and 2A of Johnson & Wichern will not be discussed during the lectures. These contain "well-known" results from linear algebra. You might want to go through those sections on your own.
The following topics and results are of particular interest to us:
- Result 2A.14 on page 100: how to express a matrix using its eigenvalues and eigenvectors.
- Positive definite matrices.
- Matrix ranks.
Summary
- Multivariate data and matrices
  - n measurements of p variables, stored in a matrix.
- Descriptive statistics
  - Sample means and variances are calculated for each variable.
  - Covariances and correlations between pairs of variables.
  - Stored in arrays.
- Graphical methods
  - 3D plots
  - Scatter plots
  - Bubble plots
  - Stars
  - Chernoff faces
  - Andrews' curves
- Distances
  - Modify Euclidean distance to account for correlations and differences in variance.