Lecture 1: Multivariate Data
Mans Thulin
Department of Mathematics, Uppsala University
Multivariate Methods • 22/3 2011
Outline
- Multivariate data and matrices
  - Notation and basic facts
- Descriptive statistics
  - Generalizations of univariate means, variances...
- Graphical methods
  - How to visualize multivariate data sets
- Distances
  - Is there a need to go beyond Euclid?
- Linear algebra
  - What's important in Chapter 2?
Multivariate data and matrices
General situation: p variables are measured on n subjects (patients, items, countries, ...). That is, we have n observations of a p-variate random variable.

Ex: Height, weight and age are measured for 23 people. Then p = 3 and n = 23.

The data is stored in an array (a matrix):

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}

Row j contains the p measurements for subject j.
x_{jk} = measurement k for subject j.
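As a concrete illustration, here is a minimal Python sketch of such a data matrix (the lectures use R; the values here are invented, and the indexing is 0-based rather than the slide's 1-based convention):

```python
# A toy data matrix X with n = 4 subjects and p = 3 variables
# (say height in cm, weight in kg, age in years; values invented).
X = [
    [180.0, 75.0, 34.0],
    [165.0, 58.0, 28.0],
    [172.0, 70.0, 45.0],
    [158.0, 52.0, 23.0],
]

n = len(X)        # number of subjects (rows)
p = len(X[0])     # number of variables (columns)

# x_jk = measurement k for subject j (0-indexed here).
x_23 = X[1][2]    # second subject, third variable
print(n, p, x_23) # prints 4 3 28.0
```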
Descriptive statistics: mean and variance
For univariate data, we usually look at summary statistics, such as the sample mean \bar{x} and the sample variance s^2. We can calculate these as usual for each of the p variables:

\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}

s_k^2 = s_{kk} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p

The marginal sample variances s_k^2 are usually put in a sample covariance matrix together with the sample covariances.

Beware: Notational hazard! When we look at the sample covariance matrix, s_k^2 is usually denoted s_{kk}, without the square! The sample standard deviation for variable k is denoted \sqrt{s_{kk}}.
Descriptive statistics: covariance and correlation
The sample covariance between the variables k and \ell is defined as

s_{k\ell} = s_{\ell k} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)(x_{j\ell} - \bar{x}_\ell), \qquad k, \ell = 1, 2, \ldots, p

It measures the linear association between the variables. It is common to rescale the covariance by dividing by the standard deviations of the variables. The number thus obtained is the sample correlation:

r_{k\ell} = r_{\ell k} = \frac{s_{k\ell}}{\sqrt{s_{kk}} \sqrt{s_{\ell\ell}}}

Note that r_{kk} = 1.
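The two formulas for a single pair of variables translate directly into code. A hedged Python sketch (toy data, values invented):

```python
# Sample covariance s_kl and correlation r_kl between two variables,
# following the formulas above (divisor n - 1).
xs = [2.0, 4.0, 6.0, 8.0]   # variable k
ys = [1.0, 3.0, 2.0, 6.0]   # variable l
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n

s_kl = sum((xs[j] - xbar) * (ys[j] - ybar) for j in range(n)) / (n - 1)
s_kk = sum((xs[j] - xbar) ** 2 for j in range(n)) / (n - 1)
s_ll = sum((ys[j] - ybar) ** 2 for j in range(n)) / (n - 1)

# Rescaling the covariance by the standard deviations gives the correlation.
r_kl = s_kl / (s_kk ** 0.5 * s_ll ** 0.5)
print(s_kl, r_kl)  # r_kl always lies in [-1, 1]
```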
Descriptive statistics: arrays
Sample mean: \bar{\mathbf{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}

Sample covariance matrix: S_n = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{pmatrix}

Sample correlation matrix: R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{pmatrix}
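Both arrays can be assembled from the pairwise formulas. A small Python sketch (toy data matrix, values invented):

```python
# Build the sample covariance matrix S and correlation matrix R
# for a p-variate data matrix, as in the arrays above.
X = [
    [2.0, 1.0, 5.0],
    [4.0, 3.0, 4.0],
    [6.0, 2.0, 7.0],
    [8.0, 6.0, 6.0],
]
n, p = len(X), len(X[0])
xbar = [sum(X[j][k] for j in range(n)) / n for k in range(p)]

def s(k, l):
    """Sample covariance between variables k and l (divisor n - 1)."""
    return sum((X[j][k] - xbar[k]) * (X[j][l] - xbar[l])
               for j in range(n)) / (n - 1)

S = [[s(k, l) for l in range(p)] for k in range(p)]
R = [[S[k][l] / (S[k][k] ** 0.5 * S[l][l] ** 0.5) for l in range(p)]
     for k in range(p)]

# S and R are symmetric, and R has ones on its diagonal.
print(S)
print(R)
```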
Graphical methods
"A picture is worth a thousand words..."

Some useful approaches for visualizing multivariate data are:

- 3D plots
- Scatter plots
- Bubble plots
- Stars
- Chernoff faces
- Andrews' curves

(on the other hand, 1001 words are worth more than a picture...)
Graphical methods: 3D plots

Three-dimensional scatter plots:
[Figure: three-dimensional scatter plot; axes labelled xyz[,1], xyz[,2], xyz[,3]]
Graphical methods: 3D plots
- Often a good choice for three-dimensional data.
- Enables us to see patterns that would disappear if the data was projected to two dimensions.
- Best when interactive – i.e. when the plot can be rotated.
Graphical methods: Scatter plots

Scatter plots of each pair of variables, usually combined with the marginal histograms.
[Figure: scatter-plot matrix of the variables x_1, x_2, x_3, x_4]
Graphical methods: Scatter plots
- Can, unlike 3D plots, be used for p greater than 3.
- Usually a good first choice for plotting.
- Good for detecting dependencies between pairs of variables.
- The higher-dimensional perspective is lost – important dependencies between more than two variables might go unnoticed.
- Useful for assessing multivariate normality.
Graphical methods: Bubble plots
In a regular 2D plot, a third dimension can be illustrated using bubbles of different sizes.
[Figure: bubble plot of Mortality[1:15] against SO2[1:15]]
Graphical methods: Bubble plots
- Simple and easy to interpret.
- Possible to make nice-looking plots.
- Possible extensions to higher dimensions?
  - 3D bubble plots, colours of bubbles, shapes of bubbles, time dimension...
Graphical methods: Stars
The lengths of the rays from the center of the figure represent the values of the variables. Example with p = 7 and n = 9:

[Figure: star plots for nine cars from the Motor Trend data: Mazda RX4, Mazda RX4 Wag, Datsun 710, Hornet 4 Drive, Hornet Sportabout, Valiant, Duster 360, Merc 240D, Merc 230]
Different versions of stars can be found in the literature.
Graphical methods: Stars
- Can be used for dimensions higher than three.
- Relatively easy to interpret.
- Can be useful for finding similar data points.
- If plotted in a time sequence, can illustrate change over time.
- Only useful for relative comparisons.
Graphical methods: Chernoff faces

The human mind is extremely good at facial recognition. Chernoff proposed illustrating data sets with facial features.

Herman Chernoff (1973), The use of faces to represent points in k-dimensional space graphically, Journal of the American Statistical Association, 68, pp. 361-368.
[Figure: Chernoff faces for six cities: akronOH, albanyNY, allenPA, bufaloNY, cantonOH, chatagTN]
Graphical methods: Chernoff faces
- Each variable is represented by a facial feature (e.g. length of nose, size of eyes, width of head...).
- Easy to see groups or clusters in the data.
- Can be used to find outliers.
- If plotted in a time sequence, can illustrate change over time.
- Can be used for p ≤ 18.
- Care must be taken when choosing which variable is represented by which facial feature. We react more strongly to changes in some of the features!
- Useful or just plain stupid?
- Does the implementation in R work properly? See Computer exercise 1.
Graphical methods: Andrews' curves

Andrews proposed that the observations should be projected onto a p-dimensional space of functions, since we are used to comparing functions. He used the observations as Fourier coefficients and plotted the corresponding functions in the interval 0 < t < 2π:

f_{\mathbf{x}}(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \ldots

D.F. Andrews (1973), Plots of High-Dimensional Data, Biometrics, 28, pp. 125-136.
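The Fourier-coefficient construction can be sketched in a few lines of Python (the helper name `andrews_curve` is hypothetical; only the formula itself comes from the slide):

```python
import math

def andrews_curve(x, t):
    """Evaluate f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t
    + x4 sin 2t + x5 cos 2t + ... for one observation x."""
    total = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2           # frequencies: 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            total += xi * math.sin(k * t)
        else:
            total += xi * math.cos(k * t)
    return total

# At t = 0 all sine terms vanish, so f_x(0) = x1/sqrt(2) + x3 + x5 + ...
print(andrews_curve([1.0, 2.0, 3.0, 4.0, 5.0], 0.0))
```

Plotting each observation's curve over 0 < t < 2π gives one line per data point, as in the figure below.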
[Figure: Andrews' curves for the iris data, grouped by species (setosa, versicolor, virginica)]
Graphical methods: Andrews’ curves
- Useful for finding clusters – subgroups of data are characterized by similar curves.
- Can be used to find linear relationships – if a point y lies on a line between x and z, then f_y(t) lies between f_x(t) and f_z(t) for all t.
- Can illustrate data with very high dimensions.
- Can be used for testing (see original article).
- Becomes cluttered when there are too many data points (as in the picture on the previous slide?).
- Perhaps not as intuitive as some of the other methods – more useful to mathematicians than to practitioners?
Geographical data
The social network Facebook stores lots of data about its users and their online activities.

Such large databases – or parts thereof – can be difficult to visualize.

In December 2010 the Facebook infrastructure engineering team used R to create a map of Facebook friendships. Lines between cities represent friendships between the cities' inhabitants.
[Figures: the Facebook friendship map]
Distances

Why statistical distances?

- Account for differences in variation
- Account for presence of correlation
[Figure: scatter plot of a two-dimensional data set in the (x, y) plane]
Distances: definition
Definition. Let P and Q be two points, where these may represent measurements x and y on two objects. A real-valued function d(P,Q) is a distance function if it satisfies the following properties:
(i) d(P,Q) = d(Q,P) (Symmetry)
(ii) d(P,Q) ≥ 0 (Non-negativity)
(iii) d(P,P) = 0 (Identification mark)
For many distance functions the following properties also hold:
(iv) d(P,Q) = 0 iff P = Q (Definiteness)
(v) d(P,Q) ≤ d(P,R) + d(R,Q) (Triangle inequality)
If (i)-(v) hold, d is called a metric.
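As a quick sanity check, properties (i)-(v) can be verified numerically for the familiar Euclidean distance on a few arbitrarily chosen points (a sketch, not a proof):

```python
# Numerical check of the metric properties (i)-(v) for Euclidean distance.
def d(P, Q):
    return sum((p - q) ** 2 for p, q in zip(P, Q)) ** 0.5

P, Q, R = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert d(P, Q) == d(Q, P)            # (i) symmetry
assert d(P, Q) >= 0                  # (ii) non-negativity
assert d(P, P) == 0                  # (iii) d(P, P) = 0
assert d(P, Q) > 0                   # (iv) definiteness, since P != Q
assert d(P, Q) <= d(P, R) + d(R, Q)  # (v) triangle inequality
print(d(P, Q))  # prints 5.0 (the 3-4-5 triangle)
```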
Distances: some examples
Distance between points x = (x_1, \ldots, x_p) and y = (y_1, \ldots, y_p):

- Euclidean distance:
  \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_p - y_p)^2} = \sqrt{(\mathbf{x} - \mathbf{y})^T (\mathbf{x} - \mathbf{y})}

- Statistical distance:
  \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \ldots + \frac{(x_p - y_p)^2}{s_{pp}}}
  takes the variances of the variables into account. Same as Euclidean distance for standardized data.

- Mahalanobis distance:
  \sqrt{(\mathbf{x} - \mathbf{y})^T S_n^{-1} (\mathbf{x} - \mathbf{y})}
  also takes the covariances/correlations into account. If the variables are uncorrelated, this reduces to the statistical distance.
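The three distances can be compared side by side in a small Python sketch for p = 2, with a hand-made covariance matrix (all numbers invented for illustration):

```python
# Euclidean, statistical and Mahalanobis distances between two points.
x = (1.0, 2.0)
y = (3.0, 6.0)

# Sample variances and covariance, assumed given here.
s11, s22, s12 = 4.0, 1.0, 0.0

d = (x[0] - y[0], x[1] - y[1])

euclid = (d[0] ** 2 + d[1] ** 2) ** 0.5
statistical = (d[0] ** 2 / s11 + d[1] ** 2 / s22) ** 0.5

# Mahalanobis needs S_n^{-1}; a 2x2 matrix can be inverted by hand.
det = s11 * s22 - s12 ** 2
inv = ((s22 / det, -s12 / det), (-s12 / det, s11 / det))
maha = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
        + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1])) ** 0.5

# With s12 = 0 (uncorrelated variables) the Mahalanobis distance
# reduces to the statistical distance, as noted above.
print(euclid, statistical, maha)
```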
Linear algebra
Sections 2.1-2.5, 2.7 and 2A of Johnson & Wichern will not be discussed during the lectures. These contain "well-known" results from linear algebra. You might want to go through those sections on your own.
The following topics and results are of particular interest to us:
- Result 2A.14 on page 100: how to express a matrix using its eigenvalues and eigenvectors.
- Positive definite matrices.
- Matrix ranks.
Summary
- Multivariate data and matrices
  - n measurements of p variables, stored in a matrix.
- Descriptive statistics
  - Sample means and variances are calculated for each variable.
  - Covariances and correlations between pairs of variables.
  - Stored in arrays.
- Graphical methods
  - 3D plots
  - Scatter plots
  - Bubble plots
  - Stars
  - Chernoff faces
  - Andrews' curves
- Distances
  - Modify Euclidean distance to account for correlations and differences in variance.