
Chapter 3

DESCRIPTION OF MULTIVARIATE DATA

P. C. Mahalanobis (1893-1972)

An Indian statistician, Mahalanobis studied physics in Calcutta and Cambridge. He became interested in statistics as a way of solving economic and cultural problems in India. In 1931, he founded the Indian Statistical Institute, one of the first centers in the world to teach statistics to researchers from all fields, and Sankhya: The Indian Journal of Statistics. He was the principal designer of the five-year development plan following the independence of India.

3.1. INTRODUCTION

In this chapter and the next, we will look at how to describe a set of multivariate data. Let us suppose that we have observed a set of variables in a sample of elements of a population; in this chapter we present methods which allow us to summarize the values of the variables and to describe their dependence structure. In the following chapter we will complete this descriptive analysis by looking at how to represent the data graphically and how to choose transformations of the original variables which lead to a simpler description. We will also discuss the problem of cleaning the data of atypical values, which are observations resulting from measurement errors or other causes of heterogeneity.

The descriptive analysis presented in this chapter must always be applied as a first step in order to understand the structure of the data and to extract the information they contain before moving on to the more complex methods shown in the following chapters. The simple tools described in these two chapters can, on occasion, solve the problem for which the data were collected. In particular, when the interest is centered on the relationship between the variables or on the comparison of two sets of data, the descriptive methods can be of great use before setting out on a more complex study.

In the description of multivariate data, an important concept is the distance between two points. The most widely used distance in Statistics is that of Mahalanobis.


      EC    x1   x2   x3
      B      1    0    0
      G      0    1    0
      Br     0    0    1
      Bl     0    0    0

Table 3.1: Codification of categorical variables

3.2. MULTIVARIATE DATA

3.2.1. Types of variables

The basic information for the methods studied in this book can be of various types. The most typical is a table containing the values of p variables observed in n elements. The variables can be quantitative, when their value is expressed numerically, such as the age of a person, their height or their income, or qualitative, when their value corresponds to a category, such as gender, eye color or city of birth. Quantitative variables can in turn be classified as continuous (or interval) variables, when they can take any real value within an interval, such as height, or discrete, when they take distinct and separate values, such as the number of siblings. Qualitative variables can be classified as binary, when there are only two possible values, such as gender (male, female), or general, when many values are possible, such as city of residence.

Let us suppose from here on that the binary variables have been coded as numerical (for example, the gender variable is converted to numerical by assigning a zero to male and a one to female). Qualitative variables can also be assigned a numerical value, but the procedure is different. If the values of the categories are not related, the easiest way to code them is by converting them into binary variables. For example, let us suppose that the variable is eye color, EC, and, to simplify matters, that the possible categories are blue (B), green (G), brown (Br) and black (Bl). We have p = 4 categories that we can represent with p - 1 = 3 binary variables defined as:

a) x1 = 1 if EC = B, x1 = 0 otherwise.
b) x2 = 1 if EC = G, x2 = 0 otherwise.
c) x3 = 1 if EC = Br, x3 = 0 otherwise.

Table 3.1 presents the codification of the EC variable in terms of the three binary quantitative variables x1, x2, x3.

If the number of possible classes of a qualitative variable is large, this procedure can still be applied, but it logically leads to a great number of variables. It is then a good idea to group the classes or categories in order to avoid having variables which almost always take the same value (zero if the category is infrequent, or one if it appears often).

The EC variable could also be coded by giving arbitrary numerical values to the categories, for example B = 1, G = 2, Br = 3, Bl = 4, but this system has the inconvenience of suggesting a gradation of values which may not exist. Nevertheless, when the categories can be interpreted as a function of the value of a continuous variable, it makes more sense to code them with numbers which indicate the order of the categories. For example, if we classify companies as small, medium and large based on the number of workers, it makes sense to code them with the numbers 1, 2, and 3.
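As a hedged illustration of this coding scheme, the short sketch below (Python/NumPy, using a hypothetical sample of eye colors that is not part of the text) builds the three binary variables of Table 3.1, taking black (Bl) as the reference category.

```python
import numpy as np

# Hypothetical sample of the qualitative variable EC (eye color).
ec = np.array(["B", "G", "Br", "Bl", "G", "B"])

# p = 4 categories coded with p - 1 = 3 binary variables; Bl is the
# reference category and corresponds to x1 = x2 = x3 = 0.
categories = ["B", "G", "Br"]
X = np.column_stack([(ec == c).astype(int) for c in categories])

print(X)   # each row reproduces one of the rows of Table 3.1
```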

3.2.2. The data matrix

We assume from here on that we have observed p numerical variables in a set of n elements. Each of these p variables is a scalar or univariate variable, and the set of p variables forms a vector or multivariate variable. The values of the p scalar variables in each of the n elements can be arranged in a matrix, X, of dimensions (n × p), which we will call the data matrix. We will denote by x_{ij} the generic element of this matrix, which represents the value of the scalar variable j in the individual i. In other words, for the data x_{ij}:

    i = 1, ..., n   represents the individual,
    j = 1, ..., p   represents the variable.

Some examples of data which are used in multivariate data analysis are:

1. For 100 university students we record the age, gender (1 female, 0 male), grade average, city of residence (classified into 4 categories by size) and year of study. The initial data are represented in a table of 100 rows, each corresponding to the data of one student. The table has 5 columns, each containing the values of one of the 5 defined variables. Of these 5 variables, 3 are quantitative, one is binary (gender) and the other is a general qualitative variable (city of residence, coded with the values 1, 2, 3 and 4). Alternatively, we could classify the city of residence with three binary variables. The data matrix would then have n = 100 rows and p = 7 columns, corresponding to the three quantitative variables, gender, and the three additional binary variables used to describe the size of the city of residence.

2. For each of the 138 companies in an area we measure the number of workers, the turnover, the industrial sector and the amount received in official subsidies. If we classify the sector into eight classes with seven binary variables, the data matrix will have dimension 138 × 10, with three quantitative variables and seven binary ones (which describe the industrial sector).

3. At 400 points in a city we install monitoring stations which provide hourly measurements of 30 environmental and atmospheric pollution variables at each point. Each hour we have a data matrix with 400 rows, the observation points, and 30 columns, the 30 observed variables.

The data matrix, X, can be represented in two distinct ways. By rows, as:

\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & \cdots & \cdots & x_{2p} \\ \vdots & & & \vdots \\ x_{n1} & \cdots & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix}


where each x_i' is a row vector of dimension (1 × p) which contains the values of the p variables for individual i. Alternatively, we can represent the matrix X by columns:

\mathbf{X} = \begin{bmatrix} \mathbf{x}_{(1)} & \cdots & \mathbf{x}_{(p)} \end{bmatrix}

where each x_{(j)} is a column vector of dimension (n × 1) which represents the scalar variable x_j measured on the n elements of the population. We will denote by x = (x_1, ..., x_p)' the multivariate variable formed by the p scalar variables, which takes the values x_1, ..., x_n (the rows of X) in the n observed elements.
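The sketch below (Python/NumPy, with a small made-up data matrix) illustrates the two views of X: rows as individuals and columns as variables.

```python
import numpy as np

# A hypothetical data matrix with n = 4 individuals and p = 3 variables
# (say, age, height in cm and weight in kg).
X = np.array([[21, 178.0, 70.0],
              [23, 165.0, 58.0],
              [22, 181.0, 80.0],
              [24, 170.0, 63.0]])

n, p = X.shape
x_i = X[0, :]     # row i = 1: the p values observed on the first individual
x_j = X[:, 1]     # column j = 2: the "height" variable over the n individuals
print(n, p, x_i, x_j)
```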

3.2.3. Univariate analysis

Describing multivariate data means studying each variable separately as well as the relationships between them. We will assume that the reader is familiar with the descriptive analysis of a single variable; here we only review the formulas which will be used in other parts of the book. The univariate study of a scalar variable x_j begins with its mean:

\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}

which for a binary variable is the relative frequency with which the attribute appears, and for a numerical variable is the center of gravity, or geometrical center, of the data. Variability is measured with respect to the mean, by averaging the deviations between the data and their mean. If we define the deviations as d_{ij} = (x_{ij} - \bar{x}_j)^2, where the square is taken so that the deviations are always positive, the standard deviation is defined by:

s_j = \sqrt{\frac{\sum_{i=1}^{n} d_{ij}}{n}} = \sqrt{\frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n}}    (3.1)

and its square is the variance, s_j^2 = \sum_{i=1}^{n} d_{ij}/n. In order to compare the variability of different variables, it is advisable to construct relative measures of variability that do not depend on the units of measurement. One such measure is the coefficient of variation

CV_j = \sqrt{\frac{s_j^2}{\bar{x}_j^2}}

where we suppose that \bar{x}_j is different from zero. Thirdly, it is advisable to calculate the asymmetry coefficient, which measures the symmetry of the data with respect to their center and is calculated as:

A_j = \frac{1}{n} \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^3}{s_j^3}.

This coefficient is zero for a symmetric variable. When the absolute value of this coefficient is clearly greater than one, we can conclude that the data have a markedly asymmetric distribution.

An important characteristic of a set of data is its homogeneity. If the deviations d_{ij} are very different from one another, this suggests that there are data points far removed from the mean and that there is a high degree of heterogeneity. One possible measure of homogeneity is the variance of the d_{ij}, given by:

\frac{1}{n} \sum_{i=1}^{n} (d_{ij} - s_j^2)^2

since, according to (3.1), the mean of the deviations is \bar{d}_j = s_j^2. A dimensionless measure, analogous to the coefficient of variation, is obtained by dividing the variance of the deviations by the square of their mean, s_j^4, giving the homogeneity coefficient:

H_j = \frac{\frac{1}{n}\sum_{i=1}^{n} (d_{ij} - s_j^2)^2}{s_j^4}.

This coefficient is always greater than or equal to zero. Expanding the square in the numerator as

\sum_{i=1}^{n} (d_{ij} - s_j^2)^2 = \sum_{i=1}^{n} d_{ij}^2 + n s_j^4 - 2 s_j^2 \sum_{i=1}^{n} d_{ij} = \sum_{i=1}^{n} d_{ij}^2 - n s_j^4,

this coefficient can also be written as:

H_j = \frac{1}{n} \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^4}{s_j^4} - 1 = K_j - 1.

The first member of this equation, K_j, is an alternative way of measuring homogeneity and is known as the kurtosis coefficient. Just as H_j >= 0, the kurtosis coefficient will be greater than or equal to one. Both coefficients measure the relationship between the variability of the deviations and the mean deviation. It is easy to demonstrate that:

1. If there are some atypical data points far removed from the rest, the variability of the deviations will be large because of these values, and the kurtosis coefficient K_j, or the homogeneity coefficient H_j, will be high.

2. If the data are separated into two halves corresponding to two quite distinct distributions, that is, if we have two well separated sets of data, the mean of the data will be roughly equidistant from the two groups, the deviations of all the data points will be similar, and the coefficient H_j will be quite small (zero in the extreme case where half of the data are equal to some number -a and the other half equal to a).

A central objective in the description of data is to decide whether the data are a homogeneous sample from one population or whether they correspond to a mix of different populations which should be studied separately. As we will see in the next chapter, an especially important case of heterogeneity is the presence of a small proportion of atypical observations (outliers), which are heterogeneous with the rest. The detection of these outliers is fundamental for a correct description of the majority of the data since, as we will see, these extreme values distort the descriptive statistics of the set. The kurtosis coefficient can help with this objective, since with large outliers it takes on a high value, greater than 7 or 8. For example, if we contaminate data coming from a normal distribution with 1% of outliers generated by another normal distribution with the same mean but with a variance 20 times greater, the kurtosis coefficient will be around 10. Whenever we observe a large kurtosis value for a variable, it signals heterogeneity caused by a few outliers far removed from the rest.

A different type of heterogeneity appears when we have a mix of two populations, so that a large proportion of the data, between 25% and 50%, is heterogeneous with the rest. In this case, the kurtosis coefficient will be small, less than two. It is easy to show that if we mix two very different distributions in equal parts, then as the separation between the two populations increases, the kurtosis of the resulting distribution tends towards one, the minimum value of the coefficient.

The possible presence of outliers makes it advisable to calculate, together with the traditional statistics, robust measures of centralization and dispersion. For centralization, it is useful to calculate the median, which is the value halfway through the ordered data set, with an equal number of data values below and above it. For dispersion we calculate the MEDA, which is the median of the absolute deviations with respect to the median. Finally, it is always advisable to plot the continuous variables using a histogram or a box plot. In the initial analysis of the data it is always advisable to calculate the mean and median of each variable. If both are similar, the mean is a good indicator of the center of the data. Nevertheless, if they differ greatly, the mean might not be a good indicator due to: (1) an asymmetric distribution, (2) the presence of atypical values (which affect the mean greatly and the median hardly at all), or (3) heterogeneity in the data. Next we move on to the multivariate analysis of the observations. In this chapter we present how to obtain joint measures of centralization and dispersion for the set of variables, and measures of linear dependence between pairs of variables and among all the variables.
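As a hedged illustration of these claims, the sketch below (Python/NumPy, simulated data; the sample sizes and seed are arbitrary choices, not part of the text) computes the coefficients defined above, together with the median and the MEDA, for a normal sample contaminated with 1% of outliers from a distribution with the same mean and a variance 20 times greater; the kurtosis coefficient typically comes out around 10, as stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def univariate_summary(x):
    """Descriptive measures as defined in this section (divisor n, not n - 1)."""
    n = len(x)
    mean = x.mean()
    s2 = ((x - mean) ** 2).mean()           # variance
    s = np.sqrt(s2)                         # standard deviation (3.1)
    cv = np.sqrt(s2 / mean ** 2)            # coefficient of variation (mean != 0)
    A = ((x - mean) ** 3).mean() / s ** 3   # asymmetry coefficient
    K = ((x - mean) ** 4).mean() / s ** 4   # kurtosis coefficient
    med = np.median(x)
    meda = np.median(np.abs(x - med))       # MEDA
    return mean, s, cv, A, K, med, meda

# 99% of the data from N(5, 1), 1% from N(5, 20): same mean, variance 20 times greater.
clean = rng.normal(5.0, 1.0, size=9900)
outliers = rng.normal(5.0, np.sqrt(20.0), size=100)
x = np.concatenate([clean, outliers])

mean, s, cv, A, K, med, meda = univariate_summary(x)
print(f"kurtosis K = {K:.1f}")                     # typically around 10
print(f"mean = {mean:.2f}, median = {med:.2f}, s = {s:.2f}, MEDA = {meda:.2f}")
```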

3.3. MEASURES OF LOCATION: THE MEAN VECTOR

The most commonly used measure of centralization for describing multivariate data is the mean vector. This is a vector of dimension p whose components are the means of each of the p variables. It can be calculated, as in the scalar case, by averaging the observations, which are now vectors:

\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_p \end{bmatrix}    (3.2)

In terms of the data matrix it can be expressed as:

\bar{\mathbf{x}} = \frac{1}{n} \mathbf{X}'\mathbf{1},    (3.3)

where 1 always denotes a vector of ones of the appropriate dimension. Writing the matrix X in terms of its rows x_i', whose transposes x_i (of dimension p × 1) contain the values of the p variables for each element of the sample and are the columns of X', we have:

\bar{\mathbf{x}} = \frac{1}{n} \begin{bmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_n \end{bmatrix} \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix},    (3.4)


which leads to (3.2). The mean vector is the center of balance of the data, and it has the property that the sum of the deviations with respect to it is zero:

\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}}) = \mathbf{0}.

Writing the sum as \sum_{i=1}^{n} \mathbf{x}_i - n\bar{\mathbf{x}} and applying definition (3.2), it follows immediately that the sum is zero.

The scalar measures of centralization based on the ranking of the observations cannot easily be generalized to multivariate variables. For example, we can compute the vector of medians, but this point is not necessarily the center of the data. This difficulty is caused by the lack of a natural ordering in multivariate data.

Example: The MEDIFIS data in the medifis.dat file contain eight body measurement variables taken from a group of 27 students. The variables are sex (indicated by 0 for female and 1 for male), height (ht, in cm), weight (wt, in kg), foot length (ftl, in cm), arm length (arml, in cm), back width (bwth, in cm), diameter of cranium (crd, in cm) and length from knee to ankle (kn-al, in cm). Table 3.2 gives the means and standard deviations of the variables as well as other univariate statistics for each variable.

                   sex     ht     wt    ftl   arml   bwth    crd  kn-al
Means              .44  168.8   63.9   39.0   73.5   45.9   57.2   43.1
St. deviations     .53   10.0   12.6    2.8    4.9    3.9    1.8    3.1
Asym. coeff.       .22    .15    .17    .27    .37   -.22    .16    .56
Kurtosis coeff.   1.06    1.8    2.1    1.9    2.1    2.4    2.0    3.4
Coeff. variation   1.2    .06    .20    .07    .07    .09    .03    .07

Table 3.2: Descriptive analysis of the physical measurements

The mean of the binary variable sex is the proportion of ones (males) in the data, and its standard deviation is \sqrt{p(1-p)}, where p is the mean. The reader can check that for a binary variable the kurtosis coefficient is

\frac{p^3 + (1-p)^3}{p(1-p)}

and, in this case, since p = 0.44, the kurtosis coefficient is 1.06. If we look at the coefficients of variation, we observe that in the measurements of length, such as height, foot, arm and leg length, which are determined more by heredity than by lifestyle, the relative variability is small, in the range 0.03 to 0.07. The relative variability of the variables which depend on lifestyle, such as weight, is much greater, 20%. The distributions are approximately symmetric, judging by the low values of the asymmetry coefficients. The kurtosis coefficients are low, less than or equal to two for three of the variables, which may indicate the presence of two mixed populations, as we will see in Section 3.6. None of the variables has a high kurtosis, thus allowing us to rule out the presence of a few large outliers.
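As a quick numerical check of this formula (a sketch with a made-up 0/1 vector, not the MEDIFIS file itself), the kurtosis coefficient of a binary variable with mean p = 0.44 can be computed directly:

```python
import numpy as np

p = 0.44
# Closed-form kurtosis coefficient of a binary variable with mean p.
k_formula = (p**3 + (1 - p)**3) / (p * (1 - p))

# The same value from the definition K = m4 / s^4, using a 0/1 vector
# whose proportion of ones is exactly p (here 44 ones out of 100).
x = np.array([1] * 44 + [0] * 56)
m = x.mean()
k_empirical = ((x - m) ** 4).mean() / ((x - m) ** 2).mean() ** 2

print(round(k_formula, 2), round(k_empirical, 2))   # both close to 1.06
```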


Table 3.3 shows two robust measures, the median and the MEDA, or median of absolute deviations, for each variable. These measures confirm the above comments.

                 ht     wt    ftl   arml   bwth    crd  kn-al
Medians         168     65     39     73     46     57     43
MEDAs          8.51  10.50   2.38   3.96   3.26   1.52   2.39
MEDA/median     .05    .16    .05    .05    .07    .03    .06

Table 3.3: Robust descriptive analysis of the physical measurements

We observe that the medians are quite similar to the means and the MEDAs to the standard deviations, which suggests a lack of extreme values. The robust coefficients of variation, calculated as the ratio between the MEDA and the median, are also broadly similar to the previous ones. It should be pointed out that, in general, the MEDA is smaller than the standard deviation, and therefore these variation coefficients will be smaller than the original ones. What is important is that the structure between the variables is similar. Figure 3.1 gives the histogram of the height variable, and we can see that the data seem to be a mixture of two distributions. This is to be expected, as we have men and women together.

Figure 3.1: Histogram of height showing a mixed distribution.

3.4. THE MATRIX OF VARIANCES AND COVARIANCES

As we have mentioned, for scalar variables the variability with respect to the mean is usually measured by the variance or by its square root, the standard deviation. The linear relationship between two variables is measured by the covariance. The covariance between the variables x_j and x_k is computed as:

s_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)

and measures their linear dependence.

For a vector variable, the matrix of variances and covariances is defined as:

\mathbf{S} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'    (3.5)

which is a square and symmetric matrix containing the variances on its diagonal and the covariances between the variables off the diagonal. In fact, when the vectors of deviations are multiplied, as

\begin{bmatrix} x_{i1} - \bar{x}_1 \\ \vdots \\ x_{ip} - \bar{x}_p \end{bmatrix} [x_{i1} - \bar{x}_1, \ldots, x_{ip} - \bar{x}_p] = \begin{bmatrix} (x_{i1} - \bar{x}_1)^2 & \cdots & (x_{i1} - \bar{x}_1)(x_{ip} - \bar{x}_p) \\ \vdots & & \vdots \\ (x_{ip} - \bar{x}_p)(x_{i1} - \bar{x}_1) & \cdots & (x_{ip} - \bar{x}_p)^2 \end{bmatrix}

we obtain the matrix of squares and cross-products of the p variables for element i. Summing over all the elements and dividing by n, we obtain the variances on the diagonal and the covariances off it. The matrix of variances and covariances, which from here on for simplicity's sake we will call the covariance matrix, is the symmetric matrix of order p of the form:

\mathbf{S} = \begin{bmatrix} s_1^2 & \cdots & s_{1p} \\ \vdots & & \vdots \\ s_{p1} & \cdots & s_p^2 \end{bmatrix}.

3.4.1. Calculations based on the centered data matrix

The matrix S can be obtained directly from the centered data matrix \tilde{X}, which is defined as the matrix obtained by subtracting its mean from each variable:

\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}'.

Replacing the mean vector with its expression (3.3):

\tilde{\mathbf{X}} = \mathbf{X} - \frac{1}{n} \mathbf{1}\mathbf{1}'\mathbf{X} = \mathbf{P}\mathbf{X},    (3.6)

where the square matrix P is defined by

\mathbf{P} = \mathbf{I} - \frac{1}{n} \mathbf{1}\mathbf{1}',


and is symmetric and idempotent (the reader can check that PP = P). The matrix P has rank n - 1 and, since P1 = 0, it projects the data orthogonally onto the subspace orthogonal to the one spanned by the constant vector (the vector with all coordinates equal). Then the matrix S can be written as:

\mathbf{S} = \frac{1}{n} \tilde{\mathbf{X}}'\tilde{\mathbf{X}} = \frac{1}{n} \mathbf{X}'\mathbf{P}\mathbf{X}.    (3.7)

Some authors define the covariance matrix dividing by n - 1 instead of n, in order to obtain an unbiased estimator of the population matrix. This divisor appears, as in the univariate case, because we do not have n independent deviations but only n - 1: the n vectors of deviations, (x_i - \bar{x}), are bound by the equation

\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}}) = \mathbf{0},

so only n - 1 of them are independent. In this book we will denote the corrected covariance matrix by \hat{S}, given by:

\hat{\mathbf{S}} = \frac{1}{n-1} \tilde{\mathbf{X}}'\tilde{\mathbf{X}}.
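The following sketch (Python/NumPy, with an arbitrary small data matrix, not data from the book) computes the mean vector (3.3), the centering matrix P, and the covariance matrix S through (3.7), and checks the result against the direct definition (3.5); note that numpy.cov uses the n - 1 divisor by default, so bias=True is needed to reproduce S.

```python
import numpy as np

X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 1.0],
              [3.0, 3.5, 1.5],
              [4.0, 2.5, 3.0]])
n, p = X.shape
ones = np.ones((n, 1))

x_bar = X.T @ ones / n                     # mean vector (3.3), a p x 1 column
P = np.eye(n) - ones @ ones.T / n          # centering matrix P = I - (1/n) 1 1'
X_tilde = P @ X                            # centered data matrix (3.6)

S = X_tilde.T @ X_tilde / n                # covariance matrix (3.7)
S_direct = sum(np.outer(xi - x_bar.ravel(), xi - x_bar.ravel()) for xi in X) / n
S_hat = X_tilde.T @ X_tilde / (n - 1)      # corrected covariance matrix

assert np.allclose(S, S_direct)
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
assert np.allclose(S_hat, np.cov(X, rowvar=False))
```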

Example: The ACCIONES data from the acciones.dat file include three measures of return for 34 stocks traded on the stock market during a period of time. The first, x1, is the rate of return on investment (dividends divided by stock price), x2 is the proportion of earnings distributed as dividends (profit paid out as dividends over total profit) and x3 is the price-to-earnings ratio. Table 3.4 shows the descriptive measures of the three variables.

                   x1 (return)   x2 (prof.)   x3 (price)
Means                   9.421        69.53        9.097
St. deviations          5.394        24.00        4.750
Asym. coeff.             0.37         0.05         2.71
Kurtosis coeff.          1.38         1.40        12.44

Table 3.4: Descriptive analysis of the stock returns

The asymmetry and kurtosis coefficients indicate that the three variables do not follow the normal distribution: the first two have very low kurtosis values, which indicates high heterogeneity, possibly due to the presence of two different groups of data, and the third has a high kurtosis, which suggests the presence of outliers.

These characteristics are clearly seen in the histograms of the variables in Figures 3.2 and 3.3. The first variable, x1, shows two groups of stocks with very different behavior. The histogram of the second variable, x2, also shows two groups of stocks. Finally, the third variable (Figure 3.3) is quite asymmetric, with a very noticeable outlier. The available evidence indicates that the stocks can probably be divided into two more homogeneous groups. Nevertheless, we will illustrate the analysis using all the data.


Figure 3.2: Histograms of the rate of return on investment and of the proportion of profit distributed as dividends.

The covariance matrix of these three variables is shown in Table 3.5

          X1       X2       X3
X1      29.1    100.4    -15.7
X2     100.4      576    -18.5
X3     -15.7    -18.5     22.6

Table 3.5: Covariance matrix of the stocks

Figure 3.3: Histogram of the price-to-earnings ratio.

The diagonal elements of this matrix are the squares of the standard deviations in Table 3.4. Since the dimensions of the variables are different, it makes no sense to compute average measures of the variances or covariances.

The histograms of the three variables have shown a clear lack of normality. One possibility, which we will study in more detail in the next chapter, is to transform the variables in order to facilitate their interpretation. Taking logarithms, the covariance matrix of the transformed variables is shown in Table 3.6.

           log x1   log x2   log x3
log x1       .35      .15     -.19
log x2       .15      .13     -.03
log x3      -.19     -.03      .16

Table 3.6: Covariance matrix of the stocks (logarithms)

We observe that the logarithms greatly modify the results. The numbers are now homogeneous, and the variable with the greatest variance is now the first, the logarithm of the rate of return, whereas the smallest is the second, the logarithm of the proportion of profit distributed as dividends. The relationship between the logarithm of the price-to-earnings ratio (X3) and the rate of return is negative. The other relationships are weak. An additional advantage of the logarithms is that they provide measures of variability independent of the scale of measurement: multiplying a variable by a constant is, after taking logarithms, equivalent to adding a constant, which does not alter its variability. Therefore, the variances of the variables in logarithms can be compared even though the data have different dimensions. The mean variance of the three variables is

\overline{Var} = \frac{0.35 + 0.13 + 0.16}{3} = 0.213,

which is a reasonable measure of global variability.


3.4.2. Properties

Just as the variance is always a non-negative number, the covariance matrix has an analogous property: it is positive semidefinite. This property ensures that for any vector y, y'Sy ≥ 0. The trace, the determinant and the eigenvalues of this matrix are therefore non-negative as well.

Proof

Letting w be any vector of dimension p, we define the scalar variable:

v_i = \mathbf{w}'(\mathbf{x}_i - \bar{\mathbf{x}}).    (3.8)

The mean of this variable is:

\bar{v} = \frac{1}{n} \sum_{i=1}^{n} v_i = \frac{1}{n} \mathbf{w}' \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}}) = 0,

and its variance must be non-negative, thus:

Var(v) = \frac{1}{n} \sum_{i=1}^{n} v_i^2 = \frac{1}{n} \sum_{i=1}^{n} [\mathbf{w}'(\mathbf{x}_i - \bar{\mathbf{x}})][(\mathbf{x}_i - \bar{\mathbf{x}})'\mathbf{w}] = \mathbf{w}'\mathbf{S}\mathbf{w} \geq 0.

Since this inequality is valid for any vector w, we conclude that S is positive semidefinite. This condition also implies that if Sw_i = \lambda_i w_i, then \lambda_i \geq 0. Finally, all principal minors are non-negative (in particular |S| \geq 0).
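A small numerical sketch of this property (Python/NumPy, with a covariance matrix built from arbitrary simulated data): the quadratic form w'Sw stays non-negative for random vectors w, and the eigenvalues of S are non-negative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Covariance matrix of an arbitrary simulated data set (divisor n).
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)

# w' S w >= 0 for any vector w (up to floating-point rounding)...
quad_forms = [w @ S @ w for w in rng.normal(size=(1000, 4))]
print(min(quad_forms) >= -1e-12)

# ...and the eigenvalues of S are all non-negative.
print(np.all(np.linalg.eigvalsh(S) >= -1e-12))
```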

3.4.3. Redundant variables: the case of a singular matrix S

We are going to analyze the consequences of the matrix S being singular. If there is a vector w such that w'Sw = 0, then the variable (3.8) has zero variance and, as its mean is zero, it always takes the value zero. Therefore, for every i:

\sum_{j=1}^{p} w_j (x_{ij} - \bar{x}_j) = 0 \qquad \forall i.

This equation implies that the p variables are not linearly independent, since we can obtain any one of them as a function of the others:

x_{i1} = \bar{x}_1 - \frac{w_2}{w_1}(x_{i2} - \bar{x}_2) - \cdots - \frac{w_p}{w_1}(x_{ip} - \bar{x}_p).

Therefore, if there is a vector w such that w'Sw = 0, there is a linear relationship between the variables. The converse is also true: if there is a linear relationship between the variables, then w'(x_i - \bar{x}) = 0 for all elements, that is,

\tilde{\mathbf{X}}\mathbf{w} = \mathbf{0},


and multiplying this expression on the left by the matrix \tilde{X}' and dividing by n:

\frac{1}{n} \tilde{\mathbf{X}}'\tilde{\mathbf{X}}\mathbf{w} = \mathbf{S}\mathbf{w} = \mathbf{0}.    (3.9)

This condition implies that the matrix S has a characteristic root, or eigenvalue, equal to zero, and that w is the characteristic vector (eigenvector) associated with that zero root. Conversely, multiplying (3.9) by w' we get (\tilde{X}w)'(\tilde{X}w) = 0, which implies that \tilde{X}w = 0, and we conclude that one variable is an exact linear combination of the others. As a result, it is possible to reduce the dimensionality of the system by eliminating this variable. Furthermore, we see that the coordinates of the vector w indicate the redundant linear combination.

Example: The following covariance matrix corresponds to four simulated variables such that three of them are linearly independent while the fourth is the average of the first two.

\mathbf{S} = \begin{bmatrix} 0.0947 & 0.0242 & 0.0054 & 0.0594 \\ 0.0242 & 0.0740 & 0.0285 & 0.0491 \\ 0.0054 & 0.0285 & 0.0838 & 0.0170 \\ 0.0594 & 0.0491 & 0.0170 & 0.0543 \end{bmatrix}

The eigenvalues of this matrix, calculated with Matlab, are (0.17297, 0.08762, 0.04617 and 0.00005). The smallest eigenvalue is practically zero compared with the other three, so the matrix has, to a very close approximation, rank 3. The eigenvector associated with this null eigenvalue is (.408, .408, .000, -.816). Dividing by the largest term, this eigenvector can be written as (.5, .5, 0, -1), which reveals that the lack of full rank of the covariance matrix is due to the fact that the fourth variable is the average of the first two.

Example: The EUROSEC data found in the eurosec.dat file include 26 countries, for each of which the percentage of the population employed in each of nine economic sectors was measured. The eigenvalues of the covariance matrix of these data are presented next, and we can see that there is an eigenvalue very close to 0 (.0019). This eigenvalue is not exactly zero because in the data table the sums of the rows are not exactly 100% in all cases due to rounding errors (they vary between 99.8% and 102%):

0.0019   0.0649   0.4208   1.0460   2.4434   5.6394   15.207   43.7017   303.458

The eigenvalue 0.0019 defines a scalar variable with practically no variability, since the eigenvector linked to this eigenvalue is the vector (.335, .324, .337, .339, .325, .337, .334, .332, .334). This vector is approximately the constant vector, and it indicates that the sum of all the variables gives an approximately constant scalar variable.

The second eigenvalue, 0.0649, is also quite small. The corresponding eigenvector is (-0.07, -0.29, -0.07, 0.91, 0.00, -0.10, -0.12, -0.05, -0.22). This eigenvector is dominated by the fourth variable, which is much more heavily weighted than the others. This suggests that the fourth variable, the energy sector, must have a similar weight in all the countries. The covariance matrix of the variables is:

(only the lower triangle of the symmetric matrix is shown)

  241.60
    0.53    0.94
  -73.11    3.02   49.10
   -2.33    0.14    1.01   0.14
  -13.77   -0.04    5.70   0.03   2.70
  -52.42    1.76    6.53   0.34   2.68   20.93
    9.59   -1.20   -3.06   0.11   0.07    4.69   7.87
  -79.29   -1.86    7.37   0.34   1.77   17.87   2.06   46.64
  -12.22    0.21    3.41   0.19   0.88    1.19  -0.96    5.39   1.93

and we observe that the fourth variable has much less variability than the rest.
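A sketch of the kind of check used in the first example above (Python/NumPy, with simulated data rather than the EUROSEC file): three independent variables are generated, a fourth is defined as the average of the first two, and the smallest eigenvalue of S together with its eigenvector reveals the redundancy.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200
Z = rng.normal(size=(n, 3))                    # three linearly independent variables
x4 = (Z[:, 0] + Z[:, 1]) / 2                   # fourth variable: average of the first two
X = np.column_stack([Z, x4])

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

eigval, eigvec = np.linalg.eigh(S)             # eigenvalues in increasing order
print(np.round(eigval, 5))                     # the first one is (numerically) zero

w = eigvec[:, 0]
print(np.round(w / np.max(np.abs(w)), 2))      # proportional to (.5, .5, 0, -1), up to sign
```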

Generalization

This procedure can be extended to any number of null eigenvalues: if S has rank h < p, there are p - h redundant variables that can be eliminated. The vectors associated with the null eigenvalues indicate the composition of these redundant variables. If S has rank h, this will be the number of non-null eigenvalues, and there will be r = p - h vectors which verify:

\mathbf{S}\mathbf{w}_1 = \mathbf{0}, \quad \ldots, \quad \mathbf{S}\mathbf{w}_r = \mathbf{0}

or, equivalently, there are r relationships of the form:

(\mathbf{x}_i - \bar{\mathbf{x}})'\mathbf{w}_j = 0, \qquad j = 1, \ldots, r,

which imply r exact linear relationships among the variables. We can thus represent the observations with h = p - r variables. There are many possible representations, since any vector of the subspace defined by (w_1, ..., w_r) can be expressed as a linear combination of these vectors and verifies:

\mathbf{S}(a_1\mathbf{w}_1 + \cdots + a_r\mathbf{w}_r) = \mathbf{0}.

The r eigenvectors of S associated with the null eigenvalues make up an orthonormal basis (orthogonal vectors of unit norm) of that subspace. We see that the linear relationships among the variables are not uniquely defined, since given two linear relationships, any linear combination of them is also a linear relationship.

An alternative way of analyzing the problem is the following. Since

\mathbf{S} = \frac{1}{n} \tilde{\mathbf{X}}'\tilde{\mathbf{X}},

the rank of S coincides with that of the matrix \tilde{X}, because for any matrix A, letting r(A) denote the rank of A, it always holds that:

r(\mathbf{A}) = r(\mathbf{A}') = r(\mathbf{A}'\mathbf{A}) = r(\mathbf{A}\mathbf{A}').


Therefore, if the matrix \tilde{X} has rank p, this will also be the rank of S. If, on the other hand, there are h linear relationships among the variables in X, the rank of the matrix \tilde{X} will be p - h, and this will be the rank of the matrix S as well.

Example: Let us calculate the eigenvectors of the covariance matrix for the ACCIONES data, acciones.dat file, which were analyzed in example 3.2. The eigenvalues of the covariance matrix of the original variables are (594.86, 29.82, 3.22); there is one very large eigenvalue and two small ones, and the smallest is linked to the eigenvector (0.82, -0.13, 0.55). For the variables in logarithms, the eigenvalues are (0.5208, 0.1127 and 0.0065). There is now an eigenvalue which is much smaller than the other two, and its eigenvector is (.57, -.55, .60).

In order to interpret the variable defined by this eigenvector, we write it as a function of the original variables. Keeping in mind the definition of the variables, and letting d be the dividend per share, p the price per share, B the total profit and N the number of shares, and assuming that the majority of the profit is distributed as dividends (which is only an approximation), we can write

y = .57 log(d/p) - .55 log(dN/B) + .60 log(pN/B)

and, rounding the coefficients, this variable is, approximately,

y = .6 log[(d/p)(B/dN)(pN/B)] = .6 log 1 = 0.

That is, letting X_i be the variables in logarithms, the variable defined by the combination X_1 - X_2 + X_3 must take small values. If we construct this variable from the data, its mean is .01 and its variance is .03, which is much smaller than that of the original variables. We see that this variable has little variability, but it is not constant, and there is no deterministic relationship between the three variables in logarithms. The portion of profits not distributed as dividends, while small on average, is not negligible for some stocks. We observe that this information, which is revealed through the analysis of the eigenvectors of the covariance matrix, could easily go unnoticed.

3.5. GLOBAL MEASURES OF VARIABILITY

When the variables are measured in the same units (euros, km) or are dimensionless (percentages, proportions, etc.), it is of interest to find measures of average variability which allow us to compare different sets of variables. First we obtain these global measures as summaries of the covariance matrix and, second, we interpret them using the concept of distance between points.

3.5.1. Total variability and average variance

One way of summarizing the variability of a set of variables is through the trace of their covariance matrix. The total variability of the data is defined by:

T = \mathrm{tr}(\mathbf{S}) = \sum_{i=1}^{p} s_i^2,

and the average variance by:

\bar{s}^2 = \frac{1}{p} \sum_{i=1}^{p} s_i^2.    (3.10)

The drawback of these measures is that they do not take into account the dependence structure between the variables. To illustrate the problem, assume p = 2 and take the extreme case in which both variables are the same variable expressed in different units. Then the joint variability of the two variables in the plane is null, because the points are always forced to lie on the straight line that defines the change of units, and yet \bar{s}^2 may be high. In general, if the dependence between the variables is very high, the joint variability is intuitively small, since if one variable is known we can determine approximately the values of the rest. The average variance does not capture this, because it ignores the linear dependencies.

3.5.2. Generalized Variance

A better measure of global variability is the generalized variance, a concept due to Wilks. It is defined as the determinant of the covariance matrix, that is,

GV = |\mathbf{S}|.

Its square root is called the generalized deviation, and it has the following properties:

a) It is well defined, since the determinant of the covariance matrix is always non-negative.

b) It is a measure of the area (for p = 2), volume (for p = 3) or hypervolume (for p > 3) occupied by the set of data.

To clarify these ideas, consider the case p = 2. Then S can be written as:

\mathbf{S} = \begin{bmatrix} s_x^2 & r s_x s_y \\ r s_x s_y & s_y^2 \end{bmatrix}

and the generalized deviation is:

|\mathbf{S}|^{1/2} = s_x s_y \sqrt{1 - r^2}.    (3.11)

If the variables are independent, the majority of their values lie inside a rectangle of sides 6 s_x and 6 s_y because, according to Chebyshev's theorem, at least 8/9 (roughly 90%) of the data lie within three standard deviations of the mean. As a result, the area occupied by the two variables is directly proportional to the product of the standard deviations.

If the variables are linearly related and the correlation coefficient is not zero, the majority of the points will tend to fall in a band around the regression line, and there will be a reduction of the area which increases as r^2 increases. In the limit, if r^2 = 1, all the points lie on a straight line, there is an exact linear relationship between the variables, and the occupied area is zero. Formula (3.11) describes this contraction of the area occupied by the points as the correlation coefficient increases.


A drawback of the generalized variance is that it is not suitable for comparing sets of data with different numbers of variables, since GV has the dimensions of the product of the variables included. If we add to a set of p variables with generalized variance |S_p| an additional variable, uncorrelated with the rest and with variance s_{p+1}^2, it is easy to prove, using the results on the determinant of a partitioned matrix presented in 2.3.5, that

|\mathbf{S}_{p+1}| = |\mathbf{S}_p| \, s_{p+1}^2,

and by choosing the units of measurement of variable p + 1 we can increase or decrease the generalized variance at will. Take the simplest case, in which the matrix S is diagonal and the variables are expressed in the same units, euros for example. Then

|\mathbf{S}_p| = s_1^2 \cdots s_p^2.

Now assume that all the variances in euros are greater than one. Then, if we add a variable p + 1, the new generalized variance will be

|\mathbf{S}_{p+1}| = s_1^2 \cdots s_p^2 \, s_{p+1}^2 = |\mathbf{S}_p| \, s_{p+1}^2 > |\mathbf{S}_p|,

since s_{p+1}^2 > 1. In this case, the generalized variance increases monotonically as new variables are included; that is, letting |S_j| be the generalized variance of the first j variables, we get

|\mathbf{S}_p| > |\mathbf{S}_{p-1}| > \cdots > |\mathbf{S}_2| > s_1^2.

Suppose now instead that the variables are expressed in thousands of euros and that with this change all the variances are less than one. Then the generalized variance will decrease monotonically as variables are included.

3.5.3. Effective variance

Peña and Rodríguez (2000) have proposed a global measure of variability, the effective variance, defined by

EV = |\mathbf{S}|^{1/p},    (3.12)

which has the advantage that, when all the variables have the same dimensions, this measure has the units of a variance. For diagonal matrices the effective variance is simply the geometric mean of the variances. Since the determinant is the product of the eigenvalues, the effective variance is the geometric mean of the eigenvalues of the matrix S, and it is always non-negative.

As the geometric mean of a set of numbers is never greater than its arithmetic mean, this measure will always be less than or equal to the average variance. The effective variance takes into account the joint dependence, since if one variable were a linear combination of the others there would be a null eigenvalue and the measure (3.12) would be null, whereas the average variance (3.10) would not be. We will see in the following chapters that the effective variance and the average variance play an important role in multivariate procedures. Analogously, we can define the effective deviation as:

ED = |\mathbf{S}|^{1/(2p)}.
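The sketch below (Python/NumPy, with a covariance matrix built from arbitrary simulated data) compares the global measures of this section: total variability tr(S), average variance, generalized variance |S|, effective variance |S|^{1/p} and effective deviation |S|^{1/(2p)}.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data with strongly correlated variables (so EV << average variance).
n, p = 300, 3
Z = rng.normal(size=(n, p))
X = Z @ np.array([[1.0, 0.8, 0.6],
                  [0.0, 0.4, 0.5],
                  [0.0, 0.0, 0.3]])        # introduces correlation

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

T = np.trace(S)                            # total variability
avg_var = T / p                            # average variance (3.10)
GV = np.linalg.det(S)                      # generalized variance
EV = GV ** (1 / p)                         # effective variance (3.12)
ED = GV ** (1 / (2 * p))                   # effective deviation

print(f"tr(S) = {T:.3f}, average variance = {avg_var:.3f}")
print(f"GV = {GV:.5f}, EV = {EV:.3f}, ED = {ED:.3f}")
```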


Example: Starting from the covariance matrix S of the logarithms of the stocks, acciones.dat file, in example 3.5, we obtain

|\mathbf{S}| = 0.000382.

The effective variance is

EV = |\mathbf{S}|^{1/3} = 0.0726,

which we can compare with the arithmetic mean of the three variances that we computed in example 3.2:

\mathrm{tr}(\mathbf{S})/3 = 0.2133.

As we can see, the strong dependence between the variables makes the effective variance, which takes the covariances into account, much smaller than the average of the variances, which disregards them. For standard deviations,

ED = |\mathbf{S}|^{1/6} = 0.269,

which we can take as a global measure of variability of the original data.

Example: The covariance matrix for the body measurement data (medifis.dat) is:

\mathbf{S} = \begin{bmatrix}
100.24 & 104.49 & 26.12 & 44.22 & 33.20 & 10.64 & 26.19 \\
104.49 & 158.02 & 30.04 & 50.19 & 41.67 & 14.08 & 27.99 \\
 26.12 &  30.04 &  7.91 & 11.66 &  8.86 &  2.79 &  7.42 \\
 44.22 &  50.19 & 11.66 & 23.69 & 15.40 &  4.18 & 11.55 \\
 33.20 &  41.67 &  8.86 & 15.40 & 15.59 &  4.48 &  7.72 \\
 10.64 &  14.08 &  2.79 &  4.18 &  4.48 &  3.27 &  3.11 \\
 26.19 &  27.99 &  7.42 & 11.55 &  7.72 &  3.11 &  9.61
\end{bmatrix}

and the effective variance is EV = |\mathbf{S}|^{1/7} = 5.7783, with (EV)^{1/2} = ED = 2.4038. Since there is considerable dependence, these measures are much smaller than the averages of the variances; for example, tr(S)/7 = 45.48. We observe that in this example this measure does not have a clear interpretation, since the variables are in different units.

3.6. VARIABILITY AND DISTANCES

An alternative procedure for studying the variability of the observations is to use the concept of distance between points. In the scalar case, the distance between the value of a variable x at a point, x_i, and the mean of the variable, \bar{x}, is naturally measured by

\sqrt{(x_i - \bar{x})^2},

or, equivalently, by the absolute value of the difference, |x_i - \bar{x}|. The variance is an average of the squared distances between the points and their mean. When we have a vector variable, each data point is a point in R^p, and we can construct measures of variability by averaging the distances between each point and the vector of means. This requires generalizing the concept of distance to spaces of any dimension, a concept which will be important in the following chapters.


3.6.1. The concept of distance

Given two points x_i, x_j in R^p, we say that we have established a distance, or metric, between them if we have defined a function d with the following properties:

1. d: R^p × R^p → R^+; that is, given two points in the space of dimension p, their distance under this function is a non-negative number, d(x_i, x_j) ≥ 0.

2. d(x_i, x_i) = 0 for all i: the distance between an element and itself is zero.

3. d(x_i, x_j) = d(x_j, x_i): the distance is a symmetric function of its arguments.

4. d(x_i, x_j) ≤ d(x_i, x_k) + d(x_k, x_j): given three points, the sum of the lengths of any two sides of the triangle they form must be greater than or equal to the third. This property is known as the triangle inequality.

These properties generalize the intuitive notion of distance between two points on a straight line. A family of distances often used in R^p is the Minkowski metric, defined as a function of a parameter r by

d_{ij}^{(r)} = \left( \sum_{s=1}^{p} |x_{is} - x_{js}|^r \right)^{1/r}    (3.13)

and the exponents most often used are r = 2, which leads to the Euclidean or L2 distance,

d_{ij} = \left( \sum_{s=1}^{p} (x_{is} - x_{js})^2 \right)^{1/2} = \left[ (\mathbf{x}_i - \mathbf{x}_j)'(\mathbf{x}_i - \mathbf{x}_j) \right]^{1/2},

and r = 1, which gives the L1 distance:

d_{ij} = |\mathbf{x}_i - \mathbf{x}_j|' \mathbf{1},

where 1' = (1, ..., 1) and the absolute value is taken componentwise.

The Euclidean distance is the most often used, but it has the inconvenience of depending on the units of measurement of the variables. For example, let x be the height of a person in meters and y their weight in kilograms. We compare the distances between three people: A(1.80, 80), B(1.70, 72) and C(1.65, 81). The square of the Euclidean distance from individual A to B is:

d^2(A, B) = (1.80 - 1.70)^2 + (80 - 72)^2 = 0.1^2 + 8^2 = 64.01

and, analogously, d^2(A, C) = 0.15^2 + 1^2 = 1.0225. Therefore, with the Euclidean distance, individual A is much closer to C than to B. Now, to make the numbers more comparable, suppose that we decide to measure height in centimeters instead of meters. The coordinates of the individuals are now A(180, 80), B(170, 72) and C(165, 81), and the squared Euclidean distances between the individuals become d^2(A, B) = 10^2 + 8^2 = 164 and d^2(A, C) = 15^2 + 1^2 = 226. With the change of units, individual A is now closer to B than to C. The Euclidean distance thus depends greatly on the units of measurement, and when there is no natural unit, as in this example, its use is unjustified.

One way of avoiding the problem of units is to divide each variable by a term which eliminates the effect of scale. This leads to a family of weighted Euclidean metrics, given by

d_{ij} = \left[ (\mathbf{x}_i - \mathbf{x}_j)' \mathbf{M}^{-1} (\mathbf{x}_i - \mathbf{x}_j) \right]^{1/2}    (3.14)

where M is a diagonal matrix used to standardize the variables and make the measure invariant to changes of scale. For example, if we place the variances of the variables on the diagonal of M, equation (3.14) becomes

d_{ij} = \left( \sum_{s=1}^{p} \left( \frac{x_{is} - x_{js}}{s_s} \right)^2 \right)^{1/2} = \left( \sum_{s=1}^{p} s_s^{-2} (x_{is} - x_{js})^2 \right)^{1/2}

which can be seen as a Euclidean distance in which each coordinate is weighted inversely to its variance. For example, if we suppose that the standard deviations of height and weight are 10 cm and 10 kg, the standardized squared distances between the above individuals are

d^2(A, B) = 1^2 + 0.8^2 = 1.64

and

d^2(A, C) = 1.5^2 + 0.1^2 = 2.26.

With this metric, which is more reasonable, A is closer to B than to C.

In general, the matrix M need not be diagonal, but it must always be non-singular and positive definite so that d_{ij} ≥ 0. In the specific case where we take M = I, the Euclidean distance is obtained again. If we use M = S, we get the Mahalanobis distance, which we study next.

3.6.2. The Mahalanobis Distance

The Mahalanobis distance between a point and the vector of means is defined by

d_i = \left[ (\mathbf{x}_i - \bar{\mathbf{x}})' \mathbf{S}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) \right]^{1/2}.

The value d_i^2 is frequently referred to as the Mahalanobis distance, instead of as the squared Mahalanobis distance, and in this book, for simplicity, we will also do so, although strictly speaking the distance is d_i. We are now going to interpret this distance and argue that it is a reasonable measure of distance between correlated variables. We use the case p = 2. Writing s_{12} = r s_1 s_2, we have

\mathbf{S}^{-1} = \frac{1}{1 - r^2} \begin{bmatrix} s_1^{-2} & -r s_1^{-1} s_2^{-1} \\ -r s_1^{-1} s_2^{-1} & s_2^{-2} \end{bmatrix}

and the (squared) Mahalanobis distance between two points (x_1, y_1), (x_2, y_2) can be expressed as:


d_M^2 = \frac{1}{1 - r^2} \left[ \frac{(x_1 - x_2)^2}{s_1^2} + \frac{(y_1 - y_2)^2}{s_2^2} - 2r \frac{(x_1 - x_2)(y_1 - y_2)}{s_1 s_2} \right].

If r = 0, this distance reduces to the Euclidean distance between the variables standardized by their standard deviations. When r is not zero, the Mahalanobis distance adds a term which is positive (and thus "separates" the points) when the differences between the variables have opposite signs and r > 0, or the same sign and r < 0. For example, between weight and height there is a positive correlation: when the height of a person increases, on average so does their weight. If we consider the three people from our earlier example, A(180, 80), B(170, 72) and C(165, 81), with standard deviations 10 cm and 10 kg and a correlation coefficient of 0.7, the squared Mahalanobis distances are

d_M^2(A, B) = \frac{1}{0.51} \left( 1 + 0.8^2 - 1.4 \times 0.8 \right) = 1.02

and

d_M^2(A, C) = \frac{1}{0.51} \left( 1.5^2 + 0.1^2 + 1.4 \times 1.5 \times 0.1 \right) = 4.84,

so we conclude that with this distance A is closer to B than to C. The Mahalanobis distance takes into account that, although B is shorter than A, there is a correlation between weight and height, and since B's weight is proportionately smaller as well, the physical aspect of the two is similar: the overall size changes but not the body shape. C, on the other hand, is even shorter than A and weighs more, which implies that his/her shape is quite different from A's. As a result, the distance from A to C is greater than the distance from A to B. The capacity of this distance to take the shape of an element into account through its correlation structure explains why it was introduced by P. C. Mahalanobis in the 1930s to compare body measurements in his anthropometric studies.
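A sketch reproducing these calculations (Python/NumPy; the correlation of 0.7 and the standard deviations of 10 cm and 10 kg are the values assumed in the example):

```python
import numpy as np

A = np.array([180.0, 80.0])
B = np.array([170.0, 72.0])
C = np.array([165.0, 81.0])

s = np.array([10.0, 10.0])           # standard deviations of height (cm) and weight (kg)
r = 0.7                              # assumed correlation between height and weight
S = np.array([[s[0]**2, r*s[0]*s[1]],
              [r*s[0]*s[1], s[1]**2]])
S_inv = np.linalg.inv(S)

def d2_euclid(u, v):
    return np.sum((u - v) ** 2)

def d2_standardized(u, v):
    return np.sum(((u - v) / s) ** 2)

def d2_mahalanobis(u, v):
    diff = u - v
    return diff @ S_inv @ diff

print(d2_euclid(A, B), d2_euclid(A, C))                               # 164.0, 226.0
print(d2_standardized(A, B), d2_standardized(A, C))                   # 1.64, 2.26
print(round(d2_mahalanobis(A, B), 2), round(d2_mahalanobis(A, C), 2)) # 1.02, 4.84
```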

3.6.3. The average distance

We might decide to construct a global measure of variability with respect to the mean of a vector variable by averaging the distances between the points and the mean. For example, if all the variables are measured in the same units, we can take the squared Euclidean distance and average over the number of terms in the sum:

V_m = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})'(\mathbf{x}_i - \bar{\mathbf{x}}).    (3.15)

Since a scalar is equal to its trace, we can express this as

V_m = \sum_{i=1}^{n} \mathrm{tr}\left( \frac{1}{n} (\mathbf{x}_i - \bar{\mathbf{x}})'(\mathbf{x}_i - \bar{\mathbf{x}}) \right) = \sum_{i=1}^{n} \mathrm{tr}\left( \frac{1}{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})' \right) = \mathrm{tr}(\mathbf{S}),

so the average of the squared distances is the total variability. If we also standardize by the dimension of the vector, then:

V_{m,p} = \frac{1}{np} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})'(\mathbf{x}_i - \bar{\mathbf{x}}) = \bar{s}^2,    (3.16)

and the standardized average of the squared Euclidean distances between the points and the mean is the average of the variances of the variables.

There is no reason to define a measure of variability by averaging the Mahalanobis distances, since it is easy to prove (see exercise 3.12) that the average of the Mahalanobis distances to the mean is always p, and the average standardized by the dimension of the vector is always equal to one.

Example: The table below, computed from the body measurement data, MEDIFIS, shows for each data point the squared Euclidean distance to the mean, d2e; the Mahalanobis distance to the mean, D2M; the maximum squared Euclidean distance between that point and any other point in the sample, d2em; the index of the farthest data point under this distance, Ie; the maximum Mahalanobis distance between that point and any other point in the sample, D2Mm; and the index of the farthest data point under this distance, IM:

order      d2e      D2M      d2em    Ie     D2Mm    IM
    1    3.8048   0.0226   29.0200    24   29.0200    24
    2    0.3588   0.0494   15.4800    24   15.4800    24
    3    0.2096   0.0447   10.0600    20   10.0600    20
    4    1.6899   0.0783   20.5925    24   20.5925    24
    5    2.2580   0.0759   23.8825    24   23.8825    24
    6    0.8336   0.0419   15.6000    24   15.6000    24
    7    2.8505   0.0830   23.5550    24   23.5550    24
    8    3.0814   0.0858   20.3300    20   20.3300    20
    9    3.6233   0.0739   21.7750    20   21.7750    20
   10    3.5045   0.0348   28.1125    24   28.1125    24
   11    2.0822   0.0956   20.2900    24   20.2900    24
   12    0.6997   0.1037   11.5425    20   11.5425    20
   13    6.2114   0.0504   34.7900    24   34.7900    24
   14    2.2270   0.0349   18.2700    20   18.2700    20
   15    4.2974   0.1304   23.2200    20   23.2200    20
   16   10.5907   0.1454   35.6400    20   35.6400    20
   17    1.7370   0.0264   16.9000    20   16.9000    20
   18    0.7270   0.0853   14.1100    24   14.1100    24
   19    4.5825   0.1183   30.5500    24   30.5500    24
   20    7.8399   0.0332   39.1100    24   39.1100    24
   21    4.4996   0.0764   23.9600    20   23.9600    20
   22    0.5529   0.0398   12.3100    20   12.3100    20
   23    3.9466   0.0387   29.3900    24   29.3900    24
   24   11.9674   0.0998   39.1100    20   39.1100    20
   25    0.4229   0.0745   10.6500    20   10.6500    20
   26    0.2770   0.0358   10.5850    20   10.5850    20
   27    0.9561   0.1114   17.6050    24   17.6050    24

We observe that with the Euclidean distance the points farthest from the mean are 24 and 16, followed by 20. Points 24 and 20 are the most extreme with this measure (see the variable Ie). With the Mahalanobis distance the points farthest from the mean are 15 and 16 but, nevertheless, the points that appear as extremes in the sample are again 20 and 24. Observing these data, 24 corresponds to a very tall man, the tallest in the sample, and 20 is a short, thin woman; they constitute the opposite extremes of the data.
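A quick numerical sketch of the two facts stated above (Python/NumPy, simulated data rather than the MEDIFIS file): the average squared Euclidean distance to the mean equals tr(S), and the average squared Mahalanobis distance to the mean equals p.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 100, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))    # arbitrary correlated data

x_bar = X.mean(axis=0)
D = X - x_bar
S = D.T @ D / n
S_inv = np.linalg.inv(S)

avg_euclid2 = np.mean(np.sum(D ** 2, axis=1))             # average of (3.15)-type distances
avg_mahal2 = np.mean(np.einsum("ij,jk,ik->i", D, S_inv, D))

print(np.isclose(avg_euclid2, np.trace(S)))                # True: equals tr(S)
print(np.isclose(avg_mahal2, p))                           # True: equals p
```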

3.7. MEASURES OF LINEAR DEPENDENCE

A basic objective in the description of multivariate data is to understand the structure of dependence between the variables. These dependencies can be studied: (1) between pairs of variables; (2) between one variable and the rest; (3) between pairs of variables after eliminating the effect of the other variables; (4) within the whole set of variables. We now analyze these four aspects.

3.7.1. Pairwise dependence: the correlation matrix

Linear dependence between two variables is studied using the linear or simple correlationcoe¢ cient. This coe¢ cient for the variables xj; xk is:

rjk =sjksjsk

and has the following properties: (1) 0 � jrjkj � 1; (2) if there is an exact linear rela-tionship between the variables, xij = a + bxik; then jrjkj = 1; (3) rjk is invariant to lineartransformations of the variables.Dependency pairs are measured by the correlation matrix. We let the correlation matrix

R; be the square and symmetric matrix which has ones in the principal diagonal and outsideit has the linear correlation coe¢ cients between pairs of variables. We write this as:

$$
R = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{pmatrix}
$$

This matrix is positive semidefinite. To prove this, we let $D = D(S)$ be the diagonal matrix of order $p$ formed by the elements of the principal diagonal of $S$, which are the variances of the variables. The matrix $D^{1/2}$ contains the standard deviations, and the matrix $R$ is related to the covariance matrix $S$ by:

$$R = D^{-1/2} S D^{-1/2}, \qquad (3.17)$$

which implies that

$$S = D^{1/2} R D^{1/2}. \qquad (3.18)$$

The condition $w'Sw \ge 0$ is equivalent to:

$$w' D^{1/2} R D^{1/2} w = z' R z \ge 0,$$

letting $z = D^{1/2} w$ be the vector transformed by $D^{1/2}$. Therefore, the matrix $R$, like the matrix $S$, is positive semidefinite.
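A minimal sketch of (3.17), assuming the covariance matrix S is stored as a NumPy array:

```python
import numpy as np

def correlation_from_covariance(S):
    """R = D^{-1/2} S D^{-1/2}, equation (3.17); a minimal sketch."""
    d = np.sqrt(np.diag(S))              # standard deviations
    R = S / np.outer(d, d)               # divides each s_jk by s_j * s_k
    np.fill_diagonal(R, 1.0)             # guard against rounding on the diagonal
    return R

# R inherits positive semidefiniteness from S:
# np.linalg.eigvalsh(correlation_from_covariance(S)) >= 0 up to rounding.
```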


3.7.2. Dependency of each variable and the rest: Multiple Regression

In addition to studying the relationship between pairs of variables, we can study the relationship between one variable and all the rest. We have seen that if a variable is a linear combination of the others, and thus can be predicted from them without error, we must eliminate it from consideration. It is possible, without reaching this extreme situation, that there are variables which are very dependent on the rest, and we will study how to measure their degree of dependency.

We shall assume that $x_j$ is the variable of interest and, in order to simplify the notation, we will call it the explained or response variable and denote it by $y$. Next, we will look at its best linear predictor based on the remaining variables, which we will call the explanatory or regressor variables. The form of this linear predictor is:

$$\hat y_i = \bar y + \hat\beta_1(x_{i1} - \bar x_1) + \cdots + \hat\beta_p(x_{ip} - \bar x_p), \qquad i = 1, \ldots, n \qquad (3.19)$$

and we see that when the explanatory variables take values equal to their means, the response variable is also equal to its mean. The $p-1$ coefficients $\hat\beta_k$, for $k = 1, \ldots, p$ with $k \ne j$, are determined so that the equation provides, on average, the best possible prediction of the values of $y_i$. Defining the residuals as the prediction errors, $e_i = y_i - \hat y_i$, it is clear, summing (3.19) over the $n$ data points, that the sum of the residuals over all the sample points is zero. This indicates that, no matter what the coefficients $\hat\beta_k$ are, equation (3.19) compensates the positive prediction errors with negative ones. Since we want to minimize the errors independently of their sign, we square them and calculate the $\hat\beta_k$ by minimizing:

$$M = \sum_{i=1}^n e_i^2,$$

Differentiating this expression with respect to the parameters $\hat\beta_k$ yields the system of $p-1$ equations, for $k = 1, \ldots, p$ with $k \ne j$:

$$2\sum_{i=1}^n \left[y_i - \bar y - \hat\beta_1(x_{i1} - \bar x_1) - \cdots - \hat\beta_p(x_{ip} - \bar x_p)\right](x_{ik} - \bar x_k) = 0,$$

which can be written as:

$$\sum e_i x_{ik} = 0, \qquad k = 1, \ldots, p, \quad k \ne j,$$

which has a clear intuitive interpretation. It indicates that the residuals, or prediction errors, must be uncorrelated with the explanatory variables, or, equivalently, that the covariance between them is zero. Indeed, if there were a relationship between the residuals and the explanatory variables, it could be used to predict the residuals and this prediction could be used to reduce them; the prediction equation would then not be optimal. Geometrically, this equation establishes that the residual vector must be orthogonal to the space generated by the explanatory variables. Defining a matrix $X_R$ of data for the regression, of dimensions $n \times (p-1)$, which


is obtained from the centered data matrix $\widetilde X$ by eliminating the column of this matrix which corresponds to the variable we wish to predict, which we call $y$, the equation system for obtaining the parameters is:

$$X_R' y = X_R' X_R \hat\beta$$

which yields:

$$\hat\beta = (X_R' X_R)^{-1} X_R' y = S_{p-1}^{-1} S_{xy},$$

where $S_{p-1}$ is the covariance matrix of the $p-1$ explanatory variables and $S_{xy}$ is the column of the covariance matrix corresponding to the covariances of the variable $y$ with the rest. The equation is known as the multiple regression equation between the variable $y = x_j$ and the variables $x_k$, with $k = 1, \ldots, p$ and $k \ne j$.

The average of the squared residuals when explaining $x_j$ with the multiple regression equation

is:

$$s_r^2(j) = \frac{\sum e_i^2}{n} \qquad (3.20)$$

and is a measure of the accuracy of the regression. We can obtain a dimensionless measure of dependency using the identity

$$y_i - \bar y = (\hat y_i - \bar y) + e_i$$

and squaring and summing over all the points yields the basic decomposition of the analysis of variance, which can be written as:

$$VT = VE + VNE$$

where the total or initial variability of the data, $VT = \sum(y_i - \bar y)^2$, is expressed as the sum of the variability explained by the regression, $VE = \sum(\hat y_i - \bar y)^2$, and the sum of squared residuals, or variability unexplained by the regression, $VNE = \sum e_i^2$. A descriptive measure of the predictive capacity of the model is the ratio between the variability explained by the regression and the total variability. This measure is called the determination coefficient, or squared multiple correlation coefficient, and is defined by:

$$R^2_{j\cdot 1,\ldots,p} = \frac{VE}{VT} = 1 - \frac{VNE}{VT} \qquad (3.21)$$

where the first subindex indicates the variable we are explaining and the remaining ones are the regressors used. By (3.20), we can write:

$$R^2_{j\cdot 1,\ldots,p} = 1 - \frac{s_r^2(j)}{s_j^2}. \qquad (3.22)$$

It is easily proved that in the case of a single explanatory variable $R^2$ is the square of the simple correlation coefficient between the two variables. We can also prove that it is the square of the simple correlation coefficient between the variables $y$ and $\hat y$. The squared multiple correlation coefficient can be greater or smaller than the sum of the squares of the simple correlations between the variable $y$ and each one of the explanatory variables.
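The calculations of this subsection can be sketched directly from the covariance matrix; the following is an illustrative, not definitive, implementation (X is a data matrix and j the column index of the response variable):

```python
import numpy as np

def multiple_regression_r2(X, j):
    """Regression of column j on the remaining columns through the covariance
    blocks, beta = S_{p-1}^{-1} S_xy, and R^2 = 1 - s_r^2(j)/s_j^2 (a sketch)."""
    S = np.cov(X, rowvar=False, bias=True)
    rest = [k for k in range(X.shape[1]) if k != j]
    S_rest = S[np.ix_(rest, rest)]           # covariance matrix of the explanatory variables
    S_xy = S[rest, j]                        # covariances of the response with the rest
    beta = np.linalg.solve(S_rest, S_xy)     # regression coefficients
    s2_res = S[j, j] - S_xy @ beta           # residual variance s_r^2(j)
    return beta, 1.0 - s2_res / S[j, j]
```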


According to equation (3.22) we can calculate the multiple correlation coefficient between any variable $x_j$ and the rest if we know its variance and the residual variance of a regression of this variable on the others. Appendix 3.1 shows that the diagonal terms of the inverse of the covariance matrix, $S^{-1}$, are precisely the inverses of the residual variances of the regression of each variable on the rest. Therefore, we can easily calculate the squared multiple correlation coefficient between the variable $x_j$ and the rest as follows:

(1) Take the diagonal element $j$ of the matrix $S$, $s_{jj}$, which is the variance $s_j^2$ of the variable to be predicted.

(2) Invert the matrix $S$ and take the diagonal element $j$ of the matrix $S^{-1}$, which we will call $s^{jj}$. This term is $1/s_r^2(j)$, the inverse of the residual variance of a regression between the variable $j$ and the rest.

(3) Calculate $R_j^2$, the squared multiple correlation, with:

$$R_j^2 = 1 - \frac{1}{s_{jj}\, s^{jj}}.$$

This equation permits us to obtain immediately all the multiple correlation coefficients between a variable and the rest using the matrices $S$ and $S^{-1}$.

Example:

The correlation matrix for the seven body variables, medifis.dat file, from example 3.1, is

shown next. The variables appear in the order used in 3.1.

$$
R = \begin{pmatrix}
1 & 0.83 & 0.93 & 0.91 & 0.84 & 0.59 & 0.84 \\
0.83 & 1 & 0.85 & 0.82 & 0.84 & 0.62 & 0.72 \\
0.93 & 0.85 & 1 & 0.85 & 0.80 & 0.55 & 0.85 \\
0.91 & 0.82 & 0.85 & 1 & 0.80 & 0.48 & 0.76 \\
0.84 & 0.84 & 0.80 & 0.80 & 1 & 0.63 & 0.63 \\
0.59 & 0.62 & 0.55 & 0.48 & 0.63 & 1 & 0.56 \\
0.84 & 0.72 & 0.85 & 0.76 & 0.63 & 0.56 & 1
\end{pmatrix}
$$

We observe that the maximum correlation appears between the first and third variables (height and foot length) and is 0.93. The minimum is between arm length and cranium diameter (0.48). In general, the lowest correlations appear between cranium diameter and the rest of the variables. The matrix $S^{-1}$ is:

$$
S^{-1} = \begin{pmatrix}
0.14 & 0.01 & -0.21 & -0.11 & -0.07 & -0.05 & -0.07 \\
0.01 & 0.04 & -0.08 & -0.03 & -0.04 & -0.04 & -0.00 \\
-0.21 & -0.08 & 1.26 & 0.06 & -0.05 & 0.18 & -0.29 \\
-0.11 & -0.03 & 0.06 & 0.29 & -0.04 & 0.13 & -0.04 \\
-0.07 & -0.04 & -0.05 & -0.04 & 0.34 & -0.13 & 0.15 \\
-0.05 & -0.04 & 0.18 & 0.13 & -0.13 & 0.64 & -0.15 \\
-0.07 & -0.00 & -0.29 & -0.04 & 0.15 & -0.15 & 0.50
\end{pmatrix}
$$

and using the diagonal elements of this matrix and of the matrix $S$ we can calculate the squared multiple correlations of each variable with the rest as follows: (1) we multiply the diagonal elements of the matrices $S$ and $S^{-1}$; the result of this operation is the vector (14.3672, 5.5415, 9.9898, 6.8536, 5.3549, 2.0784, 4.7560); (2) next, we calculate the inverses of those elements, which are (0.0696, 0.1805, 0.1001, 0.1459, 0.1867, 0.4811, 0.2103); (3) finally, we


subtract these coefficients from one, obtaining (0.9304, 0.8195, 0.8999, 0.8541, 0.8133, 0.5189, 0.7897), which are the squared multiple correlation coefficients between each variable and the rest. From these results we see that the most predictable variable is height ($R^2$ = 0.9304), then foot length ($R^2$ = 0.8999), followed by arm length ($R^2$ = 0.8541). The least predictable is the diameter of the cranium, which has a squared multiple correlation coefficient with the rest of 0.5189; in other words, the rest of the variables explain 52% of the variability of this variable.

The equation for predicting height as a function of the rest of the variables is easily obtained with any regression program. The result is

ht = 0.9 - 0.094 wt + 1.43 ftl + 0.733 arml + 0.494 bkw + 0.347 crd + 0.506 kn-al

which is the equation that lets us predict with the least error the height of a person given the rest of the measurements. The $R^2$ of this regression is 0.93, the result we obtained earlier. The equation for predicting foot length is:

ftl = 8.14 + 0.162 ht + 0.0617 wt - 0.051 arml + 0.037 bkw - 0.144 crd + 0.229 kn-al

which indicates that to predict foot length the most relevant variables are height and knee-to-ankle length. If we take sex as the variable to be explained, then:

sex = -3.54 - 0.0191 ht - 0.0013 wt + 0.141 ftl + 0.0291 arml + 0.0268 bkw - 0.0439 crd + 0.0219 kn-al

and the most important variable in predicting the sex of a person is foot length, which is the variable with the highest coefficient in the regression.
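The three-step recipe used in this example can be written compactly; a sketch assuming S is the covariance matrix of the seven measurements:

```python
import numpy as np

def squared_multiple_correlations(S):
    """R_j^2 = 1 - 1/(s_jj * s^jj) for every variable, from the diagonals of S
    and S^{-1}; a sketch of the three-step recipe used in the example."""
    products = np.diag(S) * np.diag(np.linalg.inv(S))   # step 1
    return 1.0 - 1.0 / products                          # steps 2 and 3
```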

3.7.3. Direct dependency between variables: Partial correlations

The direct dependency between two variables, controlling for the effect of the others, is given by the partial correlation coefficient. The partial correlation coefficient between two variables $(x_1, x_2)$, eliminating or controlling the variables $(x_3, \ldots, x_p)$, is denoted by $r_{12\cdot 3,\ldots,p}$ and is defined as the correlation coefficient between the variables $x_1$ and $x_2$ when the effects of the variables $(x_3, \ldots, x_p)$ have been eliminated from them.

This coefficient is obtained in two stages. First, we must find the part of the two variables that is not explained by (or is free of the effects of) the group of variables that are controlled. This part is the residual of the regression of each variable on the set of variables $(x_3, \ldots, x_p)$, since, by construction, the residual is the part of the response that cannot be predicted by, or is independent of, the regressors. Second, we calculate the simple correlation coefficient between these two residuals. Appendix 3.3 shows that the partial correlation coefficients between each pair of variables are obtained by standardizing the elements of the matrix $S^{-1}$. Specifically, if we let $s^{jk}$ be the elements of $S^{-1}$, the partial correlation coefficient between the variables $x_j, x_k$ is obtained with:

$$r_{jk\cdot 1,2,\ldots,p} = \frac{-s^{jk}}{\sqrt{s^{jj}\, s^{kk}}} \qquad (3.23)$$

The partial correlation coefficients can also be calculated from the multiple correlation coefficients using the following equation, which is proved in Appendix 3.3:

$$1 - r^2_{12\cdot 3,\ldots,p} = \frac{1 - R^2_{1\cdot 2,3,\ldots,p}}{1 - R^2_{1\cdot 3,\ldots,p}},$$


where $r^2_{12\cdot 3,\ldots,p}$ is the squared partial correlation coefficient between the variables $(x_1, x_2)$ when the variables $(x_3, \ldots, x_p)$ are controlled, $R^2_{1\cdot 2,\ldots,p}$ is the determination coefficient, or squared multiple correlation coefficient, in the regression of $x_1$ with respect to $(x_2, x_3, \ldots, x_p)$, and $R^2_{1\cdot 3,\ldots,p}$ is the determination coefficient in the regression of $x_1$ with respect to $(x_3, \ldots, x_p)$. (The result is invariant to exchanging $x_1$ for $x_2$.) This expression indicates a simple relationship between terms of the type $1 - r^2$, which, according to equation (3.21), represent the relative proportion of unexplained variability.

The partial correlation matrix, $P$, contains the partial correlation coefficients between pairs of variables eliminating the effects of the rest. For example, for four variables, the partial correlation matrix is:

$$
P_4 = \begin{pmatrix}
1 & r_{12\cdot 34} & r_{13\cdot 24} & r_{14\cdot 23} \\
r_{21\cdot 34} & 1 & r_{23\cdot 14} & r_{24\cdot 13} \\
r_{31\cdot 24} & r_{32\cdot 14} & 1 & r_{34\cdot 12} \\
r_{41\cdot 23} & r_{42\cdot 13} & r_{43\cdot 12} & 1
\end{pmatrix}
$$

where, for example, $r_{12\cdot 34}$ is the correlation between variables 1 and 2 when we eliminate the effect of 3 and 4, or when the variables 3 and 4 remain constant. According to (3.23) this matrix is obtained as

$$P = (-1)^{\mathrm{diag}}\, D(S^{-1})^{-1/2}\, S^{-1}\, D(S^{-1})^{-1/2},$$

where $D(S^{-1})$ is the diagonal matrix obtained by selecting the diagonal elements of the matrix $S^{-1}$, and the term $(-1)^{\mathrm{diag}}$ indicates that we change the sign of all the elements of the matrix except for the diagonal elements, which are equal to one. Equation (3.23) is similar to (3.17), but it uses the matrix $S^{-1}$ instead of $S$. We see that $D(S^{-1})^{-1/2}$ is not the inverse of $D(S)^{-1/2} = D^{-1/2}$ and that, as a result, $P$ is not the inverse matrix of $R$.
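A minimal sketch of this construction of P from the inverse covariance matrix (the function name is illustrative):

```python
import numpy as np

def partial_correlation_matrix(S):
    """P = (-1)^diag D(S^{-1})^{-1/2} S^{-1} D(S^{-1})^{-1/2}; a sketch."""
    S_inv = np.linalg.inv(S)
    d = 1.0 / np.sqrt(np.diag(S_inv))
    P = -S_inv * np.outer(d, d)          # off-diagonal: -s^ij / sqrt(s^ii s^jj)
    np.fill_diagonal(P, 1.0)             # diagonal elements equal to one
    return P
```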

3.7.4. The Effective Dependence Coefficient

In order to obtain a joint measure of the dependency among the variables we can use the determinant of the correlation matrix, which measures how far the set of variables is from the situation of perfect linear dependence. Appendix 3.2 shows that $0 \le |R| \le 1$ and:

(1) If the variables are all uncorrelated, $R$ is a diagonal matrix with ones in the diagonal and $|R| = 1$.

(2) If a variable is a linear combination of the rest, we have seen that $S$ and $R$ are singular and $|R| = 0$.

(3) In the general case, Appendix 3.2 presents the proof that:

$$|R_p| = \left(1 - R^2_{p\cdot 1,\ldots,p-1}\right)\left(1 - R^2_{p-1\cdot 1,\ldots,p-2}\right)\cdots\left(1 - R^2_{2\cdot 1}\right). \qquad (3.24)$$

That is, the determinant of the correlation matrix is the product of $p-1$ terms. The first represents the proportion of unexplained variability in a multiple regression between the variable $p$ and the remaining variables $p-1, p-2, \ldots, 1$. The second represents the proportion of unexplained variability in a multiple regression between the variable $p-1$ and the following


remaining variables, $p-2, p-3, \ldots, 1$. The last represents the proportion of unexplained variability in a simple regression between variables two and one.

In agreement with this property, $|R_p|^{1/(p-1)}$ represents the geometric mean of the proportions of variability left unexplained by the above regressions. We observe that it is also the geometric mean of the eigenvalues of the matrix $R_p$, taking into account that we only have $p-1$ independent eigenvalues, as they must verify $\sum \lambda_i = p$.

The coefficient of effective dependence is defined by:

$$D(R_p) = 1 - |R_p|^{1/(p-1)} \qquad (3.25)$$

and is a good global measure of dependence in the data (see Peña and Rodríguez, 2000). For $p = 2$, since $|R_2| = 1 - r_{12}^2$, this measure coincides with the square of the linear correlation coefficient between the two variables. For $p > 2$ we can write, from (3.24) and (3.25):

$$1 - D(R_p) = \left[\left(1 - R^2_{p\cdot 1,\ldots,p-1}\right)\left(1 - R^2_{p-1\cdot 1,\ldots,p-2}\right)\cdots\left(1 - R^2_{2\cdot 1}\right)\right]^{1/(p-1)}$$

and we see that the effective dependence is the correlation coefficient needed for the unexplained variability in the problem to be equal to the geometric mean of all the possible unexplained variabilities. The effective correlation coefficient is given by

$$\rho(R_p) = D(R_p)^{1/2} = \sqrt{1 - |R_p|^{1/(p-1)}}.$$

In the specific case in which $p = 2$, the effective correlation coefficient coincides with the absolute value of the simple correlation coefficient.

Example:

We are going to construct the partial correlation matrix for the seven body measurements,

medifis.dat file. We can construct this matrix starting from $S^{-1}$, standardizing by the diagonal elements, to obtain:

$$
P = \begin{pmatrix}
1.00 & -0.19 & 0.48 & 0.52 & 0.32 & 0.17 & 0.27 \\
-0.19 & 1.00 & 0.37 & 0.30 & 0.34 & 0.26 & 0.00 \\
0.48 & 0.37 & 1.00 & -0.11 & 0.07 & 0.20 & 0.37 \\
0.52 & 0.30 & -0.11 & 1.00 & 0.13 & -0.31 & 0.10 \\
0.32 & 0.34 & 0.07 & 0.13 & 1.00 & 0.29 & -0.37 \\
0.17 & 0.26 & 0.20 & -0.31 & 0.29 & 1.00 & 0.27 \\
0.27 & 0.00 & 0.37 & 0.10 & -0.37 & 0.27 & 1.00
\end{pmatrix}
$$

This matrix shows that the strongest partial relationships of height are with foot length (0.48) and with arm length (0.52). For example, the interpretation of the latter coefficient is that if we consider people with the same weight, foot length, back width, cranium diameter and knee-to-ankle length, there is a positive correlation of 0.52 between height and arm length. The table also shows that, for people of the same height, weight and other body measurements, the correlation between back width and knee-to-ankle length is negative.

To obtain a measure of global dependence, since the determinant of $R$ is $1.42 \times 10^{-4}$, the effective dependence coefficient is

$$D = 1 - |R|^{1/6} = 1 - \sqrt[6]{1.42 \times 10^{-4}} = 0.771$$


and we can conclude that, globally, linear dependence explains 77% of the variability of this set of data.

Example:

We calculate the effective dependence coefficient for the data in the files indicated, in the original units in which they were presented:

        EUROALI  EUROSEC  EPF   INVEST  MUNDODES  ACCION
   D    0.51     0.80     0.62  0.998   0.82      0.61

We observe that in INVEST the joint dependency is very strong. This suggests that we can reduce the number of variables needed to describe the information it contains.

3.8. THE PRECISION MATRIX

The inverse of the covariance matrix is called the precision matrix. This matrix plays an important role in many statistical procedures, as we will see in later chapters. One important result is that the precision matrix contains information about the multivariate relationship between each variable and the rest. This result is surprising at first glance, as the covariance matrix only contains information about the relationships between pairs of variables. It can be proved (see Appendix 3.1) that the inverse of the covariance matrix contains:

(1) By rows, and outside the diagonal, the multiple regression coefficients of the variable corresponding to that row explained by all the others, with a change of sign and multiplied by the inverse of the residual variance of that regression. That is, letting $s^{ij}$ be the elements of the precision matrix:

$$s^{ij} = -\hat\beta_{ij}/s_r^2(i),$$

where $\hat\beta_{ij}$ is the coefficient of the variable $j$ in the regression that explains the variable $i$, and $s_r^2(i)$ is the residual variance of that regression.

(2) In the diagonal, the inverses of the residual variances of the regression of each variable on the rest. That is:

$$s^{ii} = 1/s_r^2(i).$$

(3) If we standardize the elements of this matrix so that it has ones in the diagonal, the elements outside the diagonal are the partial correlation coefficients between these variables:

$$r_{ij\cdot R} = \frac{-s^{ij}}{\sqrt{s^{ii}\, s^{jj}}}$$

where $R$ refers to the rest of the variables, that is, the set of $p-2$ variables $x_k$ with $k = 1, \ldots, p$ and $k \ne i, j$.

For example, with four variables, the first row of the inverse covariance matrix is

$$\left(s_R^{-2}(1),\; -s_R^{-2}(1)\hat\beta_{12},\; -s_R^{-2}(1)\hat\beta_{13},\; -s_R^{-2}(1)\hat\beta_{14}\right)$$

where $s_R^2(1)$ is the residual variance of a regression between the first variable and the other three, and $\hat\beta_{12}, \hat\beta_{13}, \hat\beta_{14}$ are the regression coefficients in the equation

$$\hat x_1 = \hat\beta_{12} x_2 + \hat\beta_{13} x_3 + \hat\beta_{14} x_4$$


where we assume, with no loss of generality, that the variables have zero mean. Therefore, the matrix $S^{-1}$ contains all the information about the regressions of each variable on the others.

Example:

Let us calculate and interpret the precision matrix of the logarithms of the stock data, acciones.dat file, from example 3.5. This matrix is

$$
S^{-1} = \begin{pmatrix}
52.0942 & -47.9058 & 52.8796 \\
-47.9058 & 52.0942 & -47.1204 \\
52.8796 & -47.1204 & 60.2094
\end{pmatrix}
$$

For example, the first row of this matrix can be written as $52.0942 \times (1.0000, -0.9196, 1.0151)$, which indicates that the residual variance of a regression between the first variable and the other two is $1/52.0942 = 0.0192$, and that the regression coefficients of the variables $X_2$ and $X_3$ in a regression to explain $X_1$ are $0.9196$ and $-1.0151$, respectively. We observe once again that the combination $z = X_1 - X_2 + X_3$ seems to have little variability. The variance of the regression, 0.019, is less than that of the variable $z$, since it represents a conditional variability, when the variables $X_2$ and $X_3$ are known.
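A sketch of how the regressions described above can be read off the precision matrix (the function name is illustrative, not from the text):

```python
import numpy as np

def regressions_from_precision(S):
    """For each variable i, recover from S^{-1} the residual variance and the
    coefficients of its regression on the remaining variables (a sketch)."""
    S_inv = np.linalg.inv(S)
    results = []
    for i in range(S.shape[0]):
        s2_res = 1.0 / S_inv[i, i]            # residual variance s_r^2(i)
        beta = -S_inv[i, :] * s2_res          # beta_ij = -s^ij / s^ii
        results.append((s2_res, np.delete(beta, i)))
    return results
```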

3.9. ASYMMETRY AND KURTOSIS COEFFICIENTS

The generalization of the asymmetry and kurtosis coefficients to the multivariate case is not immediately obvious. One of the most often used proposals is from Mardia (1970), who proposed calculating the Mahalanobis distances for each pair of sample elements $(i, j)$:

$$d_{ij} = (x_i - \bar x)'\, S^{-1}\, (x_j - \bar x),$$

and defined the multivariate asymmetry coefficient of the joint distribution of the $p$ variables as

$$A_p = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n d_{ij}^3,$$

and the kurtosis coefficient as

$$K_p = \frac{1}{n}\sum_{i=1}^n d_{ii}^2.$$

These coefficients have the following properties:

1. For scalar variables, $A_p = A^2$, where $A$ is the univariate asymmetry coefficient. Note that in this case

$$A_p = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \frac{(x_i - \bar x)^3 (x_j - \bar x)^3}{s^6} = \frac{\left(\sum_{i=1}^n (x_i - \bar x)^3\right)^2}{n^2 s^6} = A^2.$$

2. The asymmetry coefficient is non-negative and will be zero if the data are distributed homogeneously in a sphere.

3. For scalar variables, $K_p = K$, the univariate kurtosis coefficient. The result follows immediately because then $d_{ii}^2 = (x_i - \bar x)^4/s^4$.


4. The coefficients are invariant under linear transformations of the data. If $y = Ax + b$, the asymmetry and kurtosis coefficients of $y$ and of $x$ are identical.

Example:

We calculate the multivariate asymmetry and kurtosis coefficients for the data on profitability of the stocks. It can be proven that if we take the data in its original measurements, the multivariate asymmetry coefficient is $A_p = 16.76$. This value will be, in general, greater than the univariate coefficients, which are, respectively, 0.37, 0.04, and 2.71. If we take logarithms of the data, $A_p = 7.5629$, whereas the univariate values are 0.08, -0.25 and 1.02. We can conclude that, effectively, the logarithmic transformation serves to make the data more symmetric. The multivariate kurtosis coefficient is $K_p = 31.26$, which must be compared with the univariate values of 1.38, 1.40, and 12.44. Taking logarithms, the multivariate coefficient is $K_p = 21.35$, whereas the univariate values are 1.43, 1.75, and 4.11; thus we see that the kurtosis can also be reduced using logarithms.
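A compact sketch of Mardia's two coefficients, assuming a data matrix X with the observations in rows:

```python
import numpy as np

def mardia_coefficients(X):
    """Mardia's multivariate asymmetry A_p and kurtosis K_p (a sketch)."""
    n = X.shape[0]
    S = np.cov(X, rowvar=False, bias=True)    # covariance dividing by n
    Xc = X - X.mean(axis=0)
    D = Xc @ np.linalg.inv(S) @ Xc.T          # D[i, j] = d_ij
    A_p = (D**3).sum() / n**2
    K_p = (np.diag(D)**2).sum() / n
    return A_p, K_p
```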

EXERCISES

Calculate the vector of means and of medians for the three variables in ACCIONES, Table A.7. Compare their advantages as location measures of these variables.

There are 3 economic indicators $x_1, x_2, x_3$ which are measured in four countries, with the following results:

   x1  x2  x3
    2   3  -1
    1   5  -2
    2   2   1
    2   3   1

Calculate the vector of means, the variance and covariance matrix, the generalized variance, the correlation matrix, and the largest characteristic root and vector of those matrices.

Starting with the three economic indicators $x_1, x_2, x_3$ from the above exercise, two new indicators are constructed:

$$y_1 = (1/3)x_1 + (1/3)x_2 + (1/3)x_3$$

$$y_2 = x_1 - 0.5x_2 - 0.5x_3$$

Calculate the vector of means for $y' = (y_1, y_2)$, its covariance matrix, the correlation matrix and the generalized variance.

Prove that the matrix $\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$ has eigenvalues $1 + r$ and $1 - r$ and eigenvectors $(1, 1)$ and $(1, -1)$.

Prove that if $Y = XA$, where $Y$ is $n \times m$ and $X$ is $n \times p$, the covariance matrix of $Y$ is related to that of $X$ by $S_y = A'S_xA$.

Calculate the multiple correlation coefficients between each variable and all the rest for the INVES data.

Calculate the partial correlation matrix for the INVES data.

Prove that the residual variance of a multiple regression between a variable $y$ and a set of variables $x$ can be written as $s_y^2(1 - R^2)$, where $s_y^2$ is the variance of the variable $y$ and $R^2$ is the multiple correlation coefficient.


Calculate the partial correlation coefficients between the variables of the set of stocks using regressions and using the elements of the matrix $S^{-1}$, and prove the equivalence.

Calculate the coefficient of multivariate asymmetry for a vector of two uncorrelated variables. What is the relationship between the multivariate and univariate coefficients of asymmetry?

Repeat the above exercise for the kurtosis coefficients.

Prove that for a set of data $\frac{1}{np}\sum_{i=1}^n (x_i - \bar x)'S^{-1}(x_i - \bar x) = 1$ (suggestion: take traces and use $\mathrm{tr}\left[\sum_{i=1}^n (x_i - \bar x)'S^{-1}(x_i - \bar x)\right] = \mathrm{tr}\left[S^{-1}\sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)'\right]$).

Prove that we can calculate the matrix of squared Euclidean distances between the points with the operation $\mathrm{diag}(XX')\,1' + 1\,\mathrm{diag}(XX')' - 2XX'$, where $X$ is the data matrix, $\mathrm{diag}(XX')$ is the vector whose components are the diagonal elements of $XX'$, and $1$ is a vector of ones.

Prove that we can calculate the matrix of squared Mahalanobis distances between the points with the operation $\mathrm{diag}(XS^{-1}X')\,1' + 1\,\mathrm{diag}(XS^{-1}X')' - 2XS^{-1}X'$, where $X$ is the data matrix, $\mathrm{diag}(XS^{-1}X')$ is the vector whose components are the diagonal elements of the matrix $XS^{-1}X'$, and $1$ is a vector of ones.

APPENDIX 3.1: THE STRUCTURE OF THE PRECISION MATRIX

We partition the matrix $S$, separating the variables into two blocks: variable 1, which we will call $y$, and the rest, which we call $R$. Then:

$$
S = \begin{pmatrix} s_1^2 & c_{1R}' \\ c_{1R} & S_R \end{pmatrix}
$$

where $s_1^2$ is the variance of the first variable, $c_{1R}$ is the vector of covariances between the first variable and the rest, and $S_R$ is the covariance matrix of the rest. Its inverse, using the results from the previous chapter on the inverse of a partitioned matrix, is:

$$
S^{-1} = \begin{pmatrix} \left(s_1^2 - c_{1R}'S_R^{-1}c_{1R}\right)^{-1} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.
$$

To simplify, we assume that the mean of all the variables is zero. Then the regression of the first variable on the rest has the coefficients:

$$\hat\beta_{1R} = S_R^{-1} c_{1R}.$$

In order to find the relationship we are looking for, we use the basic identity of the analysis of variance (ANOVA):

$$\frac{1}{n}VT = \frac{1}{n}VE + \frac{1}{n}VNE$$

We apply this decomposition to the first variable. The first term is $s_1^2$, the variance of the first variable, and the second, since $\hat y = X_R\hat\beta_{1R}$, can be written:

$$\frac{1}{n}VE = \frac{1}{n}\left(\hat y'\hat y\right) = \hat\beta_{1R}'\, S_R\, \hat\beta_{1R} = c_{1R}'S_R^{-1}S_R S_R^{-1}c_{1R} = c_{1R}'S_R^{-1}c_{1R},$$


and the third, $VNE/n = \sum e_{1R}^2/n = s_r^2(1)$, where we let $e_{1R}$ be the residuals of the regression of the first variable on the rest, and $s_r^2(1)$ be the residual variance of this regression, without correcting for degrees of freedom. Putting these terms into the basic identity of the ANOVA, we find that the residual variance can be calculated as:

$$s_r^2(1) = s_1^2 - c_{1R}'S_R^{-1}c_{1R}.$$

If we compare this equation with the first term of the matrix $S^{-1}$, we conclude that the first diagonal term of $S^{-1}$ is the inverse of the variance of the residuals (divided by $n$ and without correcting for degrees of freedom) in a regression between the first variable and the rest. Since this analysis can be carried out for any of the variables, we conclude that the diagonal terms of $S^{-1}$ are the inverses of the residual variances in the regressions between each variable and the rest.

In order to obtain the expression of the terms outside the diagonal of $S^{-1}$, we apply the formula for the inverse of the partitioned matrix:

$$A_{12} = -\left(s_r^2(1)\right)^{-1} c_{1R}'S_R^{-1} = -\left(s_r^2(1)\right)^{-1}\hat\beta_{1R}',$$

and, therefore, the rows of the matrix $S^{-1}$ contain the regression coefficients (with the sign changed) of each variable with respect to the rest, divided by the residual variance of the regression (without correcting for degrees of freedom). To summarize, $S^{-1}$ can be written:

$$
S^{-1} = \begin{pmatrix}
s_r^{-2}(1) & -s_r^{-2}(1)\hat\beta_{1R}' \\
\vdots & \vdots \\
-s_r^{-2}(p)\hat\beta_{pR}' & s_r^{-2}(p)
\end{pmatrix},
$$

where $\hat\beta_{jR}$ represents the vector of regression coefficients obtained by explaining the variable $j$ through the remaining variables. We observe that in this matrix the subindex $R$ refers to the set of $p-1$ variables that remain upon taking as the response variable the one occupying the corresponding row of the matrix. For example, $\hat\beta_{pR}'$ is the vector of regression coefficients of variable $p$ on the variables $(1, \ldots, p-1)$.
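A numerical check of this structure on simulated data (a sketch; all names are illustrative):

```python
import numpy as np

# Row i of S^{-1} is s_r^{-2}(i) * (1, -beta_i') up to the ordering of the columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 0] += 0.8 * X[:, 1] - 0.5 * X[:, 2]          # introduce some dependence
S = np.cov(X, rowvar=False, bias=True)
S_inv = np.linalg.inv(S)

Xc = X - X.mean(axis=0)
y, XR = Xc[:, 0], Xc[:, 1:]
beta = np.linalg.solve(XR.T @ XR, XR.T @ y)       # regression of variable 1 on the rest
s2_res = np.mean((y - XR @ beta) ** 2)            # residual variance (divided by n)

print(np.isclose(S_inv[0, 0], 1.0 / s2_res))      # diagonal term
print(np.allclose(S_inv[0, 1:], -beta / s2_res))  # off-diagonal terms of the first row
```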

APPENDIX 3.2: THE DETERMINANTS OF S AND R

We are going to obtain expressions for the determinants of the covariance and correlation matrices, using the results for partitioned matrices from Chapter 2. We write the covariance matrix as:

$$
S_p = \begin{pmatrix} s_1^2 & c_{1R}' \\ c_{1R} & S_{p-1} \end{pmatrix}
$$


where $s_1^2$ is the variance of the first variable, $c_{1R}'$ contains the covariances between the first variable and the rest, and we now use the notation $S_p$ to refer to the covariance matrix of the corresponding $p$ variables. Applying the formula for the determinant of a partitioned matrix, we can write

$$|S_p| = |S_{p-1}|\, s_1^2\left(1 - R^2_{1\cdot 2,\ldots,p}\right)$$

where $R^2_{1\cdot 2,\ldots,p}$ is the squared multiple correlation coefficient between the first variable and the rest, which is given, using the results from Appendix 3.1, by

$$R^2_{1\cdot 2,\ldots,p} = \frac{1}{s_1^2}\, c_{1R}' S_{p-1}^{-1} c_{1R}.$$

Analogously, we write the partitioned correlation matrix as

$$
R_p = \begin{pmatrix} 1 & r_{1R}' \\ r_{1R} & R_{p-1} \end{pmatrix}
$$

where $r_{1R}$ and $R_{p-1}$ are, respectively, the vector of correlations of the first variable with the rest and the correlation matrix of the remaining variables. Then,

$$|R_p| = |R_{p-1}|\left(1 - R^2_{1\cdot 2,\ldots,p}\right), \qquad (3.26)$$

since

$$R^2_{1\cdot 2,\ldots,p} = r_{1R}'\, R_{p-1}^{-1}\, r_{1R}.$$

To prove this equality, we observe that the relationship between the correlation and covariance vectors is $r_{1R} = D_{p-1}^{-1/2}\, c_{1R}/s_1$, where $D_{p-1}^{-1/2}$ contains the inverses of the standard deviations of the $p-1$ variables. Since $R_{p-1} = D_{p-1}^{-1/2} S_{p-1} D_{p-1}^{-1/2}$, we have

$$r_{1R}'\, R_{p-1}^{-1}\, r_{1R} = (c_{1R}'/s_1)\, D_{p-1}^{-1/2} D_{p-1}^{1/2} S_{p-1}^{-1} D_{p-1}^{1/2} D_{p-1}^{-1/2}\, (c_{1R}/s_1) = \frac{1}{s_1^2}\, c_{1R}' S_{p-1}^{-1} c_{1R} = R^2_{1\cdot 2,\ldots,p}$$

Successively applying equation (3.26), we have

$$|R_p| = \left(1 - R^2_{1\cdot 2,\ldots,p}\right)\left(1 - R^2_{2\cdot 3,\ldots,p}\right)\cdots\left(1 - r^2_{p-1,p}\right).$$
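A numerical check of this factorization on simulated data (a sketch; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated variables
R = np.corrcoef(X, rowvar=False)

prod = 1.0
for k in range(R.shape[0] - 1):
    Rk = R[k:, k:]                                # correlations of variables k, k+1, ...
    r = Rk[0, 1:]
    R2 = r @ np.linalg.solve(Rk[1:, 1:], r)       # R^2 of variable k on the following ones
    prod *= 1.0 - R2

print(np.isclose(np.linalg.det(R), prod))
```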

APPENDIX 3.3: PARTIAL CORRELATIONS

The partial correlation coefficient is the simple correlation coefficient in a regression between residuals. Its square can, as usual, be interpreted as the proportion of explained variance with respect to the total, where in this case the total variation is that which was left unexplained by a previous regression. We are going to use this interpretation to obtain the relationship between the partial and multiple correlation coefficients.


We assume $p$ variables and we are going to find the partial correlation coefficient between the variables $x_1$ and $x_2$ when $x_3, \ldots, x_p$ are controlled. To do this we carry out a simple regression between two variables: the first is $e_{1\cdot 3,\ldots,p}$, the residuals of a regression between $x_1$ and $x_3, \ldots, x_p$, and the second is $e_{2\cdot 3,\ldots,p}$, the residuals of a regression between $x_2$ and $x_3, \ldots, x_p$. The simple correlation coefficient between these residuals, $r_{12\cdot 3,\ldots,p}$, is the partial correlation coefficient. The construction of this coefficient is symmetric in the pair of variables but, assuming that we take the first variable as the dependent one in the regression, the estimated equation between residuals is

$$\hat e_{1\cdot 3,\ldots,p} = b_{12\cdot 3,\ldots,p}\, e_{2\cdot 3,\ldots,p}$$

and the correlation coefficient of this regression, which is the partial correlation, is

$$r_{12\cdot 3,\ldots,p} = b_{12\cdot 3,\ldots,p}\, \frac{s(e_{2\cdot 3,\ldots,p})}{s(e_{1\cdot 3,\ldots,p})}$$

We are now going to prove that these terms can be obtained from the matrix $S^{-1}$. In this matrix, $s^{12} = -s_r^{-2}(1)\hat\beta_{12\cdot 3,\ldots,p} = s^{21} = -s_r^{-2}(2)\hat\beta_{21\cdot 3,\ldots,p}$, since the matrix is symmetric. Dividing $s^{12}$ by the square root of the elements $s^{11}$ and $s^{22}$, and changing the sign, we have

$$r_{12\cdot 3,\ldots,p} = \frac{-s^{12}}{\sqrt{s^{22}\, s^{11}}} = \frac{s_r^{-2}(1)\hat\beta_{12\cdot 3,\ldots,p}}{s_r^{-1}(1)\, s_r^{-1}(2)} = \hat\beta_{12\cdot 3,\ldots,p}\, \frac{s_r(2)}{s_r(1)}$$

which is a fast way to compute $r_{12\cdot 3,\ldots,p}$, the partial correlation coefficient.

In the regression between residuals, the quotient of the unexplained variability over the total is one minus the square of the correlation coefficient. The unexplained variability in this regression is the unexplained variability of the first variable with respect to all the variables, which we will call $VNE_{1\cdot 2,3,\ldots,p}$ ($e_{1\cdot 3,\ldots,p}$ contained the part unexplained by the variables $3, \ldots, p$, and we have now added $x_2$). The total variability of the regression is that of the residuals $e_{1\cdot 3,\ldots,p}$, that is, the part left unexplained in the regression of $x_1$ with respect to $x_3, \ldots, x_p$. Thus, we can write:

$$1 - r^2_{12\cdot 3,\ldots,p} = \frac{VNE_{1\cdot 2,3,\ldots,p}}{VNE_{1\cdot 3,\ldots,p}}$$

1� r212;3:::p =V NE1;2;3;:::;pV NE1;3:::p

We are going to express V NE as a function of the multiple correlation coe¢ cients of thecorresponding regressions. Letting R21;3:::p be the determination coe¢ cient in the multipleregression of x1 with respect to x3; :::; xp:

1�R21;3:::p =V NE1;3;::;:pV T1

where V T1 is the variability of the �rst variable. Analogously, in the multiple regressionbetween the �rst variable and all the rest, x2; x3; :::; xp we have

1�R21;23:::p =V NE1;2;3;:::;p

V T1:

From these three equations we obtain

1� r212;3:::p =1�R21;23:::p1�R21;3:::p

(3.27)


which permits us to calculate the partial correlation coefficient as a function of the multiple correlation coefficients. Applying this expression repeatedly, we can also write

$$\left(1 - R^2_{1\cdot 2,3,\ldots,p}\right) = \left(1 - r^2_{12\cdot 3,\ldots,p}\right)\left(1 - r^2_{13\cdot 4,\ldots,p}\right)\cdots\left(1 - r^2_{1,p-1\cdot p}\right)\left(1 - r^2_{1p}\right)$$

It can also be proven (see Peña, 2000) that the partial correlation coefficient between the variables $(x_1, x_2)$ when the variables $(x_3, \ldots, x_p)$ are controlled can be expressed as a function of the coefficient of the variable $x_2$ in the regression of $x_1$ with respect to $(x_2, x_3, \ldots, x_p)$ and of its variance. The expression is:

$$r_{12\cdot 3,\ldots,p} = \frac{\hat\beta_{12\cdot 3,\ldots,p}}{\sqrt{\hat\beta^2_{12\cdot 3,\ldots,p} + (n - p - 1)\, s^2\!\left[\hat\beta_{12\cdot 3,\ldots,p}\right]}}$$

where $\hat\beta_{12\cdot 3,\ldots,p}$ and its variance, $s^2\left[\hat\beta_{12\cdot 3,\ldots,p}\right]$, are obtained in the regression between variables of zero mean:

$$\hat x_1 = \hat\beta_{12\cdot 3,\ldots,p}\, x_2 + \hat\beta_{13\cdot 2,\ldots,p}\, x_3 + \cdots + \hat\beta_{1p\cdot 2,\ldots,p-1}\, x_p$$
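To close, a numerical check that (3.23) and (3.27) agree, on simulated data (a sketch; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))
S = np.cov(X, rowvar=False, bias=True)
S_inv = np.linalg.inv(S)

# partial correlation between x1 and x2 given x3, x4, from the precision matrix (3.23)
r12_partial = -S_inv[0, 1] / np.sqrt(S_inv[0, 0] * S_inv[1, 1])

def r2(S, j, regressors):
    """Squared multiple correlation of variable j on the listed regressors."""
    c = S[np.ix_([j], regressors)].ravel()
    return c @ np.linalg.solve(S[np.ix_(regressors, regressors)], c) / S[j, j]

lhs = 1.0 - r12_partial**2
rhs = (1.0 - r2(S, 0, [1, 2, 3])) / (1.0 - r2(S, 0, [2, 3]))   # right-hand side of (3.27)
print(np.isclose(lhs, rhs))
```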