
Modeling Multivariate Data

E. James Harner

April 10, 2012


Contents

1 Introduction
  1.1 Multivariate Data Structures
  1.2 Variable Typologies
  1.3 Individual and Variable Space
  1.4 An Overview

2 Numerical Summaries
  2.1 Measures of Location
  2.2 Measures of Scale and Association
  2.3 Derived Variables
  2.4 Robust Measures

3 Graphical Techniques
  3.1 Assessing Distributional Assumptions
    3.1.1 Quantile Plots
    3.1.2 Probability Plots
    3.1.3 Multivariate Residual Plots
    3.1.4 Directional Normality
  3.2 Visualizing Multivariate Data
    3.2.1 1-D Plots
    3.2.2 2-D Plots
    3.2.3 3-D Plots
    3.2.4 (d > 3)-D Plots
    3.2.5 Dynamic Projection Plots

4 Correlation Analysis
  4.1 Multiple Correlation Analysis
    4.1.1 Partial Correlation
    4.1.2 Multiple Correlation
  4.2 Canonical Correlation Analysis
  4.3 Partial Correlation Analysis
  4.4 Assessing the Correlation Model

5 Principal Component Analysis
  5.1 Basic Concepts
    5.1.1 Sample Principal Variables
    5.1.2 Robust Principal Variables
    5.1.3 Population Principal Variables
  5.2 Examining Principal Variables
    5.2.1 Interpreting Principal Variables
    5.2.2 Determining Dimensionality
    5.2.3 Viewing Principal Variables
    5.2.4 Fitting Principal Variables
  5.3 Plotting Principal Variables
  5.4 Generalized PCA
  5.5 Constrained PCA
    5.5.1 Canonical Principal Variables
    5.5.2 Partial Principal Variables
    5.5.3 Partial Canonical Principal Variables

6 Correspondence Analysis
  6.1 Basic Concepts
    6.1.1 Count Variable Model
    6.1.2 Categorical Variable Model
  6.2 Plotting Correspondent Variables
  6.3 Constrained Correspondence Analysis
  6.4 Multiple Correspondence Analysis

7 Factor Analysis
  7.1 The Basic Model
  7.2 The Principal Factor Method
  7.3 Variable Space Interpretation
  7.4 Sample Principal Factors
  7.5 The Rotation Problem

8 Discriminant Analysis
  8.1 The Basic Model
    8.1.1 The 2-sample Problem
    8.1.2 The g-sample Problem
  8.2 Interpreting Discriminant Variables
    8.2.1 Stepwise Discriminant Analysis
  8.3 Classification

9 Multivariate General Linear Models
  9.1 Multivariate Regression
  9.2 Multivariate Analysis of Variance
  9.3 Repeated Measures

10 Exploratory Projection Pursuit
  10.1 Basic Concepts
  10.2 Projection Pursuit Indices
  10.3 Projection Pursuit Guided Tours

11 Cluster Analysis
  11.1 Agglomerative Hierarchical Clustering
  11.2 Measures of Fit
  11.3 Other Agglomerative Methods
  11.4 Divisive Hierarchical Clustering
  11.5 Non-hierarchical Clustering
    11.5.1 K-means Clustering
    11.5.2 ISODATA
    11.5.3 Assessment

12 Multidimensional Scaling
  12.1 Metric MDS
    12.1.1 Classical MDS
    12.1.2 Least Squares MDS
    12.1.3 Sammon's MDS
  12.2 Nonmetric MDS
  12.3 Assessment and Interpretation of the Fit
    12.3.1 Determining Dimensionality
    12.3.2 Assessing the Fit
    12.3.3 Interpretation of the Spatial Representation
  12.4 Joint Space Analysis

A Vector and Matrix Algebra
  A.1 Matrix Algebra
  A.2 Vector Algebra
  A.3 Matrix Decompositions

B Distribution Theory
  B.1 Random Variables and Probability Distributions
  B.2 The Normal Family
  B.3 Other Continuous Distributions
  B.4 Maximum Likelihood and Related Estimators

C Statistical Software
  C.1 Lisp-Stat
  C.2 StatObjects


Chapter 1

Introduction

Multivariate analysis is the process of finding patterns in high-dimensional data and of formalizing this structure in a model. Although structure is sometimes specified a priori, guided explorations of multivariate space—numerically and graphically—are more common activities.

1.1 Multivariate Data Structures

The basic objects of statistical analyses are individuals and variables. An individual is often called a case, observation, respondent, subject, etc., depending on the field of investigation. The generic term individual will be used throughout this book except in examples in which the context suggests a more appropriate name. A variable¹ is an abstract object which assigns a unique value to each individual under study. Examples of individuals and variables in several research areas are given in the following table:

Field          Individual   Variables
Business       Company      Financial Characteristics
Ecology        Bird         Habitat Variables
Epidemiology   Individual   Risk Factors
Engineering    Process      Operating Characteristics
Genomics       Sample       Genes
Geology        Wells        Site and Production Variables
Psychology     Subjects     Personality Trait Variables
Sociology      City         Crime Variables

Variables defined on the same individuals define a data table. The data values, determined for each individual by each variable, are arranged in a table with rows representing individuals and columns representing variables.

¹The term random variable is often used in mathematical statistics to denote a rule for assigning values to the outcomes of an experiment.



The ith row gives the values for each variable on the ith individual; the jth column gives the values for all individuals on the jth variable. Put another way, the ijth cell is the value of the jth variable measured on the ith individual.

The most general situation is when p variables are measured on each of n individuals without sampling restrictions. The n individuals comprising the sample are obtained (perhaps randomly) from the complete population. In other cases, certain variables, called design variables, are fixed by the researcher. A categorical design variable partitions the population into groups. The distinction between random and design variables may be important in interpreting analyses.

Let p numerical variables be denoted by Y1, Y2, · · · , Yp. The data table can be represented by the following data matrix:

$$Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{bmatrix} = [y_{ij}]$$

where Y is n × p and yij is the value of the jth variable on the ith individual. A matrix is a mathematical construct consisting of numerical entries.

A value for a categorical variable is a name, or a number representing a name, and as such contains no numerical information. Therefore, categorical variables should not be represented in the data matrix directly. Instead, categorical variables are expressed in numerical form by using dummy variables. Suppose Yj is a categorical variable with g levels. The values 1, 2, · · · , g are called the formal levels of Yj and are in one-to-one correspondence with the g names representing the groups. Then Yj can be expressed in numerical form by g dummy variables denoted by Yj1, Yj2, · · · , Yjg, where Yjk is 1 if the jth variable evaluates to the kth level and is 0 otherwise, i.e., Yjk represents the presence or absence of the kth level of Yj. These g dummy variables correspond to g columns in the resulting data matrix.
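As a small illustration in R (the variable and level names here are hypothetical), a categorical variable with g = 3 levels can be expanded into its g dummy columns with model.matrix():

# hypothetical categorical variable with g = 3 levels
region <- factor(c("north", "south", "west", "north", "west"))

# one indicator (dummy) column per level; dropping the intercept keeps all g columns
model.matrix(~ region - 1)

Each row of the resulting matrix has a single 1 marking the level taken by that individual.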

The airpoll dataset in the mult package records 7 variables for each of 60 metropolitan areas.

> library(mult)

> data(airpoll)

> airpoll[1:5,]

         Rainfall Education Popden Nonwhite NOX SO2 Mortality
akronOH        36      11.4   3243      8.8  15  59     921.9
albanyNY       35      11.0   4281      3.5  10  39     997.9
allenPA        44       9.8   4260      0.8   6  33     962.4
atlantGA       47      11.1   3125     27.1   8  24     982.3
baltimMD       43       9.6   6441     24.4  38 206    1071.0


1.2 Variable Typologies

Variables define measurable quantities or categories which vary from individual to individual in a population. In this book variables are represented by X1, X2, · · · , Xq; Y1, Y2, · · · , Yp; Z1, Z2, · · · , Zr. The distinction is that the X's represent explanatory (or independent) variables, the Z's represent conditioning (or nuisance) variables, and the Y's represent outcome (or dependent) variables. In this general case, the data table will contain p + q + r columns.

A variable is actually an object which has other attributes in addition to its values. Most importantly, variables can be classified according to their type. Measurement theory defines scales according to the information preserved under certain mathematical operations. These scales are often referred to as nominal, ordinal, interval, and ratio. Mosteller and Tukey [1977] categorize variables according to their nature: names, grades, ranks, counts, counted fractions, amounts, and balances.

This book classifies variables functionally according to the applicable type of statistical analysis: categorical, ordered categorical, discrete numerical, and continuous numerical. The "values" taken by a categorical variable are names or levels. An ordered categorical variable has categories which represent some order or relative position. Ordered categorical variables could have values representing ranks. Discrete numerical variables take on distinct values in a given range; these values are generally counts. A continuous variable can take on any value in a given range; its values are generally measured in some units.

A variable may also have a role, and possibly a parameterized distribution, in addition to its type. The role identifies whether the variable will be an outcome variable (Y), an explanatory variable (X), a conditioning or "nuisance" variable (Z), a weight variable (W), a frequency variable (F), or a label identifier (L). The distributions commonly used in applications include the binomial and multinomial for categorical variables; the Poisson, binomial, and negative binomial for discrete numerical variables; and the normal, log-normal, and gamma distributions for continuous numerical variables.

The type, role, and distribution constrain the class of plots and analyses which should be applied to a variable or a group of variables. For example, if the species count recorded on sites is assigned the role "Y", the type "discrete", and the distribution "Poisson", and the habitat diversity score is assigned the role "X" and the type "continuous," then a Poisson regression may be the most appropriate analysis.

The variable attributes described above collectively are called metadata, i.e., they are data about the data. In order to do guided explorations or modeling of the data, a rich metadata environment is essential.

Other types of metadata are defined for each individual in the sample. These attributes are called state variables and their assigned "values" are useful within a statistical computing and graphics environment. State variables are not true variables since their values are determined by the investigator. The most common state characteristics are color and symbol, which determine the visual representation of the individual in plots; mask, which determines whether or not the individual is included in the analyses and plots; and label, which defines an identifier.

1.3 Individual and Variable Space

Assuming all variables are numeric, the rows of the data matrix consist of vectors of length p; the columns consist of vectors of length n. Individual space is a p-dimensional space in which the n individuals (rows) are represented by points. The scatter plot is a special case when p = 2. Variable space is an n-dimensional space in which the p variables are represented by vectors. This convention of representing individuals by points and variables by vectors is also used in biplots, in which both individuals and variables are represented in the same space. Figure 1.1 conceptually illustrates individual and variable space.

1.4 An Overview

This book presents several classes of multivariate models. Generally, these models are based on the multivariate normal distribution. However, extensions are developed when the normality assumption is not tenable or when alternative distributional assumptions are more appropriate.

Numerical Summaries (Chapter 2)

Arithmetic means, standard deviations, covariances, and correlation coefficients are the primary summary statistics for continuous numerical variables. Robust versions of these statistics are increasingly being used. Counts and cumulative counts are the principal summary statistics for categorical and ordered categorical variables, respectively. Counts also naturally arise from Poisson or binomial distributions for discrete numerical variables.

Graphical Techniques (Chapter 3)

Exploratory plots are used to find patterns and structure in the data. Trends, relationships, outliers, and unusual behavior are often revealed, which would otherwise go unnoticed. Plots should be interactive and dynamic; have the ability to respond to new or changed information; and be able to activate other displays or analyses. Multiple plots (views) of the same data table should be linkable to show relationships among several variables even if they are of different types.

The most useful plot for viewing and interacting with multivariate data is the 2-dProj, which dynamically projects high-dimensional data into a two-dimensional plane using orthogonality constraints [Hurley and Buja, 1990]. Other graph objects, including univariate and bivariate plots, are discussed along with the tools for controlling them.

Many multivariate techniques are based on a group of continuous numerical variables with an assumed multivariate normal distribution. Statistics based on these variables, such as distance measures, then have derived distributions (such as the gamma).


> biplot(princomp(airpoll), cor=TRUE)

[Figure 1.1: Conceptual Views of Individual and Variable Spaces — a biplot of the airpoll data with the cities shown as points and the seven variables (Rainfall, Education, Popden, Nonwhite, NOX, SO2, Mortality) shown as vectors on the first two components.]


This chapter assesses the underlying assumptions of the original variables and certain fundamental statistics using both numerical and graphical aids. If normality is not tenable, then variables may be transformed.

Transforming to attain normality is discussed along with transforming to attain symmetry, homogeneity of variances, additivity, and linearity. Finally, methods are presented for identifying outliers and influential observations in p-dimensional individual space.

Correlation Analysis (Chapter 4)

The relationships within a group or among two or more groups of continuous numerical variables are often of interest. The results are most meaningful when the underlying distribution is the multivariate normal. The simple product-moment correlation measures the strength of the linear relationship between two numerical variables. The correlations between all pairs of variables are gathered together in the correlation matrix. Partial correlation determines the correlations among Y1, Y2, · · · , Yp after removing the linear effects due to the numerical variables X1, X2, · · · , Xq.

A common problem is to determine the strength of the linear relationship between X1, X2, · · · , Xq and a single Y. The variable in X-space "closest" to Y defines a multiple regression. A measure of the strength of this relationship is called the multiple correlation. Multiple regression is more commonly defined when the X's represent design variables, but the multiple correlation is then meaningless.

A generalization of multiple correlation is to examine the relationships between two groups of variables, one determining the X-space and the other the Y-space. The canonical variables are the variables in the X-space and Y-space which are "closest" to each other while preserving certain orthogonality constraints. The canonical correlations measure the strength of the linear relationships defined by the canonical variables.

Principal Component Analysis (Chapter 5)

This method examines the internal structure of a single sample based on numerical variables, Y1, Y2, · · · , Yp. This technique is closely related to the dimensionality problem. Although the data may lie in p-dimensional individual space, the effective dimension may be s < p, i.e., the data "essentially" lies in an s-dimensional subspace. Principal component analysis finds the axes (principal variables) of greatest variability and the axes statistically representing (near) collinearity.

Redundancy analysis specializes principal component analysis by constraining the (principal) axes to be linear combinations of the explanatory variables, X1, X2, · · · , Xq.

Biplots, as a generalization of 2-dProj, provide a means of representing both the individuals and variables in the same space. That is, the biplot allows the relationships among the variables to be explored simultaneously in terms of individual and variable space. Furthermore, regions of individual space can be related to certain variables.

Correspondence Analysis (Chapter 6)

Count data, whether in the form of "abundances" in the original data table or in terms of frequency summaries for categorical variables, are not amenable to principal components, or related techniques, which assume underlying continuous numerical scales. Furthermore, the number of variables may greatly exceed the number of individuals.

Correspondence analysis provides a graphical representation of the individuals and variables in a type of biplot. Canonical correspondence analysis, like redundancy analysis, constrains the axes of the biplot to be linear combinations of the explanatory variables X1, X2, · · · , Xq. Partial canonical correspondence removes the linear effect of the nuisance variables Z1, Z2, · · · , Zr.

Factor Analysis (Chapter 7)

Factor analysis, like principal component analysis, aims to determine the effective dimensionality of the outcome space. It differs in that a model is specified to explain the variation in the Y's in terms of q underlying, but unmeasurable, factor variables. The factor variables are indeterminate unless restrictions are placed on their covariance structure. Different models are possible depending on the method of 'extracting' the factors.

Discriminant Analysis (Chapter 8)

The purpose of discriminant analysis is to differentiate among the g groups defined by the labels of a categorical variable Y using continuous numerical variables X1, X2, · · · , Xq. The discriminant variables are the variables in the X-space which maximally separate the groups defined by Y. Y may or may not represent a design variable. The results are most meaningful when the X's are assumed to follow a multivariate normal distribution. Discriminant analysis is equivalent to a one-way multivariate analysis of variance. Classification uses the discriminant function to classify an individual into one of the g groups.

Multivariate General Linear Models (Chapter 9)

These models fit continuous numerical outcome variables Y1, Y2, · · · , Yp in terms of linear combinations of the explanatory variables X1, X2, · · · , Xq. Three cases, depending on the types of the X's, are considered. Multivariate regression, multivariate analysis of variance (MANOVA), and multivariate analysis of covariance (MANCOVA) result by considering the X's to be continuous numerical, categorical, and a mixture of continuous numerical and categorical, respectively. The X's are considered to be design variables. The results are most meaningful when the Y's are assumed to be multivariate normal. A common type of problem concerns repeated measurements made on the same individual over time (or space). Provisions must be made for the correlations induced by the repeated measures.

Exploratory Projection Pursuit (Chapter 10)

Projection pursuit analyzes high-dimensional individual space, i.e., p > 3, using projections into low-dimensional spaces. Meaningful projections are found by optimizing an objective function, often defined in terms of an index. Objective functions can be defined in terms of the covariance matrix, but these functions simply find classical multivariate solutions such as principal component or discriminant projections. The principal use of projection pursuit is to find nonlinearities in high-dimensional data which cannot be found by standard multivariate techniques.

Cluster Analysis (Chapter 11)


Distances can be computed between all possible pairs of individuals in a sample or similarities can be computed between pairs of variables. A collection of techniques has been developed to analyze data summarized in terms of the resulting dissimilarity or similarity matrices. Hierarchical cluster analysis groups individuals or variables together in such a way that the groupings become successively more diffuse as their size increases. The relationships among the individuals or variables are represented in a tree structure. Non-hierarchical forms of cluster analysis create a specified number of groupings (at least approximately) of individuals or variables in such a way that individuals within a group are more similar than individuals across groups.

Multidimensional Scaling (Chapter 12)

Multidimensional scaling attempts to find a representation of the individuals or variables in a low-dimensional space such that their interpoint distances correspond to their dissimilarities or similarities. Both metric and nonmetric methods are available and extensions allow a type of biplot to be generated.


Chapter 2

Numerical Summaries

Classical multivariate techniques depend on certain linear and quadratic functions of the data. Sample means, covariances, and correlation coefficients are defined in this chapter. Robust versions of these statistics are also developed.

2.1 Measures of Location

The arithmetic sample mean of the jth outcome variable is defined by:

$$\mathrm{mean}(Y_j) = \bar{y}_j = \frac{\sum_{i=1}^{n} y_{ij}}{n}$$

This sample mean is an estimate of the population mean of Yj, which is denoted by µj. The sample mean vector for Y1, Y2, · · · , Yp is given by:

$$\bar{y}' = \begin{bmatrix} \bar{y}_1 & \bar{y}_2 & \cdots & \bar{y}_p \end{bmatrix}$$

The corresponding population mean vector is given by:

$$\mu' = \begin{bmatrix} \mu_1 & \mu_2 & \cdots & \mu_p \end{bmatrix}$$

The mean vector for the airpoll variables is:

> library(mult)

> data(airpoll)

> n <- nrow(airpoll)

> round(colMeans(airpoll), 2)

Rainfall Education   Popden Nonwhite   NOX   SO2 Mortality
   37.37     10.97  3866.05    11.87 22.65 53.77    940.38

The sample mean is a measure of the location or center of a variable. If the distribution of the data is skewed or has outliers, the sample mean may not be a good measure of the center. The sample median defines the middle value of the data and it is essentially unaffected by extreme values. However, it is not useful for the standard multivariate techniques. A more flexible way to deal with outliers is to use a class of robust estimators, called M-estimators, which are developed in Section 2.4. These estimators naturally extend most of the multivariate analyses developed in this book.

2.2 Measures of Scale and Association

The sample covariance between two variables, Yj and Yk, is a measure of their association and is defined by:

$$\mathrm{cov}(Y_j, Y_k) = s_{jk} = \frac{\sum_i (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)}{n - 1}.$$

The sample covariance between Yj and Yk estimates the corresponding population covariance, denoted by σjk. The sample covariance matrix of Y′ = [Y1 Y2 · · · Yp] is given by:

$$S = \begin{bmatrix} s_1^2 & s_{12} & \cdots & s_{1p} \\ s_{21} & s_2^2 & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_p^2 \end{bmatrix} = [s_{jk}]$$

where sjj = s_j^2 and sjk = skj. The jth diagonal element of S, s_j^2 = var(Yj), is called the sample variance of Yj. The positive square root of the sample variance of Yj, sj, is called the sample standard deviation. The standard deviation is a measure of spread (or dispersion) of the variable Yj.

The direction of the relationship between Yj and Yk is determined by the sign of sjk, i.e.,

sjk > 0 ⇒ Yj and Yk are positively linearly related,
sjk = 0 ⇒ Yj and Yk are not linearly related,
sjk < 0 ⇒ Yj and Yk are negatively linearly related.

The covariance matrix for the airpoll variables is:

> round(airCov <- cov(airpoll), 1)

          Rainfall Education    Popden Nonwhite     NOX     SO2 Mortality
Rainfall      99.7      -4.1    -128.7     36.8  -225.4   -67.7     316.4
Education     -4.1       0.7    -290.9     -1.6     8.8   -12.6     -26.8
Popden      -128.7    -290.9 2144699.4   -166.5 11225.6 39581.7   23813.6
Nonwhite      36.8      -1.6    -166.5     79.6     7.6    90.1     357.2
NOX         -225.4       8.8   11225.6      7.6  2146.8  1202.4    -223.5
SO2          -67.7     -12.6   39581.7     90.1  1202.4  4018.4    1679.9
Mortality    316.4     -26.8   23813.6    357.2  -223.5  1679.9    3870.4


A shortcoming of the covariance is that it does not quantify the strength of the linear relationship between two variables. The covariance can be made arbitrarily large in absolute value by changing the units of measurement. The sample correlation coefficient overcomes this objection. It is defined by:

$$\mathrm{corr}(Y_j, Y_k) = r_{jk} = \frac{s_{jk}}{s_j s_k}$$

It can be shown that −1 ≤ rjk ≤ 1. This quantity is also known as the Pearson product-moment correlation coefficient.

The interpretation of rjk is similar to that of sjk except that rjk is bounded.A value of 0 implies no linear association. As rjk → 1 the strength of the positivelinear relationship increases and as rjk → −1 the strength of the negative linearrelationship increases.

The sample correlation matrix of Y′ = [Y1 Y2 · · · Yp] is defined by:

$$R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix} = [r_{jk}],$$

where rjk = rkj.

> round(cor(airpoll), 3)

          Rainfall Education Popden Nonwhite    NOX    SO2 Mortality
Rainfall     1.000    -0.490 -0.009    0.413 -0.487 -0.107     0.509
Education   -0.490     1.000 -0.235   -0.209  0.224 -0.234    -0.510
Popden      -0.009    -0.235  1.000   -0.013  0.165  0.426     0.261
Nonwhite     0.413    -0.209 -0.013    1.000  0.018  0.159     0.644
NOX         -0.487     0.224  0.165    0.018  1.000  0.409    -0.078
SO2         -0.107    -0.234  0.426    0.159  0.409  1.000     0.426
Mortality    0.509    -0.510  0.261    0.644 -0.078  0.426     1.000

The correlation matrix is the covariance matrix of the standardized variables, which are defined by:

$$\frac{Y_j - \bar{y}_j}{s_j}, \quad j = 1, 2, \cdots, p$$

A standardized variable has a sample mean of 0 and sample standard deviation of 1.

> airMean <- apply(airpoll, 2, mean)

> airSD <- sqrt(apply(airpoll, 2, var))

> round(airSD, 3)

Rainfall Education   Popden Nonwhite    NOX    SO2 Mortality
   9.985     0.845 1464.479    8.921 46.333 63.390    62.212


> airCenter <- sweep(airpoll, 2, airMean)/sqrt(n - 1)

> airStd <- as.matrix(sweep(airCenter, 2, airSD, FUN="/"))

> round(airStd[1:5,], 3)

         Rainfall Education Popden Nonwhite    NOX    SO2 Mortality
akronOH    -0.018     0.066 -0.055   -0.045 -0.021  0.011    -0.039
albanyNY   -0.031     0.004  0.037   -0.122 -0.036 -0.030     0.120
allenPA     0.086    -0.181  0.035   -0.162 -0.047 -0.043     0.046
atlantGA    0.126     0.020 -0.066    0.222 -0.041 -0.061     0.088
baltimMD    0.073    -0.212  0.229    0.183  0.043  0.313     0.273

> round((t(airStd) %*% airStd), 3)

          Rainfall Education Popden Nonwhite    NOX    SO2 Mortality
Rainfall     1.000    -0.490 -0.009    0.413 -0.487 -0.107     0.509
Education   -0.490     1.000 -0.235   -0.209  0.224 -0.234    -0.510
Popden      -0.009    -0.235  1.000   -0.013  0.165  0.426     0.261
Nonwhite     0.413    -0.209 -0.013    1.000  0.018  0.159     0.644
NOX         -0.487     0.224  0.165    0.018  1.000  0.409    -0.078
SO2         -0.107    -0.234  0.426    0.159  0.409  1.000     0.426
Mortality    0.509    -0.510  0.261    0.644 -0.078  0.426     1.000

The correlation matrix is easily defined in terms of the sample covariance matrix. Let $D_{1/s_j}$ be a diagonal matrix with the reciprocals of the standard deviations on the diagonal. Then:

$$R = D_{1/s_j}\, S\, D_{1/s_j}$$

defines the sample correlation matrix.

> D <- diag(1/sqrt(diag(airCov)))

> R <- D %*% airCov %*% D

> dimnames(R) <- list(names(airpoll), names(airpoll))

> round(R, 3)

          Rainfall Education Popden Nonwhite    NOX    SO2 Mortality
Rainfall     1.000    -0.490 -0.009    0.413 -0.487 -0.107     0.509
Education   -0.490     1.000 -0.235   -0.209  0.224 -0.234    -0.510
Popden      -0.009    -0.235  1.000   -0.013  0.165  0.426     0.261
Nonwhite     0.413    -0.209 -0.013    1.000  0.018  0.159     0.644
NOX         -0.487     0.224  0.165    0.018  1.000  0.409    -0.078
SO2         -0.107    -0.234  0.426    0.159  0.409  1.000     0.426
Mortality    0.509    -0.510  0.261    0.644 -0.078  0.426     1.000

2.3 Derived Variables

A central feature of multivariate analysis for continuous numerical variables is the determination of variables V1, V2, · · · , Vs in Y-space that satisfy an optimality criterion. These derived variables define a subspace V with dim(V) ≤ s.


Since Vj is a variable in Y-space, it is expressible as a linear combination of the basis variables, Y1, Y2, . . . , Yp. Specifically,

$$V_j = a_{j1}Y_1 + a_{j2}Y_2 + \cdots + a_{jp}Y_p = a_j'Y$$

If the sample mean vector and covariance matrix of Y are known, it is easy to determine the sample mean and variance of Vj. They are given by:

$$\bar{v}_j = a_j'\bar{y}; \qquad s_{v_j}^2 = a_j' S a_j.$$

Since the sample variance is always non-negative and aj is arbitrary, it is clear that S is a positive semi-definite matrix.

Likewise, the sample mean vector and sample covariance matrix of V′ = [V1 V2 · · · Vs] are given by:

$$\bar{v} = A'\bar{y}; \qquad S_V = A' S A,$$

where A is the p × s matrix of coefficients with the jth column given by aj. As a special case, the covariance between two derived variables is:

$$\mathrm{cov}(V_j, V_k) = a_j' S a_k.$$

> airSVD <- svd(airCenter)

> (airSVD$d^2)[1:3]

[1] 2145755.455 4789.199 3130.723

> A <- airSVD$v[, 1:3]

> dimnames(A) <- list(names(airpoll), c("V1", "V2", "V3"))

> round(A, 5)

                V1       V2       V3
Rainfall   0.00006  0.03088 -0.12065
Education  0.00014 -0.00420  0.00573
Popden    -0.99975 -0.02140 -0.00536
Nonwhite   0.00007  0.06773 -0.05471
NOX       -0.00524  0.16063  0.65637
SO2       -0.01849  0.68872  0.45584
Mortality -0.01113  0.70274 -0.58633

> round(t(A) %*% A, 2)

   V1 V2 V3
V1  1  0  0
V2  0  1  0
V3  0  0  1


> vMean <- t(A) %*% airMean

> colnames(vMean) <- c("vMean")

> vMean

        vMean
V1 -3876.6705
V2   620.7092
V3  -537.7980

> round(vCov <- t(A) %*% airCov %*% A, 4)

        V1       V2       V3
V1 2145755    0.000    0.000
V2       0 4789.199    0.000
V3       0    0.000 3130.723

2.4 Robust Measures

How do we judge an observation to be an outlier in multivariate space? We must assess the distance an observation lies from the "bulk" of the data by a suitably chosen metric. The general class of squared distance functions is useful. Let M be a positive semi-definite matrix. Then the squared distance between the ith observation yi and y# (some estimate of location) is defined by:

$$d_i^2 = (y_i - y^{\#})'\, M\, (y_i - y^{\#}).$$

M must be positive semi-definite to ensure that d_i^2 ≥ 0, a basic requirement of a distance function. Every choice of M has an associated ellipsoid formed by setting the squared distance to a constant. If M is the identity matrix, spherical or Euclidean squared distances result. M diagonal corresponds to squared distances in which the axes of the ellipsoid are parallel to the coordinate axes.

The squared distances above do not take the correlation among the variables into account. Choosing M = S^{-1} and y# = ȳ results in the classical Mahalanobis squared distances. The Mahalanobis distances will be used to compute robust M-estimators for µ and Σ by iteratively reweighting the mean vector and covariance matrix.

The d_i^2 are always nonnegative, which is necessary since the weight function must be defined over a nonnegative domain. The multivariate generalization of Huber's weight function is given by:

$$w(d) = \begin{cases} k/d & \text{if } d > k \\ 1 & \text{if } d \le k \end{cases}$$

for suitably chosen k. Now d^2 ∼ χ²_p, if the underlying data is normally distributed, where the degrees of freedom p is the dimension of Y-space. Therefore, a reasonable choice for k is:

$$P(d^2 < k^2) \approx 0.95.$$

The algorithm for computing the multivariate measures of location and dispersion is:

1. Start with ȳ and S where:

   $$\bar{y} = \frac{\sum y_i}{n}; \qquad S = \frac{\sum (y_i - \bar{y})(y_i - \bar{y})'}{n}.$$

   These are the maximum likelihood estimates of µ and Σ.

2. Compute the Mahalanobis distances given by:

   $$d_i^2 = (y_i - \bar{y})'\, S^{-1}\, (y_i - \bar{y}), \quad i = 1, 2, \cdots, n.$$

3. Compute the weights w(d_i), i = 1, 2, · · · , n.

4. Replace ȳ and S in step 2 by weighted estimates given by:

   $$\bar{y}^* = \frac{\sum w(d_i)\, y_i}{\sum w(d_i)}; \qquad S^* = \frac{\sum w^2(d_i)\,(y_i - \bar{y})(y_i - \bar{y})'}{\sum w^2(d_i)}.$$

5. Iterate steps 2, 3, and 4 using the estimates from step 4 until convergence is achieved.

The weights will be 1 or near 1 for well-behaved data. On the other hand, outliers are detected as those observations with low weights.
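The huber.estimates() function used below is part of the mult package accompanying this book. The following is only a minimal base-R sketch of the same iteration (not the package implementation), with k chosen so that P(d² < k²) ≈ 0.95 under the χ²_p reference distribution:

huberMV <- function(Y, tol = 1e-6, maxit = 100) {
  Y <- as.matrix(Y)
  n <- nrow(Y); p <- ncol(Y)
  k <- sqrt(qchisq(0.95, df = p))        # P(d^2 < k^2) is approximately 0.95
  m <- colMeans(Y)
  S <- cov(Y) * (n - 1) / n              # maximum likelihood estimate of Sigma
  for (iter in 1:maxit) {
    d <- sqrt(mahalanobis(Y, m, S))      # Mahalanobis distances (step 2)
    w <- ifelse(d > k, k / d, 1)         # Huber weights (step 3)
    m.new <- colSums(w * Y) / sum(w)     # weighted mean vector (step 4)
    R <- sweep(Y, 2, m)                  # residuals about the current mean
    S.new <- crossprod(w * R) / sum(w^2) # weighted covariance matrix (step 4)
    done <- max(abs(m.new - m)) < tol
    m <- m.new; S <- S.new
    if (done) break                      # step 5: iterate until convergence
  }
  list(mean = m, cov = S, weights = w)
}

A call such as huberMV(airpoll) should behave much like huber.estimates(airpoll) below, up to convergence details.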

> Mest <- huber.estimates(airpoll)

> round(Mest$huber.mean, 2)

     Rainfall Education  Popden Nonwhite   NOX   SO2 Mortality
[1,]    37.99     10.95 3811.26    11.96 16.29 52.19    942.53

> round(Mest$huber.cov, 1)

          Rainfall Education    Popden Nonwhite     NOX     SO2 Mortality
Rainfall      74.8      -3.1    -166.3     35.0   -33.0   -19.6     267.7
Education     -3.1       0.6    -199.8     -1.6    -0.4   -14.8     -24.7
Popden      -166.3    -199.8 1834125.0    321.7  7368.6 40559.8   28503.1
Nonwhite      35.0      -1.6     321.7     81.2    27.6   103.8     354.7
NOX          -33.0      -0.4    7368.6     27.6   278.9   742.4     275.9
SO2          -19.6     -14.8   40559.8    103.8   742.4  3848.1    1800.3
Mortality    267.7     -24.7   28503.1    354.7   275.9  1800.3    3697.1


> round(Mest$weights, 3)

 [1] 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.988
[13] 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
[25] 1.000 1.000 1.000 1.000 0.144 1.000 1.000 0.777 1.000 1.000 1.000 1.000
[37] 0.916 1.000 1.000 0.868 1.000 1.000 1.000 1.000 1.000 1.000 0.708 0.288
[49] 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.606 1.000

The above procedure is computer intensive. An alternative is to stop the iteration after a specified number of steps. One-step and two-step estimators are commonly used. The robustness properties of these latter estimators are generally good.

These types of M-estimators have two undesirable properties. First, the breakdown of an estimator, denoted by ε, is the proportion of outliers it can handle. The Huber M-estimator has a breakdown of ε ≤ 1/p. This generally is not a problem unless p is large relative to n. The second potential problem is that S* may become singular at some stage of the iteration. This can only happen with unusual patterns of outliers and when p is large relative to n.


Chapter 3

Graphical Techniques

Graphs are powerful tools for revealing the structures and patterns, or the idiosyncrasies, found in multivariate data. Some plots are designed to assess the underlying assumptions or the fit of a model, whereas others represent the structure of a multivariate dataset. This chapter examines both types of plots.

3.1 Assessing Distributional Assumptions

Most multivariate models assume an underlying normal distribution. This assumption will be assessed graphically in several ways. Although the objective of these techniques is to assess multivariate normality, the process will begin with univariate plots and will be extended naturally to the multivariate case.

3.1.1 Quantile Plots

Sample quantiles provide critical information about the distribution of the sample values of a variable. The qth sample quantile, denoted by yq, is a value along the measurement scale with a proportion q or less of the data less than yq and a proportion 1 − q or less of the data greater than yq, i.e., for the random variable Y, yq is any value satisfying:

$$\frac{\#\{y_i < y_q\}}{n} \le q \quad \text{and} \quad \frac{\#\{y_i > y_q\}}{n} \le 1 - q.$$

The value of yq may not be unique because a sample value may not satisfy the definition, e.g., the median when the number of observations is even. In this case, yq is determined by interpolating linearly between adjacent ordered values in the quantile plot described below.

Quantile plots consist of graphing the yq on the y-axis versus q on the x-axis. Typically, equally spaced values of q between 0 and 1 are chosen. Unless n is large, the values of q are chosen to be in one-to-one correspondence with the sample values.



Let y(1) ≤ y(2) ≤ · · · ≤ y(n) be the sample order statistics and:

$$q_i(\alpha) = \frac{i - \alpha}{n - 2\alpha + 1} \quad \text{for } 0 < \alpha < 1/2.$$

Then y(i) is a sample quantile corresponding to qi(α), for i = 1, 2, . . . , n. A quantile plot can be constructed by plotting qi(α) against y(i). Various choices of α are possible, but α = 1/2 has the desirable property of having symmetrical plotting positions about 0.5, from 1/2n to 1 − 1/2n. Choices such as i/n, which are typically used to plot the empirical cumulative distribution function, are not acceptable since theoretical quantiles are often computed, and generally they are not defined for q = 0 or 1. Since α = 1/2 will be used throughout this book, let qi = (i − 1/2)/n.
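A sample quantile plot can be drawn directly from this definition; a minimal sketch for the Mortality variable (assuming the airpoll data from the mult package has been loaded):

y <- sort(airpoll$Mortality)      # order statistics y_(1) <= ... <= y_(n)
n <- length(y)
q <- (1:n - 0.5) / n              # plotting positions q_i = (i - 1/2)/n
plot(q, y, xlab = "q", ylab = "Sample quantile",
     main = "Quantile plot of Mortality")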

Sample quantile plots are effective for assessing the distribution of a numerical variable. First, all of the data are displayed for the qi defined above. In addition, characteristics of the data can be inferred, including the existence of symmetry and outlying values. Finally, sample quantiles or proportions can be read directly off the graph.

Symmetry is easily assessed if a horizontal line is drawn through the median, y0.5, on the sample quantile plot. Let:

$$d(q_i) = |y_{0.5} - y_{q_i}|$$

The sample distribution is approximately symmetric if d(qi) ≈ d(1 − qi), i = 1, 2, . . . , [n/2], where [n/2] is the largest integer not greater than n/2. A distribution is positively skewed if, in general, d(1 − qi) − d(qi) > 0 and is monotonically increasing for qi, i = 1, 2, . . . , [n/2]. The same conditions with a reversed inequality indicate a negatively skewed distribution.

Positively skewed data is very common. It is almost guaranteed if the data is necessarily positive and ranges over several orders of magnitude. The problem is compounded for multivariate data since the direction of skewness in p-space must be determined. This will be discussed subsequently.

A large vertical jump from one sample quantile to the next indicates that the gap between consecutive ordered values is wide. This usually occurs in the tails of the distribution and indicates extreme or outlying values. Likewise, small jumps over many consecutive values indicate a range in which values have a high probability of occurring. Clumping or clustering is indicated if this range has large vertical jumps on its boundaries.

3.1.2 Probability Plots

A quantile plot is useful for assessing symmetry, but it does not confirm normality nor does it assess whether or not an extreme value is an outlier relative to the normal distribution. This section develops a graphical method for making these judgments.

A sample quantile-quantile plot is often used to compare two sample distributions measured on the same variable, i.e., it is a graphical method of displaying the two-sample problem. It is constructed by plotting the quantiles of one sample against those of the other, using the qi corresponding to the smaller sample size.

This approach can be used to assess the distributional assumptions underlying a sample. In this case, the quantiles of the sample are plotted against the corresponding quantiles of a theoretical distribution. This is called a probability quantile-quantile plot, or a probability plot for short. The most common case is the normal probability plot.

Denote the cumulative distribution function (cdf) of the random variable Y by F, where F is defined by:

$$F(y) = P(Y \le y) \quad \text{for } -\infty < y < \infty.$$

The cumulative distribution function is nondecreasing with a range of [0, 1]. The qth theoretical quantile is any value yq satisfying:

$$F(y_q) = q.$$

If Y is discrete, yq may not be unique. In this case, choose yq as the middle value in the range of values satisfying the definition.

Let y_{q_i} be the theoretical quantile corresponding to qi. If F is true, then:

$$y_{q_i} \approx y_{(i)} \quad \text{for } i = 1, 2, \ldots, n.$$

The probability plot is the graph of the pairs (y_{q_i}, y_{(i)}). The points should fall approximately along a straight line with intercept 0 and slope 1 if F is true. Points systematically deviating from this line provide evidence that the data is not consistent with F. "Slight" deviations from linearity will occur even if F is true due to sampling variability, but no patterns or large deviations should occur.

The development so far assumes that a plausible underlying F is completely specified. This is not likely to be the case. Generally F is a member of a parametric family. If F is determined by location and scale parameters, such as the normal, the construction of a probability plot is simplified.

Assume that $F(y) = G\!\left(\frac{y - \theta_1}{\theta_2}\right)$, i.e., F only depends on a location parameter θ1 and a scale parameter θ2. Then the qth quantile of F is given by:

$$y_q^F = \theta_1 + \theta_2\, y_q^G,$$

where y_q^G is the qth quantile of G. This follows since:

$$q = P(Y \le y_q^F) = F(y_q^F) = G\!\left(\frac{y_q^F - \theta_1}{\theta_2}\right) = G(y_q^G).$$

The cumulative distribution function G is called the standard or canonical distribution of the family of distributions F indexed by θ1 and θ2. G has a location parameter of 0 and a scale parameter of 1.


A probability plot based on the quantiles of F can be constructed without estimating θ1 or θ2. If Y ∼ F, then the points (y_{q_i}^G, y_{(i)}) will fall approximately along a line with an intercept of θ1 and a slope of θ2. The adequacy of F is based on linearity and is independent of the parameter values. If linearity holds, it is possible to estimate θ1 and θ2 by the intercept and slope of the least squares regression fit, but this procedure generally is not optimal.

The normal probability plot is used for assessing normality. In this case the location parameter is the mean µ and the scale parameter is the standard deviation σ. The canonical distribution function is called the standard normal, which has a mean of 0 and a standard deviation of 1.

The overall pattern of the points often falls into one of several classes, which define specific types of deviations from normality. In order to classify the pattern, it is convenient to fit a line through the points. It would seem reasonable to fit a linear least squares line, but this is inappropriate if normality does not hold. A better alternative is to draw a line through the lower and upper quartiles, i.e., (y_{0.25}^G, y_{0.25}) and (y_{0.75}^G, y_{0.75}), respectively. In order to provide a contrast to this robust (quartile) linear fit, a locally linear or quadratic regression (loess) curve can be fitted to the points. The loess curve reveals nonlinearities without the limitations of a parametric model.

Five patterns are commonly observed. First, approximate linearity supports the assumed normal distribution. The curve is convex, or concave upwards, for positively skewed data, and concave downwards for negatively skewed data. If the data is positively skewed, the values in the upper tail of the distribution are spread out relative to the corresponding normal quantiles, whereas the values in the lower tail are less spread out than the corresponding normal quantiles. An opposite result holds for negatively skewed data.

Suppose the points in the upper tail are above the line and those in the lower tail are below the line. This indicates that the values in both tails are more spread out than the corresponding normal quantiles. A distribution with this property is said to be thick-tailed or leptokurtic. Finally, consider the case in which the points in the upper tail are below the line and the points in the lower tail are above the line. This indicates that the values in the tails are too close relative to the corresponding normal quantiles. A distribution with this property is said to be short-tailed or platykurtic. The loess fit easily exposes these patterns if they are present.

Outliers can easily be identified from the normal quantile plot by drawing appropriate horizontal reference lines. The sample interquartile range is defined by:

$$\mathrm{iqr}(Y) = y_{0.75} - y_{0.25}.$$

The reference lines should be drawn at y_{0.25} − 1.5(iqr), y_{0.25}, y_{0.5}, y_{0.75}, and y_{0.75} + 1.5(iqr). Values above the upper reference line or below the lower reference line are classified as outliers relative to the normal distribution. Actually, it is only necessary to draw the outer reference lines, but the others are useful since they identify the quartiles. These reference lines essentially define the standard boxplot. It is sometimes useful to draw two more reference lines at y_{0.25} − 3(iqr) and y_{0.75} + 3(iqr) to distinguish between mild and extreme outliers.

> library(mult)

> data(airpoll)

> attach(airpoll)

> qqnorm(Mortality)

> qqline(Mortality, col=2)

[Figure: Normal Q-Q plot of Mortality — theoretical normal quantiles versus sample quantiles, with the quartile reference line.]
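The quartile and outlier reference lines described above can be added to this plot; a minimal sketch:

qs <- quantile(Mortality, c(0.25, 0.5, 0.75))   # quartiles (airpoll is attached above)
iqr <- qs[3] - qs[1]                            # interquartile range
abline(h = c(qs[1] - 1.5 * iqr, qs, qs[3] + 1.5 * iqr),
       lty = 2)                                 # boxplot-style reference lines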

> normalQQ(airpoll[,1:6], nr=2, nc=3)


[Figure: Normal Q-Q plots of Rainfall, Education, Popden, Nonwhite, NOX, and SO2 — normal quantiles on the horizontal axes, sample values on the vertical axes.]

3.1.3 Multivariate Residual Plots

Let ȳ′ = [ȳ1 ȳ2 · · · ȳp] and S be the sample mean vector and covariance matrix, respectively. The scaled multivariate residuals are defined by:

$$r_i = S^{-1/2}(y_i - \bar{y}),$$

where S^{-1/2} is the inverse of the symmetric square root of S. Then the

$$d_i^2 = r_i' r_i = (y_i - \bar{y})'\, S^{-1}\, (y_i - \bar{y})$$

are called the squared Mahalanobis distances. Under multivariate normality, the d_i^2 are distributed as a multiple of a beta. However, for n ≥ 30 the d_i^2 approximately follow a chi-square distribution with p degrees of freedom. The chi-square with p degrees of freedom is a special case of the gamma with a shape parameter of p/2 and a scale parameter of 2. Thus, multivariate normality can be assessed indirectly by a gamma probability plot. Fortunately, it is not necessary to estimate the parameters of the gamma to assess normality, since p is known.

> gammaQQ(airpoll)


[Figure: Gamma probability plot of the squared Mahalanobis distances for the airpoll data — gamma quantiles versus squared distances.]
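gammaQQ() is supplied by the mult package. A base-R sketch of the same display, using mahalanobis() for the squared distances and gamma quantiles with shape p/2 and scale 2:

Y <- as.matrix(airpoll)
n <- nrow(Y); p <- ncol(Y)
d2 <- mahalanobis(Y, colMeans(Y), cov(Y))                  # squared Mahalanobis distances
qg <- qgamma((1:n - 0.5) / n, shape = p / 2, scale = 2)    # theoretical gamma quantiles
plot(qg, sort(d2), xlab = "Gamma Quantiles",
     ylab = "Squared Mahalanobis Distances")
abline(0, 1, col = 2)                                      # intercept 0, slope 1 reference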

The ri can be viewed as vectors in individual space. As such, not only their length but also their orientation relative to any axis can be determined. Choose an axis; for example, [1 0 · · · 0] corresponds to Y1. Let θi be the angle between ri and [1 0 · · · 0]. Then θi is uniformly distributed on (0, 2π), or θ*_i = θi/(2π) is uniformly distributed on [0, 1], if normality holds. It is best to consider the d_i^2 and θ*_i jointly. First, let

$$q_i = P(\chi_p^2 \le d_i^2).$$

This is the probability integral transformation and the qi are distributed uniformly on [0, 1] if the d_i^2 have an underlying chi-square distribution. Then the (qi, θ*_i) should fall randomly on the unit square if the underlying distribution of the Yi is normal.

By examining the pattern of points in the unit square, it is possible to determine the regions of individual space where heavy and light concentrations of points fall relative to the selected reference axes.

3.1.4 Directional Normality

The preceding discussion allows the researcher to examine directional normality from the plots of (qi, θ*_i) relative to various orientations. However, it is not always easy to find the direction of nonnormality from this plot.


An alternative is to define:

$$d_\alpha = \frac{\sum w_i r_i}{\left\| \sum w_i r_i \right\|},$$

where w_i = ‖r_i‖^α. Notice that d_1 points to the "large" r_i, whereas d_{-1} points to the "small" r_i. Generally, any α > 0 results in an orientation pointing to the r_i far from 0.

Let d*_α = S^{1/2} d_α. The d*_α are the orientations in the original Y-space. As an example, let d*_α = [1/2 1/2 0 · · · 0]. This would say that positive skewness or outliers exist along (1/2)Y1 + (1/2)Y2. Corrective action, e.g., by transforming (1/2)Y1 + (1/2)Y2, can then be taken.
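A minimal sketch of this calculation for the airpoll data, with the symmetric square root of S taken from its eigendecomposition (α = 1 shown):

Y <- as.matrix(airpoll)
S <- cov(Y)
eS <- eigen(S, symmetric = TRUE)
S.half    <- eS$vectors %*% diag(sqrt(eS$values))     %*% t(eS$vectors)  # S^(1/2)
S.invhalf <- eS$vectors %*% diag(1 / sqrt(eS$values)) %*% t(eS$vectors)  # S^(-1/2)
R <- sweep(Y, 2, colMeans(Y)) %*% S.invhalf   # scaled residuals r_i as rows
alpha <- 1
w <- sqrt(rowSums(R^2))^alpha                 # w_i = ||r_i||^alpha
v <- colSums(w * R)                           # sum of w_i r_i
d.alpha <- v / sqrt(sum(v^2))                 # d_alpha, a unit direction
d.star <- S.half %*% d.alpha                  # orientation in the original Y-space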

3.2 Visualizing Multivariate Data

The problem of multivariate data visualization is to find effective methods for displaying p-dimensional data (individual) space on a 2-dimensional computer screen. The resulting views can be static or they can dynamically change over time.

The underlying graphical model is based on projecting p-dimensional data space into a d-dimensional subspace or plane. Although the methodology works equally well on a 3-dimensional holographic or stereoscopic display device, a 2-dimensional renderer (usually a computer screen) is assumed in this book. Special techniques are required to visualize projections of dimension d > 2. Although the emphasis is on projecting data points, it is often useful to construct functionals on the d-dimensional projected data values, e.g., multivariate density estimates, or to project functionals defined on the original p-dimensional space.

The projection plane P is defined by a basis called a d-frame. Specifically, denote the basis vectors by f1, f2, · · · , fd and the p × d matrix of basis vectors by F. Although not required, the fj are usually constructed to be orthonormal. The projection plane is defined by all linear combinations of the basis vectors and is denoted by span(F). If the fj are orthonormal, the projection matrix is FF′, i.e., FF′y projects y into span(F). FF′ is a p × p symmetric idempotent matrix of rank d. The projection of y along fj is given by (f_j′ y).

Let e_j′ = [0 0 · · · 1 · · · 0], where the 1 is in the jth position. The vectors e1, e2, · · · , ep span the full p-dimensional space and define the canonical basis. All traditional plots are defined in terms of the canonical basis, e.g., a 2-D scatterplot. For example, the projection of y onto span[e1 e2] is [y1 y2 0 · · · 0]′. The "values" of Vj (a_j′ y_i, i = 1, 2, · · · , n) correspond to the projections f_j′ y_i (i = 1, 2, · · · , n) along f_j.

The original variables are in 1-1 correspondence with the canonical axes, e.g., Yj corresponds to ej. Likewise, Vj = a_j′ Y corresponds to f_j = [e1 e2 · · · ep] a_j. Thus, each frame basis vector is a linear combination of the canonical basis vectors and corresponds to the same linear combination of the original variables.
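A minimal sketch of projecting the (centered) airpoll data onto an orthonormal 2-frame; the frame here is obtained by orthonormalizing two random direction vectors with qr(), so it is purely illustrative:

Y <- scale(as.matrix(airpoll), scale = FALSE)   # center the data
p <- ncol(Y)
set.seed(1)
F <- qr.Q(qr(matrix(rnorm(p * 2), p, 2)))       # orthonormal p x 2 frame (f1, f2)
P <- F %*% t(F)                                 # projection matrix FF', rank 2
proj <- Y %*% F                                 # coordinates of each y_i in the frame
plot(proj, xlab = "f1", ylab = "f2")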

The next four subsections discuss static plots in d = 1, 2, 3, and > 3 dimensions. Specifically, the projection plots will be defined in terms of the axes defined by the d-frame. Finally, a dynamic graphical model is introduced by projecting into a sequence of d-frames over time.

3.2.1 1-D Plots

The 1-D projection of the p-dimensional point cloud onto f1 results in points distributed along the line defined by span(f1). Of course, the most common situation is to project onto a canonical axis. The distribution of values along an axis is of little use since the points often overlap. The second screen dimension can be used to learn more about the distribution of values of V1. For example, if the points are stacked, we get a dot plot. More generally, the points can be randomly jittered in this second dimension to produce a jitter or a textured dot plot.

A more informative plot is obtained by using the second dimension to plot an estimated density function. It is certainly possible to construct a parametric density estimate, e.g., a normal density estimate, but nonparametric smoothers usually give a better representation.

The density histogram is the simplest smoother. The range of the data is divided into k nonoverlapping intervals (bins). The number of points in each bin is counted (e.g., freq_j is the frequency for the jth bin). If the bins are of equal width, the bin origin y0 (the leftmost boundary) and the bin width h, along with the frequencies, completely determine the histogram. The density estimate, f, is a step function. The density estimate in the jth bin, f_j, has height freq_j/(nh) and extends for a fixed width of h from y_{j−1} to y_j. These values ensure that the total area of the density histogram is 1. Sturges' rule suggests that the number of bins k should be approximately 1 + log2 n. The bin width h, or equivalently the number of bins, determines the smoothness of the density estimate. The viewer should be able to dynamically change the degree of smoothness.
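
This construction can be sketched in a few lines of base R (an illustration added here, not part of the original text), where y stands in for any numeric sample; hist() with freq = FALSE plots the density scale freq_j/(nh), and its default breaks = "Sturges" implements Sturges' rule.

> # Minimal sketch: density histogram with Sturges' rule (y is any numeric sample)
> y <- rnorm(100)
> nclass.Sturges(y)                            # roughly 1 + log2(n) bins
> hist(y, breaks = "Sturges", freq = FALSE)    # bar heights are freq_j/(n h)
> lines(density(y))                            # kernel smoother overlaid for comparison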

Discussion of kernel density estimates.

3.2.2 2-D Plots

The projection of the p-dimensional point cloud onto span(f1, f2) results in a 2-D scatterplot. The basis vectors f1 and f2 are assigned to the horizontal and vertical screen dimensions, respectively. The usual scatterplot is defined in terms of any pair of canonical axes. The scatterplot is used to estimate the strength and nature of the relationships between the “variables” defined by the 2-frame.

Discussion of 2-D kernel density estimates.

3.2.3 3-D Plots

The projection of the p-dimensional point cloud onto span(f1, f2, f3) results in a 3-D scatterplot. The basis vectors f1, f2, and f3 are assigned to the horizontal, vertical, and depth directions, respectively. Depth can be “seen” by stereo views, depth cues, or 3-D rotations.


3.2.4 (d > 3)-D Plots

The projection of the p-dimensional point cloud onto span(f1, f2, · · · , fd) results in a d-dimensional “scatterplot.” For d > 3, it is not possible to view the projection directly.

One possibility is to create a d × d matrix of 2-D scatterplots. Each plot is a panel in the overall display. All plots in a row have the same vertical axis and all plots in a column have the same horizontal axis.

The horizontal and vertical axes of the plot in the ith row (from the bottom) and the jth column (from the left) are assigned to fi and fj, respectively. The same plot also appears in the jth row and the ith column with the scales reversed. This redundancy allows the plots to be linked visually as the viewer scans by rows or by columns. If only the lower or upper triangular portion is plotted, it would be necessary to turn the corner and reverse the scales at the diagonal to follow the sequence of plots for a given variable versus all the others.

A full p-D plot is possible. In this case, the full data space is spanned by (f1, f2, · · · , fp). The standard scatterplot matrix uses the canonical basis.
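
As a minimal illustration (added here, not from the original text), the standard scatterplot matrix in the canonical basis can be produced in base R with pairs(); the data frame below is simulated purely for the sketch.

> # Sketch: a full p-D scatterplot matrix in the canonical basis
> Y <- data.frame(matrix(rnorm(60 * 4), ncol = 4))
> names(Y) <- paste0("Y", 1:4)
> pairs(Y)    # each panel is the 2-D projection onto a pair of canonical axes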

Brushing is a direct graphical manipulation tool which is well-suited to scatterplot matrices. A brush is a user-chosen rectangle which can be moved over and around any panel in the scatterplot matrix. All points within the rectangle are highlighted, and possibly labeled, along with the corresponding points in all other panels. The brush is transient if only the points within the rectangle are highlighted, whereas it is lasting if all points currently or previously brushed are highlighted.

All panels in the scatterplot matrix are automatically linked, i.e., a point highlighted by the brush is highlighted in all other panels. Linking can be used to examine relationships conditionally. Consider making a thin rectangular brush with its height equal to the height of a panel. Then position the brush on a panel with the conditioning variable on the x-axis. As the transient brush is moved left or right, the panels involving the other variables display their relationships conditionally and dynamically in terms of the highlighted points corresponding to the conditional band.

Discussion of parallel coordinates plots.

3.2.5 Dynamic Projection Plots

Statistical software packages increasingly have “spin” plots as part of their graphical offerings. The basic idea is to extend the standard 2-D scatterplot to three dimensions. Motion graphics is used to give “depth” to the third dimension. In this case, the 3-frame spans the entire 3-D space. Motion is obtained by orthogonal rotations of the frame over time, i.e., FV(t), where V(t) is the d × d orthonormal matrix at time t.

This approach is inherently limited, however, since only three variables can be viewed simultaneously. Although another variable can be added to the spin plot (by removing a corresponding variable), only the 3-dimensional subspace spanned by the included variables can be displayed at a time. This greatly inhibits the ability of an analyst to explore p-dimensional individual space. It is never possible to examine arbitrary linear combinations involving all p variables.

Furthermore, even a relatively small p generates a large number of 3-dimensional subspaces defined by the standard coordinate axes. For example, if p = 5, 10 such subspaces (the number of ways of choosing 3 of the 5 coordinate axes) must be explored. Even then, the 3-dimensional individual spaces can only be examined along certain coordinates.

A more general approach is to project observations dynamically from p-dimensional individual space to a moving d-dimensional subspace. Projections are made orthogonally onto a sequence of planes, Pa, Pb, · · ·, which are guided by the user. This strategy is implemented in XGobi and ggobi.

The sequence of planes is determined by interpolating between successively chosen target planes. Consecutive interpolated planes, between the target planes, must be “close” to ensure that the plot moves smoothly and that individual points can be followed. Geodesic interpolation, described below, provides the desired smoothness properties. The target planes are randomly selected from individual space subject to constraints imposed by the user. The following discussion is given for 2-D projections, but the ideas hold in general.

The geodesic interpolation between two planes, Pa and Pb, is determined by their principal vectors. First, find the unit vectors f_{a,1} ∈ Pa and f_{b,1} ∈ Pb which have the smallest angle between Pa and Pb. Then find the unit vectors f_{a,2} ∈ Pa and f_{b,2} ∈ Pb such that (f_{a,1}, f_{a,2}) and (f_{b,1}, f_{b,2}) are orthonormal bases for Pa and Pb. These vectors are the principal vectors for Pa and Pb, and the angles α, between f_{a,1} and f_{b,1}, and β, between f_{a,2} and f_{b,2}, are the principal angles.

Geodesic interpolation between Pa and Pb is determined by a sequence of orthonormal vectors (f1(t), f2(t)), where f1(t) moves from f_{a,1} to f_{b,1} along a great circle and similarly f2(t) moves from f_{a,2} to f_{b,2}. These vectors arrive at their target simultaneously, but do not move at the same constant speed if α and β are unequal.
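
The principal vectors and angles of two frames can be computed from the singular value decomposition of F′_a F_b. The following sketch (added here; it is not the XGobi/ggobi implementation) illustrates the computation for two random 2-frames.

> # Sketch: principal vectors and angles between two 2-frames Fa and Fb (p x 2, orthonormal columns)
> p <- 5
> Fa <- qr.Q(qr(matrix(rnorm(p * 2), p, 2)))
> Fb <- qr.Q(qr(matrix(rnorm(p * 2), p, 2)))
> sv <- svd(t(Fa) %*% Fb)        # singular values are the cosines of the principal angles
> acos(pmin(1, sv$d))            # the principal angles alpha and beta
> Fa.star <- Fa %*% sv$u         # principal vectors in span(Fa)
> Fb.star <- Fb %*% sv$v         # corresponding principal vectors in span(Fb)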

Let (fx, fy) be an orthonormal basis determining the next target plane, where the directions of fx and fy correspond to the horizontal and vertical plotting axes. It would not be reasonable for the user to enter the 2p numbers representing these vectors with respect to the canonical basis vectors. Instead, the user guides the selection of the target planes by imposing constraints on how the planes are chosen.

Variables will be classified as: active (A), orthogonal (O), horizontal (X), or vertical (Y). If a variable Yj is active, it can have a nonzero component in the jth position in both fx and fy. If Yj is a horizontal variable it can only have a nonzero component in the jth position of fx, whereas if it is vertical, it can only have a nonzero component in the jth position of fy. If Yj is orthogonal, or inactive, both components in the jth position are zero.

A set of mixed A, X, and Y variables is difficult to interpret and thus is forbidden. If the non-orthogonal variables are all X, or all Y, a single linear combination results, which is also forbidden in a two-dimensional plot.


Chapter 4

Correlation Analysis

The fundamental form of correlation concerns the pairwise correlations among Y1, Y2, . . . , Yp, i.e., the calculation of the sample correlation matrix R. This is called simple correlation analysis and has been discussed in Section ??. A more detailed examination of the relationships (or structure) in the Y-space is done by principal component analysis (see Chapter ??).

This chapter will focus on the relationships between one or more explanatory variables and one or more outcome variables. Specifically, multiple correlation analysis examines the relationships between X1, X2, . . . , Xq and a single Y, which are jointly distributed numerical variables. The researcher is interested in understanding Y in terms of X1, X2, . . . , Xq. Correlation analyses are appropriate when the relationships are of principal interest and the Xi are random variables. In contrast, regression models are appropriate when prediction is the principal interest and the Xi are mathematical (fixed) variables. The term regression has been used in the literature to refer to both the fixed and random cases. However, important differences exist between these two approaches, particularly in regards to the interpretation of results. The correlation problem can be viewed in the regression context, if Y is interpreted in terms of its conditional distribution given the Xi.

A more general formulation is to examine the relationships between X1, X2, . . . , Xq and Y1, Y2, . . . , Yp, i.e., the relationships between the variables in the X-space and those in the Y-space. The analytical technique for exploring these relationships is called canonical correlation analysis. The relationship between a single Y and X1, X2, . . . , Xq is a special case.

Sometimes it is important to examine the relationships between Y (or Y1, Y2, . . . , Yp) and X1, X2, . . . , Xq after linearly adjusting for Z1, Z2, . . . , Zr. The Z's are called conditioning variables since inferences are based on the conditional distribution of X1, X2, . . . , Xq, Y1, Y2, . . . , Yp given Z1, Z2, . . . , Zr. The resulting analyses are just standard multiple (or canonical) correlation analyses once Z1, Z2, . . . , Zr have been “partialed” out of the X and Y spaces. These analyses are termed partial multiple (or canonical) correlation analyses.

The correlation analyses discussed in this chapter are based on the assumption that [X1, X2, . . . , Xq, Y1, Y2, . . . , Yp] is a random vector with a multivariate normal distribution. Multivariate normal distribution theory is developed in Appendix B. Multiple and partial correlation theory depend on the covariance matrix of the conditional distribution of Y (or Y1, Y2, . . . , Yp) given X1, X2, . . . , Xq. Canonical correlation theory is developed from the joint distribution of X1, X2, . . . , Xq and Y1, Y2, . . . , Yp.

Normality is not a requirement for correlation models, since the development can be made in terms of certain first- and second-order assumptions concerning the conditional mean and covariance structure. However, normality leads to an unambiguous interpretation and permits standard inferential methods to be used. The assumption of normality often is not met; in this case, transformations or robust analyses may be useful.

4.1 Multiple Correlation Analysis

The researcher is interested in two principal types of relationships between a single Y and X1, X2, . . . , Xq:

• the relationships between Y and Xi adjusted for the other Xj ;

• the relationship between Y and the Xi considered jointly.

Partial correlation is identified with the first concern (Section 4.1.1); multiple correlation with the second (Section 4.1.2).

4.1.1 Partial Correlation

The partial correlation between Y and Xi, adjusted for the other Xj, is defined as the correlation between the residual variables computed by projecting both Y and Xi onto the space spanned by X1, . . . , Xi−1, Xi+1, . . . , Xq. This is equivalent statistically to computing a regression of both Y and Xi on X1, . . . , Xi−1, Xi+1, . . . , Xq and correlating the resulting residuals. The sample partial correlation between Y and Xi, after adjusting linearly for the other Xj, is denoted by r_{Y Xi·X1,...,Xi−1,Xi+1,...,Xq}.

The relationship between Y and Xi is represented better by their partial correlation, after adjusting for the other Xj, than by their simple correlation. Generally, r_{Y Xi} ≠ r_{Y Xi·X1,...,Xi−1,Xi+1,...,Xq}. In fact, equality holds only if the correlations between Y and the other Xj are zero and the correlations between Xi and the other Xj are also zero.

Partial correlations can be built up from simpler partial correlations. The following recursive formula gives the first-order partial correlations in terms of the simple correlations:

r_{Y Xi·Xj} = (r_{Y Xi} − r_{Y Xj} r_{Xi Xj}) / (√(1 − r²_{Y Xj}) √(1 − r²_{Xi Xj})).   (4.1)

This formula can be generalized to find the lth-order partial correlations in terms of the (l − 1)th-order correlations. These relationships may aid in interpreting regression coefficients (see Equation 4.2) as variables are added or removed from the model. For example, it is possible for r_{Y Xi} to be large, but for r_{Y Xi·Xj} to be small. The correlations on the right-hand side of Equation 4.1 are not arbitrary numbers between −1 and 1, since they must come from a positive-definite sample correlation matrix.
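
Equation 4.1 can be checked numerically. The following sketch (added here, not from the original text) simulates a trivariate sample and compares the recursion with the residual-based definition of the partial correlation.

> # Sketch: check Equation 4.1 against the residual-based definition
> set.seed(1)
> Xj <- rnorm(100); Xi <- Xj + rnorm(100); Y <- Xi + Xj + rnorm(100)
> r <- cor(cbind(Y, Xi, Xj))
> (r["Y", "Xi"] - r["Y", "Xj"] * r["Xi", "Xj"]) /
+   sqrt((1 - r["Y", "Xj"]^2) * (1 - r["Xi", "Xj"]^2))
> cor(resid(lm(Y ~ Xj)), resid(lm(Xi ~ Xj)))   # agrees with the recursion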

> library(mult)

> data(airpoll)

> attach(airpoll)

> X <- airpoll[,1:4]

> Y <- airpoll[,5:7]

The simple correlations among SO2, NOX, and Mortality are given by:

> cor(Y)

                 NOX       SO2   Mortality
NOX       1.00000000 0.4093936 -0.07752073
SO2       0.40939361 1.0000000  0.42598462
Mortality -0.07752073 0.4259846  1.00000000

The partial correlations can be found using the par.corr function in mult. The correlations among SO2, NOX, and Mortality are adjusted linearly by Rainfall, Education, Popden, and Nonwhite.

> airpoll.pcor <- par.corr(X, Y)

> airpoll.pcor$partial.corr

                 NOX       SO2  Mortality
NOX       1.00000000 0.3470836 0.03683587
SO2       0.34708359 1.0000000 0.39723802
Mortality 0.03683587 0.3972380 1.00000000

The partial correlations are similar to the simple correlations. However, the sign of the correlation between NOX and Mortality changes.

The correlation model is expressed in terms of the conditional mean of Y given X1, X2, . . . , Xq (see Appendix B). The regression coefficient for Xi is proportional to the partial correlation between Y and Xi. For example,

b1 = s_{Y X1·X2...Xq} / s²_{X1·X2...Xq}
   = r_{Y X1·X2...Xq} s_{Y·X2...Xq} s_{X1·X2...Xq} / s²_{X1·X2...Xq}
   = r_{Y X1·X2...Xq} (s_{Y·X2...Xq} / s_{X1·X2...Xq})   (4.2)

is the regression coefficient for X1. The other regression coefficients are defined similarly. Thus it is seen that the regression coefficients are directly proportional to their sample partial correlations. The standardized partial regression coefficient for X1 is defined by:

b_{1s} = r_{Y X1·X2...Xq},

i.e., it is just the partial correlation between Y and X1.

For the airpoll data, the partial regression coefficients can be computed from the partial variances and covariances. For example, the following computes the partial regression coefficient for NOX adjusted for Rainfall, Education, Popden, and Nonwhite.

> library(xtable)

> airpoll.lm <- lm(Mortality ~ NOX + Rainfall + Education + Popden + Nonwhite)

> airpoll.sum <- summary(airpoll.lm)

> xtable(airpoll.sum)

              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1045.0740   100.7325   10.37   0.0000
NOX             0.0375     0.1383    0.27   0.7875
Rainfall        1.1029     0.7540    1.46   0.1493
Education     -20.2549     7.4486   -2.72   0.0088
Popden          0.0085     0.0038    2.23   0.0302
Nonwhite        3.5913     0.6768    5.31   0.0000

Using the above formula, the partial regression coefficient can be computed from the partial covariance matrix.

> airpoll.pcor$partial.cov

                 NOX       SO2  Mortality
NOX       1452.40486  705.5700   54.42361
SO2        705.57000 2845.2758  821.45832
Mortality   54.42361  821.4583 1502.95189

> airpoll.pcor$partial.cov[1,3]/airpoll.pcor$partial.cov[1,1]

[1] 0.03747137

Partial correlations can be found from the negative inverse of R scaled to unit diagonal. Thus, the standardized regression coefficients can be computed by:

> R <- cor(cbind(Mortality, NOX, Rainfall, Education, Popden, Nonwhite))

> Rinv <- solve(R)

> D <- diag(1/sqrt(diag(Rinv)))


> StdBeta <- (D %*% (-1 * Rinv) %*% D)[1, 2:6]

> names(StdBeta) <- c("NOX", "Rainfall", "Education", "Popden", "Nonwhite")

> StdBeta

       NOX   Rainfall   Education      Popden    Nonwhite
0.03683587 0.19521992 -0.34704931  0.28998937  0.58541798

The standardized regression coefficients can be found more easily by the mult.corr function in mult.

> X <- cbind(NOX, Rainfall, Education, Popden, Nonwhite)

> y <- Mortality

> airpoll.mcorr <- mult.corr(X, y)

> names(airpoll.mcorr$partial.corr) <- c("NOX", "Rainfall", "Education", "Popden", "Nonwhite")

> airpoll.mcorr$partial.corr

       NOX   Rainfall   Education      Popden    Nonwhite
0.03683587 0.19521992 -0.34704931  0.28998937  0.58541798

Inferences concerning the linear relationship between Y and Xi, adjusted for the other Xj, reduce to inferences concerning their partial correlation. For example, consider testing that no linear relationship exists between Y and Xi. This translates to the following hypothesis:

H0 : ρ_{Y Xi·X1...Xi−1Xi+1...Xq} = 0,

where ρ_{Y Xi·X1...Xi−1Xi+1...Xq} is the population partial correlation between Y and Xi adjusted for the other Xj. This can be tested by the following t-test:

t = √(n − q − 1) |r_{Y Xi·X1...Xi−1Xi+1...Xq}| / √(1 − r²_{Y Xi·X1...Xi−1Xi+1...Xq}),

where r_{Y Xi·X1...Xi−1Xi+1...Xq} is the sample partial correlation between Y and Xi after adjusting for the other Xj. The two-sided rejection region is given by |t| > t_{α/2,n−q−1}. Simple linear regression is the special case q = 1; then r_{Y X} is the simple correlation, and n − q − 1 = n − 2.

The p-values for testing the partial correlations in the regression problem are:

> names(airpoll.mcorr$p.partial) <- c("NOX", "Rainfall", "Education", "Popden", "Nonwhite")

> airpoll.mcorr$p.partial

         NOX     Rainfall    Education       Popden     Nonwhite
7.875229e-01 1.493428e-01 8.780076e-03 3.016193e-02 2.146943e-06

These tests are equivalent to the standard output from statistical regression packages. The similarity stops here, however. The power of these tests (based on the alternative distribution), tests of partial correlations other than a null value of 0, and diagnostics are all different. For example, the test of

H0 : ρ_{Y Xi·X1...Xi−1Xi+1...Xq} = ρ0

is not based on the t-test. Fisher's z is commonly used. Let:

z = (1/2) log[(1 + r_{Y Xi·X1...Xi−1Xi+1...Xq}) / (1 − r_{Y Xi·X1...Xi−1Xi+1...Xq})]

and

ζ0 = (1/2) log[(1 + ρ0) / (1 − ρ0)].

Then √(n − q − 1)(z − ζ0) is distributed approximately N(0, 1). The rejection region is given by |√(n − q − 1)(z − ζ0)| > z_{α/2}.
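
A minimal sketch of this test (added here; it is not a function in the mult package) follows. It uses the √(n − q − 1) scaling stated above, and the argument values in the example call are purely illustrative.

> # Sketch: Fisher's z test of H0: rho = rho0 for a (partial) correlation
> fisher.z.test <- function(r, n, q, rho0) {
+   z    <- 0.5 * log((1 + r) / (1 - r))
+   zeta <- 0.5 * log((1 + rho0) / (1 - rho0))
+   stat <- sqrt(n - q - 1) * (z - zeta)
+   c(statistic = stat, p.value = 2 * pnorm(-abs(stat)))
+ }
> fisher.z.test(r = 0.40, n = 60, q = 5, rho0 = 0.25)   # illustrative values only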

4.1.2 Multiple Correlation

The next topic concerns the overall relationship between Y and X1, X2, . . . , Xq. Consider finding the variable U in X-space such that

corr²(U, Y) = cov²(U, Y) / [var(U) var(Y)]

is maximized, where U = b′X. Since cov(U, Y) = b′s_{XY} and var(U) = b′S_X b:

corr²(U, Y) = b′s_{XY} s′_{XY} b / (b′S_X b s²_Y) = (1/s²_Y) b′s_{XY} s′_{XY} b / (b′S_X b).   (4.3)

Maximization of Equation 4.3 is found by solving the generalized eigenvalue problem:

(1/s²_Y) s_{XY} s′_{XY} b = l S_X b.

This can be solved explicitly by:

b = S_X^{−1} s_{XY},

i.e., the sample regression coefficient vector. Since s_{XY} s′_{XY} is of rank 1, this is the eigenvector corresponding to the single (and thus the largest) positive eigenvalue. All other eigenvalues are zero.

The eigenvalue, corresponding to the eigenvector b, is equal to corr²(U, Y). Therefore,

corr²(U, Y) = (1/s²_Y) (s′_{XY} S_X^{−1} s_{XY})(s′_{XY} S_X^{−1} s_{XY}) / (s′_{XY} S_X^{−1} S_X S_X^{−1} s_{XY}) = s′_{XY} S_X^{−1} s_{XY} / s²_Y = r²_{Y·X}.   (4.4)

The positive square root of Equation 4.4 is called the sample multiple correlation, and it is denoted by r_{Y·X}. Since it is a correlation and non-negative, 0 ≤ r_{Y·X} ≤ 1. The estimate of ρ_{Y·X}, the population multiple correlation, is given by

r_{Y·X} = √(s′_{XY} S_X^{−1} s_{XY}) / s_Y.

The sample conditional variance of Y given the Xi, assuming a multivariate normal distribution, is

s²_Y − s′_{XY} S_X^{−1} s_{XY}.

Thus, the ratio

(s²_Y − s′_{XY} S_X^{−1} s_{XY}) / s²_Y

is the proportion of the sample variance “unexplained” by X. Therefore, r²_{Y·X} can be interpreted in terms of the proportion of the variation “explained” by the regression since

1 − (s²_Y − s′_{XY} S_X^{−1} s_{XY}) / s²_Y = s′_{XY} S_X^{−1} s_{XY} / s²_Y = r²_{Y·X}.

Also notice that

s²_{Y·X} = s²_Y − s′_{XY} S_X^{−1} s_{XY} = (1 − r²_{Y·X}) s²_Y.

Now 0 ≤ r²_{Y·X} ≤ 1, and if r²_{Y·X} = 1, then s_{Y·X} = 0, i.e., Y is in the X-space.

An important global hypothesis is

H0 : ρ_{Y·X} = 0,

or equivalently

H0 : β = 0 (σ_{XY} = 0),

i.e., testing ρ_{Y·X} = 0 is equivalent to testing that the regression coefficients are all zero. This is tested by

F = [(n − q − 1)/q] r²_{Y·X} / (1 − r²_{Y·X}) ∼ F_{q,n−q−1}.

The rejection region is given by F ≥ F_{α;q,n−q−1}. This is equivalent to the classical regression test for fixed Xi. However, the distribution under the alternative is not the noncentral F.

For the regression problem above, the estimated multiple correlation and P-value are:

> airpoll.mcorr$mult.corr

[1] 0.7824359


> airpoll.mcorr$p.mult

[1] 4.365887e-10

Compare this to the square root of:

> airpoll.sum$r.squared

[1] 0.612206

from the lm function above.
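
As a check (added here, not part of the original text), the multiple correlation can also be computed directly from Equation 4.4 using the sample covariance pieces; X and y are the objects defined above, and the result should agree with airpoll.mcorr$mult.corr.

> # Sketch: the multiple correlation computed directly from Equation 4.4
> SX  <- var(X)                   # X and y as defined above
> sXY <- cov(X, y)
> sqrt(t(sXY) %*% solve(SX) %*% sXY / var(y))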

4.2 Canonical Correlation Analysis

Canonical correlation analysis is concerned with the relationships between X1, X2, . . . , Xq and Y1, Y2, . . . , Yp. Consider finding the variable U in X-space and V in Y-space such that

corr²(U, V) = cov²(U, V) / [var(U) var(V)]

is maximized, where U = b′X and V = a′Y. The sample covariance matrix is given by:

S = [ S_X     S_XY
      S′_XY   S_Y ].

Now cov(U, V) = b′S_XY a, var(U) = b′S_X b, and var(V) = a′S_Y a, so

corr²(U, V) = b′S_XY a a′S′_XY b / (b′S_X b a′S_Y a).

Let a = S_Y^{−1} S′_XY b. Then corr²(U, V) reduces to

corr²(U, V) = b′S_XY S_Y^{−1} S′_XY b / (b′S_X b).   (4.5)

Maximizing Equation 4.5 is equivalent to solving the generalized eigenvalue problem

S_XY S_Y^{−1} S′_XY b = l S_X b.   (4.6)

Let S_X^{1/2} be the symmetric square root of S_X and S_X^{−1/2} be the symmetric square root of S_X^{−1}. Then the generalized eigenvalue problem can be solved by:

S_X^{−1/2} S_XY S_Y^{−1} S′_XY S_X^{−1/2} (S_X^{1/2} b) = l (S_X^{1/2} b),

i.e., we can find the eigenvalues of the symmetric matrix

S_X^{−1/2} S_XY S_Y^{−1} S′_XY S_X^{−1/2}.

Let s = min(p, q). Then let l1 ≥ l2 ≥ · · · ≥ ls be the ordered eigenvalues and b1, b2, . . . , bs be the corresponding eigenvectors which solve Equation 4.6. Using a similar development, a1, a2, . . . , as are the eigenvectors which solve the following generalized eigenvalue problem:

S′_XY S_X^{−1} S_XY a = l S_Y a.   (4.7)

Now define Ui = b′_i X and Vi = a′_i Y. U1 and V1 satisfy the condition of having maximum correlation, with U1 in the X-space and V1 in the Y-space, i.e., corr(U1, V1) = √l1 = r1 defines the maximum correlation and, in general, corr(Ui, Vi) = √li = ri. (Ui, Vi) is called the ith pair of canonical variables and the corresponding ri is called the ith canonical correlation coefficient.

Canonical variates can be computed from either the centered data matrices, Xc and Yc, or the standardized data matrices, Xs and Ys. The canonical correlations do not depend on the standardization. However, the canonical variable coefficients do depend on the scaling. Specifically, the coefficients for the standardized variables in the X-space are D_{1/sX} b and those for the standardized variables in the Y-space are D_{1/sY} a.

The population canonical variables and correlations are defined by replacing S by Σ. For example, the population-based generalized eigenvalue problem is:

Σ_XY Σ_Y^{−1} Σ′_XY β = λ Σ_X β.

The population canonical variables are given by Ui = β′_i X and Vi = α′_i Y, where β and α are the corresponding population eigenvectors. Likewise, ρi = √λi is called the ith population canonical correlation.

An important problem is to determine the number of canonical variate pairs which are needed to explain the covariance structure between the X-space and the Y-space. Specifically, if t ≤ s is the number of nonzero canonical correlations, then the dimensionality is t. Now consider testing the hypothesis that the last s − t canonical correlations are zero, i.e.,

H0 : ρ_{t+1} = ρ_{t+2} = · · · = ρ_s = 0.

Under the null hypothesis,

−[n − (1/2)(q + p + 3)] Σ_{i=t+1}^{s} log(1 − r²_i)

has an approximate χ²-distribution with (q − t)(p − t) degrees of freedom. Testing is generally done sequentially for t = 0, 1, . . . , s − 1 until the null hypothesis is accepted. Alternately, the test can be conducted sequentially for t = s − 1, s − 2, . . . , 0 until the null hypothesis is rejected.
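
A minimal sketch of this sequential test (added here; it is not the mult package's can.corr implementation) can be written directly from the statistic above, where r is a vector of sample canonical correlations, e.g., cancor(X, Y)$cor.

> # Sketch: sequential chi-square tests that the last s - t canonical correlations are zero
> seq.cancor.test <- function(r, n, p, q) {
+   s <- length(r)
+   stat <- df <- pval <- numeric(s)
+   for (t in 0:(s - 1)) {
+     stat[t + 1] <- -(n - (q + p + 3) / 2) * sum(log(1 - r[(t + 1):s]^2))
+     df[t + 1]   <- (q - t) * (p - t)
+     pval[t + 1] <- pchisq(stat[t + 1], df[t + 1], lower.tail = FALSE)
+   }
+   data.frame(t = 0:(s - 1), ChiSq = stat, df = df, P.value = pval)
+ }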

The sample canonical variables can be found from the singular value decomposition of

S_X^{−1/2} S_XY S_Y^{−1/2} = B* D_{ri} A*′,

where the right singular vectors A* satisfy

A = S_Y^{−1/2} A*,

where A is the matrix of eigenvectors from Equation 4.7. Likewise,

B = S_X^{−1/2} B*

are the eigenvectors of Equation 4.6.

Once the effective dimensionality t is found, either by the above formal test or by a scree plot of the eigenvalues, the canonical variables associated with the t largest eigenvalues can be evaluated. Let B_t and A_t be the matrices associated with the t largest eigenvalues in the X and Y space, respectively. Specifically, U_t = Xc B_t and V_t = Yc A_t are the scores on the first t canonical variate pairs.

Example: The variables determined for each of 60 U.S. metropolitan areas (Henderson and Velleman, 1981) are now explored.

> X <- airpoll[, 1:4]

> Y <- airpoll[, 5:6]

> can.corr(X, Y)

$can.corr
[1] 0.5840712 0.4813982

$ChiSq
[1] 14.63161 37.78859

$P.value
[1] 2.160130e-03 8.236377e-06

$eff.dim
[1] 2

$B
                 x.can1        x.can2
Rainfall   0.0980700468  0.0280741149
Education  0.1770165294 -0.7169067689
Popden    -0.0003365544  0.0003211501
Nonwhite  -0.0586251828  0.0072485973

$A
          y.can1      y.can2
NOX -0.015866011  0.01754642
SO2 -0.006953357 -0.01583088

For reference, the following are the results from the built-in function cancor:

> cancor(X, Y)


$cor
[1] 0.5840712 0.4813982

$xcoef
                   [,1]          [,2]          [,3]          [,4]
Rainfall  -1.276763e-02  3.654938e-03  9.163437e-03 -1.560656e-03
Education -2.304559e-02 -9.333331e-02  1.496830e-01 -4.622270e-02
Popden     4.381565e-05  4.181018e-05  6.840132e-05  1.423651e-05
Nonwhite   7.632349e-03  9.436870e-04 -2.305663e-03 -1.387114e-02

$ycoef
             [,1]         [,2]
NOX  0.0020655787 -0.002284349
SO2  0.0009052499  0.002061005

$xcenter
  Rainfall  Education     Popden   Nonwhite
  37.36667   10.97333 3866.05000   11.87000

$ycenter
     NOX      SO2
22.65000 53.76667

The eigenvalues are the same and the eigenvectors are proportional.

Several plots are often constructed to explore the relationships between the X and Y spaces. Scatterplots can be made of Ui versus Vi for i = 1, 2, . . . , t. It is also possible to explore the relationships between the spaces spanned by the columns of U_t and V_t using an xyPlot. These spaces are subspaces of the original spaces spanned by the columns of Xc and Yc.

> par(mfrow=c(1,2))

> can.plot(X, Y)


[Figure: scatterplots of the first two pairs of canonical variate scores for the airpoll data, V1 versus U1 (left) and V2 versus U2 (right).]

The covariances or correlations between X1, X2, . . . , Xq and Y1, Y2, . . . , Yp can be approximated by a biplot. Notice that

S_XY = S_X^{1/2} B* D_{ri} A*′ S_Y^{1/2}
     ≈ S_X^{1/2} B*_t D_t A*′_t S_Y^{1/2}
     = S_X^{1/2} B*_t D_t^{α} D_t^{1−α} A*′_t S_Y^{1/2}
     = G_t H′_t,

where G_t = S_X^{1/2} B*_t D_t^{α} and H_t = S_Y^{1/2} A*_t D_t^{1−α} for 0 < α < 1. B*_t and A*_t contain the first t columns of B* and A*, respectively. If the q rows of G_t and the p rows of H_t are plotted as vectors in t-dimensional space, then the covariances (or correlations) between the variables in the X-space and those in the Y-space are approximated by the angles between g_i (i = 1, 2, . . . , q) and h_j (j = 1, 2, . . . , p). If the covariance structure between the X and Y spaces is explained by only a few canonical variables, then the covariances can be approximated well in two or three dimensions.

4.3 Partial Correlation Analysis

The development of partial correlations can be extended to the general case of multiple Yj. As shown in Appendix B, the partial covariances are given by

S_{Y·Z} = S_Y − S′_ZY S_Z^{−1} S_ZY,

and the partial correlations can be obtained by standardizing this matrix.

The partial covariances can be computed by projecting the Ys into the Z space. This is equivalent to regressing the Ys on the Zs and computing the covariance matrix from the residuals. Let P_Z be the projection matrix onto the Z space, i.e., P_Z = Zc (Z′c Zc)^{−1} Z′c. Then the residuals are (I − P_Z) Yc. Since P_Z and (I − P_Z) are symmetric idempotent matrices, the residual covariance matrix is

Y′c (I − P_Z) Yc = S_Y − S′_ZY S_Z^{−1} S_ZY.

A similar development gives

S_{·Z} = [ S_{X·Z}     S_{XY·Z}
           S′_{XY·Z}   S_{Y·Z} ]

by projecting both Xc and Yc onto the space spanned by Zc. Partial canonical correlations can be computed from this partial covariance matrix.
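
The residual-based route to S_{Y·Z} can be checked numerically. The following sketch (added here, not from the original text) uses simulated matrices and compares the formula above with the covariance matrix of the residuals from regressing Y on Z.

> # Sketch: the partial covariance matrix of Y given Z computed two ways
> set.seed(2)
> Z <- matrix(rnorm(60 * 2), 60, 2)
> Y <- Z %*% matrix(rnorm(6), 2, 3) + matrix(rnorm(60 * 3), 60, 3)
> SY <- cov(Y); SZ <- cov(Z); SZY <- cov(Z, Y)
> SY - t(SZY) %*% solve(SZ) %*% SZY   # SY - S'ZY SZ^{-1} SZY
> cov(resid(lm(Y ~ Z)))               # covariance of the residuals: the same matrix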

4.4 Assessing the Correlation Model

Correlation inferences are based on an underlying multivariate normal distribution. However, multivariate normality may not be a tenable assumption. Generally, it is best to assess the joint normality of X1, X2, . . . , Xq and Y1, Y2, . . . , Yp, although examination of the Yj or the Xi individually is often informative. Techniques for assessing univariate or multivariate normality are given in Section ??.

The univariate techniques are appropriate as a starting point, e.g., stem-and-leaf plots, box plots, and probability plots. The major emphasis, however, should be on the swarm of points in (q + p)-dimensional individual space. The gamma probability plot of the squared Mahalanobis distances is perhaps most useful.

Outlying values in the (q + p)-dimensional individual space suggest robust analyses. Nonlinear patterns of the probability plot might be remedied by transforming along selected dimensions. If transformations are applied successfully, the analyses discussed in the preceding sections are applicable. This assumes joint normality has been obtained, not just marginal normality.

The correlation analyses are generally computed in terms of the unbiased estimate of the population covariance matrix given by:

S = [ S_X     S_XY
      S′_XY   S_Y ].

The maximum likelihood estimator of Σ is Σ̂ = ((n − 1)/n) S. Canonical correlations can be based on either S or Σ̂.

Robust analyses are easily done by replacing S by the M-estimate S* as shown in Section ??.


Chapter 5

Principal Component Analysis

Principal Component Analysis (PCA) is used to determine the structure of a multivariate data set composed of numerical variables. Specifically, the purposes of PCA are:

1. to determine the “effective” dimensionality of variable space;

2. to find the linear combinations of the original variables which account for most of the variation in the multivariate system;

3. to visualize the relationships among the observations and variables;

4. to determine derived variables which contain the essential multivariate information as a first step for subsequent analyses;

5. to identify multivariate outliers;

6. to determine if population structure is present;

7. to examine the relationship between the principal variables and the internally, or externally, defined explanatory variables;

8. to explain the residual variation after adjusting for conditioning variables.

PCA is not a modeling technique, but it is often used in the modeling process to learn about the internal structure of a data set. To a large extent it is graphical, since insight into the structure of a data set using PCA is greatly enhanced by plots.


5.1 Basic Concepts

5.1.1 Sample Principal Variables

Let Y1, Y2, . . . , Yp be numerical variables whose values are centered about their respective means.¹ The object is to find derived variables, V1, V2, . . . , Vt (t ≤ p), such that the Vj are uncorrelated and have successively smaller variances, i.e.,

var(V1) ≥ var(V2) ≥ · · · ≥ var(Vt).

The Vj are in variable space and are called principal variables. The development is given initially in terms of the sample covariance matrix. Generalizations are then given.

The first principal variable is the linear combination of the Yi with maximum variance. Define

V1 = a11 Y1 + a21 Y2 + · · · + ap1 Yp = a′_1 Y,

such that var(V1) = a′_1 S a1 is maximized with respect to a1. But var(V1) can be made arbitrarily large by choosing a1 such that ‖a1‖ is large. We thus normalize a1 such that ‖a1‖² = a′_1 a1 = 1. Thus, the coefficients of V1 are found from

max_a (a′Sa)

subject to a′a = 1, or equivalently,

max_a (a′Sa / a′a).

The maximum is l1, the largest eigenvalue of S. The corresponding normalized eigenvector is a1. Thus the eigenvector corresponding to the largest eigenvalue determines the first principal variable and

var(V1) = a′_1 S a1 = l1.

The next problem is to determine the normalized linear combination

V2 = a′_2 Y,

which has the largest variance in the class of all normalized components orthogonal to V1 (i.e., constrained by a′_1 a2 = 0). Geometrically, the axes are perpendicular. The maximum variance is l2, the second largest eigenvalue of S. The corresponding normalized eigenvector is a2.

The process can be continued until t ≤ p principal variables are found. The jth principal variable is defined by

Vj = a′_j Y.

¹The centered variables would more appropriately be denoted by Y^c_j, but this notation is not used unless it clarifies the context, e.g., to compare centered and uncentered analyses.


Its variance is lj, the jth largest eigenvalue of S. This last result follows from the eigenvalue problem, since

var(Vj) = a′_j S aj = lj a′_j aj = lj,   (5.1)

where aj is the normalized eigenvector corresponding to lj. Also,

cov(Vj, Vk) = a′_j S ak = 0   (5.2)

for j ≠ k, since a′_j S ak = lk a′_j ak = 0 by the constraint. Thus, V1, V2, . . . , Vt are uncorrelated and they are ordered by decreasing variability.

PCA reduces analytically to finding the eigenvalue (spectral) decomposition of S, which, from Equations 5.1 and 5.2, is given by

S = A D_{lj} A′,   (5.3)

where the eigenvectors are the columns of A and the eigenvalues are the diagonal elements of D_{lj} (a diagonal matrix). However, the eigenvalue decomposition does not provide the values of the principal variables directly, nor is it the recommended numerical solution. The singular value decomposition is numerically more stable and it provides more information.

The centered data matrix² is given by

Yc = (1/√(n − 1)) (Y − Ȳ).   (5.4)

The singular value decomposition of the centered data matrix is then:

Yc = V D_{dj} A′,   (5.5)

where d1 ≥ d2 ≥ · · · ≥ dp ≥ 0 are the singular values, and the columns of V and A are the left and right singular vectors, respectively. From Equations 5.4 and 5.5,

S = Y′c Yc = A D_{d²j} A′ = A D_{lj} A′,

since V is orthonormal. Thus, the right singular vectors of Yc are the eigenvectors of S, and the singular values are the square roots of the eigenvalues. Also, the values of the principal variables are given by

Yc A = V D_{dj},   (5.6)

i.e., the jth column of V D_{dj} gives the centered values of the jth principal variable. The standardized values of the jth principal variable are given by:

Yc A D_{dj}^{−1} = V,   (5.7)

since V is orthonormal. The principal variable with values given by the jth column of V D_{dj} is denoted by 1Vj, whereas the principal variable with values given by V is denoted by 0Vj. In the first case, var(1Vj) = d²j = lj, whereas in the second, var(0Vj) = 1. However, when the values of these variables are plotted in variable space, the interpretations of the interpoint distances differ depending on the scaling. This representation, together with a generalized form of principal variables, will be discussed in Section 5.3.

²The centered data matrix is scaled by √(n − 1) to simplify the interpretation of the subsequent matrix decomposition. Using the “scaled” centered data matrix, Y′c Yc = S.

Many variants of principal component analysis are possible. The most common is to center and standardize the dataset. Let:

Ys = (1/√(n − 1)) (Y − Ȳ) D_{1/sj},   (5.8)

where D_{1/sj} is diagonal with the reciprocals of the standard deviations on the diagonal. The jth standardized variable is denoted by Y^s_j = Y^c_j / sj, and has as values the jth column of Ys.

The singular value decomposition of the standardized centered data matrix (Equation 5.8) is given by

Ys = Vs D_{dsj} A′s,   (5.9)

where the columns of Vs are the left singular vectors, the columns of As are the right singular vectors, and the dsj are the singular values.

Doing a singular value decomposition on Ys is equivalent to performing an eigenvalue decomposition on the sample correlation matrix R. This follows since R = Y′s Ys; the decomposition is

R = As D_{lsj} A′s,

where lsj = (dsj)².

The sample correlation matrix is the sample covariance matrix of the Y^s_j. For consistency, let V^s_j (j = 1, 2, . . . , p) denote the principal variables based on the standardized variables. As before, the principal variable with sample values given by the jth column of Vs D_{dsj} is denoted by 1V^s_j, whereas the principal variable with sample values given by the jth column of Vs is denoted by 0V^s_j.
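
The SVD route of Equations 5.4–5.6 can be illustrated with a few lines of base R (a sketch added here, not from the original text); the simulated matrix Y stands in for any numerical data matrix.

> # Sketch: principal variables via the SVD of the scaled, centered data matrix
> set.seed(3)
> Y <- matrix(rnorm(60 * 4), 60, 4)
> Yc <- scale(Y, center = TRUE, scale = FALSE) / sqrt(nrow(Y) - 1)
> sv <- svd(Yc)
> sv$d^2                                        # eigenvalues l_j of S = Y'c Yc
> eigen(cov(Y))$values                          # the same values
> max(abs(Yc %*% sv$v - sv$u %*% diag(sv$d)))   # Equation 5.6: Yc A = V D
> head(prcomp(Y)$x)                             # prcomp scores agree up to column signs and the sqrt(n - 1) scaling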

5.1.2 Robust Principal Variables

Outliers in individual space will cause ȳ and S to be poor estimators of µ and Σ. As a result, the principal variables may depend unduly on a few influential individuals.

The remedy is to perform the PCA on S*, the robust covariance matrix as defined in Section 2.4. The resulting principal variables are defined by

V*_j = a*′_j Y*,

where a*_j is the eigenvector associated with the eigenvalue l*_j of S*, and Y*_i takes as values the values of Yi times the final Huber weights. These robust principal variables are not influenced by a few outlying values.


The interpretation of the V*_j is the same as the Vj in the classical case. The robust correlations between the Y*_i and the V*_j are given by

corr(Y*_i, V*_j) = a*_{ij} √l*_j / s*_i.

It is also possible to perform a robust PCA on R*, the robust correlation matrix computed from S*.

5.1.3 Population Principal Variables

The population principal variables are defined by replacing S by Σ. The eigenvalue problem then becomes:

Σα = λα.

The population principal variables are given by Vj = α′_j Y, where αj is the normalized eigenvector corresponding to the jth largest eigenvalue λj. In this case, Y and V should be interpreted as random variables having a probability distribution. Multivariate normality is generally assumed, particularly if inferences concerning the eigenvalues are desired.

All of the above derivations are valid with population parameters replacing the corresponding sample quantities.

[This discussion needs to be expanded to include inferences about the λj .]

5.2 Examining Principal Variables

5.2.1 Interpreting Principal Variables

The eigenvectors are used to determine the contribution of each of the original variables to the principal variables. The jth sample principal variable is defined by:

Vj = a1j Y1 + a2j Y2 + · · · + apj Yp = a′_j Y.

However, the aij by themselves do not measure the contribution of Yi to Vj since the measurement scales of the Yi may be arbitrary. The scale dependence must be removed.

The correlations between the original variables and the sample principal variables will provide such a measure:

cov(Yi, Vj) = [0 · · · 1 · · · 0] S aj
            = lj [0 · · · 1 · · · 0] aj
            = lj aij.


Therefore,

corr(Yi, Vj) = lj aij / (si √lj) = aij √lj / si.   (5.10)

Thus we must rescale aij by multiplying by the ratio of the sample standard deviation of Vj to that of Yi. For the standardized case si = 1 for all i, and the correlations are given by a^s_{ij} √l^s_j. Notice that, for standardized variables, the correlations of the Yi with Vj are the elements of the eigenvector a^s_j times the square root of the jth largest eigenvalue.

The jth principal variable is interpreted by classifying the Yi according to their influence on Vj, as given by the correlations in 5.10. Yi with |corr(Yi, Vj)| ≤ 0.25 can generally be ignored. If V1 (or some other principal variable) has meaningful correlations of the same sign and magnitude with a set of Yi, and a low correlation with all other Yj, then V1 is said to define an “index” based on the Yi variables. In other cases, the meaningful correlations are of similar magnitude, but the sign of the variables in one group is opposite that of the variables in another group. The principal variable is then said to define a “contrast.”
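
These correlations can be computed directly from a covariance-based PCA. The following sketch (added here, not from the original text) uses prcomp on a simulated matrix and checks Equation 5.10 against the correlations of the data with the principal variable scores.

> # Sketch: corr(Yi, Vj) = a_ij sqrt(l_j) / s_i for a covariance-based PCA
> set.seed(4)
> Y <- matrix(rnorm(60 * 3), 60, 3) %*% matrix(c(1, 0.5, 0.2, 0, 1, 0.4, 0, 0, 1), 3, 3)
> pc <- prcomp(Y)                                        # analysis of the covariance matrix
> A <- pc$rotation; l <- pc$sdev^2
> sweep(A %*% diag(sqrt(l)), 1, apply(Y, 2, sd), "/")    # Equation 5.10
> cor(Y, pc$x)                                           # direct computation: the same matrix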

Principal variables can also be interpreted by relating the Vj to external variables. Let X1, X2, . . . , Xq be explanatory variables. Then insight into the principal dimensions may be gained by the correlations between the Xi and the Vj. However, nothing guarantees any meaningful interpretation, since the Xi were not part of the analysis. Formally incorporating the Xi into PCA is done in Section 5.5.1, and is called canonical PCA.

The sample principal variables are not invariant to scale changes in the original variables, e.g., it is not possible to obtain V^s_j from Vj, or vice versa. An analysis of the sample covariance matrix is preferred if the variances of the original variables are comparable. However, this is rarely true even if the Yi are based on measurements on the same underlying scale, e.g., lengths of skull measurements in cm. Suppose

var(Yi) ≫ var(Yj)

for j ≠ i. Then V1 ≈ kYi for some scalar k, i.e., V1 is dominated by a single variable. The remedy is to scale the Yi so that the scaled variables are “comparable.”

Standard deviations are the most common scaling. More generally, select s1, s2, . . . , sp such that a unit change in Yi/si is comparable to a unit change in Yj/sj, where the si may or may not represent standard deviations. The researcher can scale the Yi in any meaningful way, but as noted above each scaling defines a new set of principal variables in variable space. For example, measurement theory, together with subject matter knowledge, may suggest a scale independent of the data. In this case, the standard deviation is not the appropriate scaling. The researcher is urged to consider the scaling carefully and not to automatically use sample standardized variables.


5.2.2 Determining Dimensionality

The eigenvectors corresponding to the large eigenvalues determine the dimensions containing most of the multivariate variability. More specifically, the eigenvalues determine the proportion of the total variance explained by the sample principal variables. The total variance is defined by:

tr(S) = Σ_{i=1}^{p} s²_i = Σ_{j=1}^{p} lj.

Thus, the proportion of the total variability explained by Vj is lj / Σ lj, and the proportion explained by the first t principal variables is

Σ_{j=1}^{t} lj / Σ_{j=1}^{p} lj.

These results are true, using the appropriate eigenvalues, whatever the scaling of the original variables. In the case of correlations, the denominator is p since tr(R) = Σ l^s_j = p.

By examining the proportion of the variation explained, the important sample principal variables are identified. But how many should be retained? The “scree” plot gives a graphical representation by plotting j versus lj. Eigenvalues (dimensions) are retained until the “scree” plot levels out at a low value, or until a sufficient amount of the variability, e.g., 90%, is explained.

An alternate strategy is to retain only those V^s_j, based on standardized data, whose eigenvalues exceed 1. Since the variances of the Y^s_j are all 1, this ensures that the variances of the retained principal variables are at least as great as the variances of the Y^s_j.
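
A short sketch (added here, not from the original text) of both retention rules using prcomp on standardized data follows; the simulated matrix is a stand-in for any data set.

> # Sketch: scree plot, proportion of variability explained, and the eigenvalue-greater-than-1 rule
> set.seed(5)
> Y <- matrix(rnorm(60 * 5), 60, 5)
> pc <- prcomp(Y, scale. = TRUE)          # analysis of the correlation matrix R
> l <- pc$sdev^2                          # eigenvalues; sum(l) equals p
> plot(l, type = "b", xlab = "j", ylab = "eigenvalue")   # scree plot of j versus l_j
> cumsum(l) / sum(l)                      # cumulative proportion of the total variability
> which(l > 1)                            # components retained by the eigenvalue-greater-than-1 rule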

The principal variables corresponding to near zero eigenvalues define dimensions in which little variability is present. Suppose l_{t+1} ≈ l_{t+2} ≈ · · · ≈ lp ≈ 0, i.e., the last p − t eigenvalues of S are nearly zero. Thus the sample variances of V_{t+1}, V_{t+2}, . . . , Vp are nearly zero. Therefore,

Vj = a1j Y1 + a2j Y2 + · · · + apj Yp ≈ 0.

This results from the fact that var(Vj) ≈ 0 ⇒ Vj ≈ 0 since Vj is centered about 0. We thus have near redundancies. If a1j ≠ 0, then

Y1 ≈ −(a2j/a1j) Y2 − · · · − (apj/a1j) Yp.

Thus we can solve approximately for Y1 in terms of the other Yi. If the last p − t eigenvalues are near 0, then p − t of the Yi can be expressed approximately in terms of the other Yk. These near “singularities” or collinearities are generally defined to give substantive meaning. Dropping terms which have “near” zero coefficients often simplifies the interpretation.

In essence, we are saying that the major variation lies in a t-dimensional subspace of variable space. The (p − t)-dimensional subspace orthogonal to this contains little variation. Thus the near “singularities” determine a space which contributes little information. The dimensionality of this space is invariant to scale changes.

The effective dimensionality is invariant to the scale of the Yj, i.e., the number of eigenvalues “near” zero does not depend on the scaling. [Discuss why this is so.]

5.2.3 Viewing Principal Variables

The principal variables are equivalent geometrically to the principal axes of the concentration ellipsoids determined by:

(y − ȳ)′ S^{−1} (y − ȳ) = k,

for some positive constant k. [Discuss orientation and magnitude and the probability approximations.]

5.2.4 Fitting Principal Variables

Principal component analysis can be viewed as a method of fitting subspaces of dimension t ≤ p to the data. Consider the case in which p = 2. Let the orthogonal distance from yi to the coordinate defined by V1 be d_{i1}. The eigenvector associated with the largest eigenvalue can be found by minimizing Σ d²_{i1}. Notice that d_{i1} is equal to the projection of yi onto V2. Thus,

d²_{i1} = [a′_2 (yi − ȳ)]².

This process can be repeated. In general,

d²_{it} = Σ_{j=t+1}^{p} [a′_j (yi − ȳ)]²,

i.e., d²_{it} is the lack-of-fit of the ith individual from the t-dimensional space spanned by V1, V2, . . . , Vt.

A t-dimensional subspace may account for most of the variation in a system. Nonetheless, certain yi may not lie near this subspace, as indicated by large d²_{it}. Outliers in the (p − t)-dimensional space orthogonal to V1, V2, . . . , Vt can be identified by a gamma probability plot. The d²_{it} approximately follow a gamma distribution with a shape parameter which must be estimated from the data (i.e., the shape parameter is not (p − t)/2).

5.3 Plotting Principal Variables

Traditionally, the values of the principal variables given by Equation 5.6 are plotted as n points in t-dimensional space (often two at a time). These plots are informative, but the positions of the observations are misleading since the interpoint distances represent Euclidean, not Mahalanobis, distances. Also, the axes may be interpretable in terms of the “important” original variables, but individuals may not be easily interpretable in terms of which variables “affect” them. The biplot overcomes these objections by providing a t-dimensional representation of both the individuals and variables.

Principal component analysis was developed in terms of an inner product defined over variable space. Specifically, the sample covariance and correlation matrices are the most commonly used inner products. The principal variables can be defined equivalently in terms of the singular value decomposition of the centered, or centered and scaled, data matrix (Equations 5.5 and 5.9). This development naturally leads to a class of graphs called principal component biplots.

The development of biplots will be done here in terms of the centered data matrix, which is equivalent to an eigenanalysis of the covariance inner product, but the procedure is appropriate for any scaling. The singular value decomposition of the centered data matrix is given by Equation 5.5. Now consider the factorization:

V D_{dj} A′ = G H′,

where G = V D_{dj}^{α} and H′ = D_{dj}^{1−α} A′ for 0 ≤ α ≤ 1. Although α can be any value between 0 and 1, the most common and interpretable choices are α = 0, 1/2, and 1. Every choice of α determines a principal variable basis, αVj (j = 1, 2, . . . , p), which generalizes the bases defined by the 1Vj and 0Vj, as defined in Section 5.1.1. The values of αVj are given by the elements of the jth column of V D_{dj}^{α}.

The biplot consists of plotting gi, i = 1, 2, . . . , n, the rows of G, and hi, i = 1, 2, . . . , p, the rows of H, on the same plot. The difficulty is that the resulting plot is p-dimensional, which is not particularly useful unless p is 2 or 3. However, the first t singular values, and the corresponding singular vectors, often give a good approximation to the centered data matrix Yc. Therefore, let g_{it} and h_{it} be vectors with elements consisting of the first t elements of gi and hi, respectively. If the squares of the first t singular values explain a large proportion of the variation, then a t-dimensional approximation will be good. The interpretation of these points depends on the choice of α.

First, consider the case of α = 0 for which G = V and H′ = D_{dj} A′. This corresponds to the principal variables 0Vj, j = 1, 2, . . . , p. Taking the cross-product of the centered data matrix gives

S = A D_{d²j} A′ = H H′.

As a result s_{ij} = h_i h′_j ≈ h_{it} h′_{jt}. The p “variable” points, with coordinates determined by the elements of h_{it}, generally are graphed as rays to distinguish them from the individual points. These rays, representing the original variables in the t-dimensional biplot, reproduce the covariance matrix in the following sense: the lengths of the rays approximately equal the standard deviations of the original variables and the cosines of the angles correspond to the correlations among these variables.


The individuals are represented by the g_{it}, which contain the standardized values of the first t principal variables. The interpoint distances approximately equal Mahalanobis distances, which is a more appropriate measure of “closeness.” This follows, since

(yi − yj)′ S^{−1} (yi − yj) = (gi − gj)′(gi − gj) ≈ (g_{it} − g_{jt})′(g_{it} − g_{jt})

after algebraic simplification. The interpoint distances among the standardized principal variable values in t-dimensional space thus represent an approximation to the Mahalanobis distances.

Next consider the case of α = 1 for which G = V D_{dj} and H′ = A′. This corresponds to the principal variables 1Vj, j = 1, 2, . . . , p. The biplot here is simply the principal variable values for each of the n individuals augmented by the principal variable coefficients for each of the p original variables. The elements of g_{it} give the coordinates representing the values of the first t principal variables. Their interpoint distances represent Euclidean distances, which give equal weight to each of the original variables. The elements of h_{it} give the coefficients of the Yi for the first t principal variables.
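
The factorization underlying the biplot is easy to construct directly from the SVD. The following sketch (added here, not from the original text) builds G and H for a chosen α from a simulated centered data matrix and verifies that GH′ reproduces Yc.

> # Sketch: the G, H biplot factorization of the scaled, centered data matrix
> set.seed(6)
> Y <- matrix(rnorm(60 * 4), 60, 4)
> Yc <- scale(Y, center = TRUE, scale = FALSE) / sqrt(nrow(Y) - 1)
> sv <- svd(Yc)
> alpha <- 0                               # alpha = 0 gives Mahalanobis-like interpoint distances
> G <- sv$u %*% diag(sv$d^alpha)           # individual coordinates
> H <- sv$v %*% diag(sv$d^(1 - alpha))     # variable coordinates (plotted as rays)
> max(abs(G %*% t(H) - Yc))                # G H' reproduces Yc for any alpha
> plot(G[, 1:2], xlab = "Dim 1", ylab = "Dim 2"); arrows(0, 0, H[, 1], H[, 2])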

5.4 Generalized PCA

The singular value decomposition of Yc is given by Equation 5.5. This development can be generalized by relaxing the form of the orthogonality constraint. Consider the decomposition

Yc = V D_{dj} A′,

where V′WV = I and A′SA = I with W and S specified positive definite matrices. The resulting decomposition is called the generalized singular value decomposition. In nearly all cases, W and S are diagonal with elements w²_i and s²_j, respectively.

This generalized problem can be rewritten as

W^{1/2} Yc S^{1/2} = W^{1/2} V D_{dj} A′ S^{1/2} = Vs D_{dj} A′s,

where Vs = W^{1/2} V and As = S^{1/2} A. Using this representation, V′s Vs = I and A′s As = I, and thus the regular singular value decomposition algorithms can be used.

The principal variables could be defined by the columns of either A or As. As defines orthogonal principal variables based on the scalings imposed by S^{1/2} and the row weights imposed by W^{1/2}. On the other hand, A defines oblique principal variables, i.e., the principal variables are correlated.

The above generalization allows the user to choose row weights wi and column scalings sj. For example, if wi = 1 for all i and sj = 1/√var(Yj), then a standardized PCA results.


Example: The variables determined for each of 60 U.S. metropolitan areas (Henderson and Velleman, 1981) are now explored using PCA.

> library(mult)

> data(airpoll)

By choosing α = −1, the plot is suppressed.

> airpoll.pca <- pca(airpoll[, 5:6], alpha = -1)

[1] "Alpha must be between 0 and 1"

> airpoll.pca$pca.var

[1] 1.4094 0.5906

> airpoll.pca$var.p

[1] 0.7047 0.2953

Notice that only the first component has an eigenvalue (variance) greater than 1. Nonetheless, the first component only accounts for about 70% of the total variation.

> airpoll.pca$pca.coeff

        V1      V2
NOX 0.9966 -0.4176
SO2 0.9966  0.4176

> airpoll.pca$pca.corr

        V1      V2
NOX 0.8395 -0.5434
SO2 0.8395  0.5434

The first principal variable is a “general” pollution variable, whereas the second principal variable is a contrast between the two pollution variables.

> airpoll.pca <- pca(airpoll[, 5:6])


[Figure: PCA biplot of the airpoll pollution variables. The 60 metropolitan areas are plotted on the first two principal components (PC 1, PC 2), with rays for NOX and SO2.]

The biplot shows the relationship between the pollution variables and the metropolitan areas. Notice that certain California cities are aligned with NOX, whereas most are aligned with SO2.

5.5 Constrained PCA

Principal component analysis is concerned with understanding the variation in Y-space. The principal variables are often correlated post hoc with externally defined variables. However, nothing guarantees that principal variables will be related to these variables, denoted here by X1, X2, . . . , Xq. An important variation on PCA integrates the X1, X2, . . . , Xq directly into the analysis. Another variation of PCA derives principal variables based on appropriately defined residual variables.

Specifically, three problems will be addressed:

1. Determine the linear combinations in the X-space that explain the largest proportions of the variability of Y1, Y2, . . . , Yp. The resulting analysis is called canonical PCA or redundancy analysis.

2. Find the linear combinations with maximum variances in the component of the Y-space orthogonal to the Z-space. The resulting analysis is called partial PCA.

3. Determine the linear combinations in the component of the X-space orthogonal to the Z-space that explain the largest proportions of the variability of variables in the Y-space orthogonal to the Z-space. The resulting analysis is called partial canonical PCA.

The following subsections discuss these in turn.

5.5.1 Canonical Principal Variables

A common objective is to find variables in the X-space which maximally explain the variation of the Yj considered jointly. Let U1, U2, . . . , Ut (t ≤ s = min(p, q)) be linear combinations of X1, X2, . . . , Xq which explain successively less of the variability of the Yj.

Specifically, let

r²_{U1 Yj} = corr²(U1, Yj) = cov²(U1, Yj) / [var(U1) var(Yj)].

The variation of Yj explained by U1 is var(Yj) r²_{U1 Yj}. U1 = b′X is the variable in X-space that maximizes the “total variation explained,” defined by maximizing Σ var(Yj) r²_{U1 Yj} = Σ cov²(U1, Yj)/var(U1). U2 is the variable in X-space orthogonal to U1 which explains the next highest total variation. The process is continued iteratively.

Let S_X = X′c Xc and S_XY = X′c Yc. The problem reduces to finding U = Xc b such that

U′ Yc Y′c U / (U′U)

is maximized. Substituting U = Xc b gives

b′ X′c Yc Y′c Xc b / (b′ X′c Xc b) = b′ S_XY S′_XY b / (b′ S_X b).

Maximization is satisfied by the generalized eigenvalue problem

S_XY S′_XY b = l S_X b,

or equivalently

S_X^{−1/2} S_XY S′_XY S_X^{−1/2} (S_X^{1/2} b) = l (S_X^{1/2} b).

Let b* = S_X^{1/2} b. Notice that this form is equivalent to the canonical correlation analysis with S_Y = I (Equations 4.5 and 4.6).

The values for the “constrained” principal variables, U1, U2, . . . , Ut, are given by Xc B. These variables can be plotted against the original Yj or their projections on the Y-space.

In terms of the singular value decomposition we have

S_X^{−1/2} S_XY = B* D_{ri} A*′.


This decomposition allows a biplot to be made. The coordinates for the points representing the Y-space are $A^*$, whereas the coordinates for the X-space are $S_X B D_{r_i}$. Generally, only the first t columns of these matrices are used.

A variation on canonical PCA allows both constrained and unconstrained principal variables to be fit. Suppose t is specified in advance. Then $U_1, U_2, \ldots, U_t$ define constrained principal variables. Unconstrained principal variables can then be found in the component of the Y-space orthogonal to the projection of the $U_j$ onto Y. This is a type of partial analysis which is discussed next.

Example (cont): The variables determined for each of 60 U.S. metropolitan areas (Henderson and Velleman, 1981) are now explored using canonical PCA. By choosing α = −1, the plot is suppressed. The first canonical variable (U1) explains nearly 80% of the variation in the pollution variables.

> library(mult)

> data(airpoll)

> airpoll.can <- pca.can(airpoll[, 1:4], airpoll[, 5:6], alpha = -1)

[1] "Alpha must be between 0 and 1"

> airpoll.can$U.var

[1] 1485.2251 382.2194

> airpoll.can$U.p

[1] 0.7953249 0.2046751

> airpoll.can <- pca.can(airpoll[, 1:4], airpoll[, 5:6])

> airpoll.can$U.coeff

                  U1          U2
Rainfall  -0.5501088  0.78248274
Education -0.2637226 -0.46977688
Popden     0.6996159  0.40343408
Nonwhite   0.3719788  0.06535579

> airpoll.can$U.corr

           U1         U2
NOX 0.4424818 -0.3572919
SO2 0.5147923  0.1640684


[Biplot of the first two canonical PCA dimensions (1st Dimension vs. 2nd Dimension), showing the X variables Rainfall, Education, Popden, and Nonwhite and the Y variables NOX and SO2.]

The first canonical variable is a “general” pollution variable, whereas the second canonical variable is a contrast between the two pollution variables.

5.5.2 Partial Principal Variables

Understanding the residual variation among Y1, Y2, . . . , Yp after linearly adjusting for Z1, Z2, . . . , Zr is often of interest. The resulting analysis is a straightforward modification of the standard principal component analysis. Simply replace the covariance matrix $S_Y$ by the partial covariance matrix $S_{Y \cdot Z}$. Likewise, a standardized analysis is done on the partial correlation matrix.

The analysis can also be based on the singular value decomposition of the residual Y variables, denoted by $Y_j^r$. Let $P_Z$ be the projection matrix into the Z-space. The residual matrix is then defined by $(I - P_Z)Y_c$ and is denoted by $Y_c^r$. The singular value decomposition is carried out on this latter matrix following the usual procedures.
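As a concrete illustration of this residual-based approach, here is a minimal R sketch (not the mult package's implementation); the function name pca.partial and the assumption that Y and Z are numeric data matrices on the same n individuals are hypothetical.

pca.partial <- function(Y, Z) {
  ## Center both sets of variables (Y_c and Z_c in the notation above).
  Yc <- scale(Y, center = TRUE, scale = FALSE)
  Zc <- scale(Z, center = TRUE, scale = FALSE)
  ## Residual matrix (I - P_Z) Y_c from the multivariate regression of Y_c on Z_c.
  Yr <- residuals(lm(Yc ~ Zc))
  ## Partial principal variables come from the SVD of the residual matrix.
  svd(Yr)
}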

5.5.3 Partial Canonical Principal Variables

The canonical principal variables can be derived after adjusting both the Y-space and X-space for Z1, Z2, . . . , Zr. This is accomplished by replacing $S_X$ by $S_{X \cdot Z}$ and $S_{XY}$ by $S_{XY \cdot Z}$ in Section 5.5.1. The resulting analysis generates partial canonical principal variables.


Alternately, the analysis can begin from $Y_c^r$ and $X_c^r$, the residual data matrices for the Y and the X variables, respectively.


Chapter 6

Correspondence Analysis

Correspondence Analysis (CA) is used to visually represent the relationships between the rows and columns of a matrix of counts. The matrix of counts arises from two fundamentally different models: one based on numerical count variables and the other on categorical variables.

In the first model, the count values of the numerical variables are arranged in a multivariate dataset. The underlying probability model for the variables is typically the Poisson distribution, although the negative binomial and binomial are also frequently assumed. As an example, community ecologists often measure the abundances of species on each of n sites. The species define the p variables; their number is typically large and generally exceeds the number of sites.

The second model is developed in terms of the rows and columns of a table of counts, in which the rows represent the r possible values of a categorical variable and the columns represent the c possible values of another categorical variable. The resulting r × c table consists of counts of the joint occurrences of the possible values of these variables. This representation was the original idea behind CA.

Correspondence analysis can be generalized in two ways. First, both the multivariate and contingency forms can be constrained by requiring the relationships be expressed in terms of covariates. That is, the relationships between the rows and columns of the dataset (table), which are attributable to the covariates, are modeled.

The second generalization allows multidimensional contingency tables to be analyzed. Specifically, the relationships among three or more categorical variables can be visualized.

6.1 Basic Concepts

The development of correspondence analysis will be given in terms of the multivariate count variables, followed by a discussion of the similarities and differences with CA based on categorical variables.

6.1.1 Count Variable Model

Let Y1, Y2, . . . , Yp be numerical count variables. The resulting data matrix Y has counts as values and, as such, all values are nonnegative. $y_{ij} \ge 0$ is the number of occurrences of Yj on the ith individual.

Let $\sum_{ij} y_{ij} = y_{..}$, $\sum_j y_{ij} = y_{i.}$, and $\sum_i y_{ij} = y_{.j}$. Then $Y_p = Y / y_{..}$ represents the proportions of the total occurrences. Further, let $y_r = Y_p 1_p$ be the vector of row proportions, i.e., the vector with $i$th element $y_{i.}/y_{..}$, and let $y_c = Y_p' 1_n$ be the vector of column proportions, i.e., the vector with $j$th element $y_{.j}/y_{..}$. Then define:
$$Y_p = Y_p - y_r y_c' = \frac{1}{y_{..}}\left[ y_{ij} - \frac{y_{i.}\, y_{.j}}{y_{..}} \right].$$
This corresponds to the residuals computed as the observed proportions minus the estimated expected proportions under a model of “independence.” The residuals should be small if the count proportions across variables do not vary substantially from individual to individual.

Correspondence analysis treats $Y_p$ as the data matrix, which is a doubly centered form of the original data matrix. The representation of the rows and columns of $Y_p$ is computed from a generalized form of its singular value decomposition. Specifically, let
$$Y_p = V D_{d_j} A', \qquad (6.1)$$
where $D_{d_j}$ is a diagonal matrix with the singular values $d_j$ on the diagonal, and the orthogonality of V and A is defined by metrics which account for the relative frequencies of the rows and columns, respectively. Specifically, let:
$$D_r = \mathrm{Diag}[y_r] \quad \text{and} \quad D_c = \mathrm{Diag}[y_c].$$
The orthogonality is defined using $D_r^{-1}$ and $D_c^{-1}$ as metrics, i.e., $V' D_r^{-1} V = I$ and $A' D_c^{-1} A = I$.

By “standardizing” $Y_p$, correspondence analysis can be computed using the ordinary singular value decomposition. The standardized residual matrix is given by:
$$Y_{ps} = D_r^{-1/2}\, Y_p\, D_c^{-1/2}. \qquad (6.2)$$
The residual in the $ij$th position is standardized by the size of the estimated expected relative frequency under “independence.” More specifically, the following standardization is used:
$$Y_{ps} = \left[ \frac{y_{ij} - y_{i.}\, y_{.j}/y_{..}}{\sqrt{y_{i.}\, y_{.j}}} \right] = [r_{ij}].$$
The assessment of “independence” generally is based on a $\chi^2$ test computed as $y_{..} \sum r_{ij}^2$.


The singular value decomposition of $Y_{ps}$ is:
$$Y_{ps} = V_s D_{d_j} A_s',$$
where the singular values of $Y_p$ and $Y_{ps}$ are the same, but $V_s' V_s = I$ and $A_s' A_s = I$. The generalized left and right singular vectors can be expressed in terms of $V_s$ and $A_s$. Substituting Equation 6.1 into Equation 6.2 gives $V = D_r^{1/2} V_s$ and $A = D_c^{1/2} A_s$.
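The computation can be sketched with base R's svd(); the function below is a hypothetical illustration (not the mult package), assuming Y is an n × p matrix of nonnegative counts.

ca.svd <- function(Y) {
  P   <- Y / sum(Y)                                 # proportions, Y / y..
  yr  <- rowSums(P)                                 # row proportions y_i. / y..
  yc  <- colSums(P)                                 # column proportions y_.j / y..
  Yps <- (P - outer(yr, yc)) / sqrt(outer(yr, yc))  # standardized residuals Y_ps
  s   <- svd(Yps)                                   # Y_ps = V_s D_dj A_s'
  chisq <- sum(Y) * sum(Yps^2)                      # chi-square test of "independence"
  list(d = s$d, Vs = s$u, As = s$v, chisq = chisq)
}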

Thus, correspondence analysis attempts to explain the variability among the residuals using a generalized form of PCA. Let $Y_j^s$ be the “variable” with observed values given by the jth column of $Y_{ps}$. Then derived variables can be defined which are uncorrelated and account for successively smaller variances, i.e., they explain successively smaller amounts of the variability among the residuals.

The jth correspondent variable, ${}^{\alpha}V_j^s$, is defined as a linear combination of the $Y_k^s$ with coefficients given by the jth column of $D_c^{-1/2} A D_{d_j}^{1-\alpha}$. The values of ${}^{\alpha}V_j^s$ are given by $D_r^{-1/2} V D_{d_j}^{\alpha}$ for $0 \le \alpha \le 1$. The usual choices for α are 0, 1/2, and 1, as with PCA.

6.1.2 Categorical Variable Model

Correspondence analysis for two categorical variables is developed in this section. Generalizations to more than two variables are given in Section 6.4.

Let $Y_j$ be a categorical variable with $g_j$ levels. Further, let $Y_{ji}$ be the ith dummy variable for $Y_j$ and $\mathbf{Y}_j$ the matrix of dummy variables. Now consider the case in which $Y_1$ and $Y_2$ are categorical variables with r and c levels, respectively. The r × c contingency table representing the numbers of joint occurrences is given by:
$$Y = Y_1' Y_2.$$
This Y is used in place of the direct count matrix used in the development of correspondence analysis given in Section 6.1.
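For example, with two small hypothetical factors, the cross-product of their dummy-variable matrices reproduces the contingency table of joint counts:

f1 <- factor(c("a", "a", "b", "b", "b"))   # hypothetical categorical variable Y1
f2 <- factor(c("x", "y", "x", "x", "y"))   # hypothetical categorical variable Y2
Y1 <- model.matrix(~ f1 - 1)               # n x r matrix of dummy variables
Y2 <- model.matrix(~ f2 - 1)               # n x c matrix of dummy variables
t(Y1) %*% Y2                               # same counts as table(f1, f2)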

The hypothesis of independence between Y1 and Y2 is really what is of interest; deviations from independence are expressed by the pattern of residuals. The rows and columns of Y, and hence the resulting left and right singular vectors, are treated symmetrically.

6.2 Plotting Correspondent Variables

The generalized singular value decomposition in Equation 6.2 can be expressed as:
$$Y_{ps} = V_s D_{d_j}^{\alpha} D_{d_j}^{1-\alpha} A_s' = G H'.$$
A joint plot can be constructed from the rows of G and H, which represent the rows and columns of $Y_{ps}$.

Although the derivation is identical to that of principal components, the interpretation is not.


6.3 Constrained Correspondence Analysis

6.4 Multiple Correspondence Analysis


Chapter 7

Factor Analysis

7.1 The Basic Model

The factor model is:
$$Y = \Lambda F + Z,$$
where, for q ≤ p, Y (p × 1) is the vector of observable random variables, Λ (p × q) is the matrix of factor loadings, F (q × 1) is the vector of (unobservable) common factors, and Z (p × 1) is the vector of unique factors. This model can be represented in a path diagram.

This model can also be written as:
$$Y_j = \lambda_{j1} F_1 + \lambda_{j2} F_2 + \cdots + \lambda_{jq} F_q + Z_j = \lambda_j F + Z_j,$$
where $\lambda_j$ is the jth row of Λ, j = 1, 2, . . . , p.

The following distributional assumptions are made for this model:

1. E(Y ) = 0; V (Y ) = Σy

2. E(F ) = 0; V (F ) = Σf

3. E(Z) = 0; V (Z) = ∆

4. Cov(F ,Z) = 0,

where
$$\Delta = \begin{bmatrix} \delta_1^2 & 0 & \cdots & 0 \\ 0 & \delta_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \delta_p^2 \end{bmatrix},$$


i.e., the unique factors are assumed to be uncorrelated. From the linear model it can be shown that
$$\Sigma_y = \Lambda \Sigma_f \Lambda' + \Delta.$$
This is the oblique model. If $\Sigma_f = I$, the model is said to be orthogonal. In this case,
$$\Sigma_y = \Lambda \Lambda' + \Delta.$$
Notice that the factor variables $F_j$ are both uncorrelated and standardized in the orthogonal model.

Often the correlation matrix is used:

Γ = ΛΛ′ + ∆,

where $\Gamma = (\rho_{ij})$ is the population correlation matrix. In this case, the original variables $Y_i$ are standardized. Let
$$\Gamma^* = \Gamma - \Delta,$$
where $\Gamma^*$ is called the reduced correlation matrix. The number of common factors required to reproduce Γ is q, the rank of $\Gamma^*$.

How are the factors derived? The following method for determining the factors is applied to the reduced correlation matrix, although it could be applied to a reduced covariance matrix.

7.2 The Principal Factor Method

The factors are determined one at a time. Suppose we want the factor variable $F_1$ which makes the largest contribution to the communality (see ...) of the original variables, i.e., maximize
$$V_1 = \sum_{i=1}^{p} \lambda_{i1}^2 = \lambda_1' \lambda_1,$$
such that $\Gamma^* = \Lambda \Lambda'$, etc.

This problem is equivalent to finding the eigenvalues and eigenvectors of $\Gamma^*$, the reduced correlation matrix. Denote the eigenvalues by $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_q$ and the corresponding eigenvectors by $\alpha_1, \alpha_2, \ldots, \alpha_q$. Then $V_1 = \gamma_1$ and $\lambda_1 = \sqrt{\gamma_1}\,\alpha_1$; $V_2 = \gamma_2$ and $\lambda_2 = \sqrt{\gamma_2}\,\alpha_2$; etc.
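A minimal R sketch of the principal factor computation is given below; it is a hypothetical illustration that initializes the communalities with squared multiple correlations, one common choice.

principal.factor <- function(Y, q) {
  R <- cor(Y)
  h2 <- 1 - 1 / diag(solve(R))        # SMC estimates of the communalities
  R.star <- R
  diag(R.star) <- h2                  # reduced correlation matrix
  e <- eigen(R.star, symmetric = TRUE)
  ## Loadings: lambda_j = sqrt(gamma_j) * alpha_j for the first q factors.
  Lambda <- e$vectors[, 1:q, drop = FALSE] %*% diag(sqrt(e$values[1:q]), q)
  rownames(Lambda) <- colnames(R)
  colnames(Lambda) <- paste0("F", 1:q)
  Lambda
}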

7.3 Variable Space Interpretation

Geometrically, the Yi are related to the Fj as follows.


Thus, Yi is partitioned into two uncorrelated (orthogonal) variables:

Yi = λiF + Zi

= Ci + Zi,

for i = 1, 2, . . . , p. $C_i$ is the variable in the common factor space and $Z_i$ is the variable unique to $Y_i$. The mean of $C_i$ is clearly 0 and the variance is given by:
$$V(C_i) = V(\lambda_i F) = \lambda_i \Sigma_f \lambda_i' = \lambda_i \lambda_i' = \sum_{j=1}^{q} \lambda_{ij}^2 = h_i^2,$$
assuming $\Sigma_f = I$. Thus, $V(Y_i) = \sigma_i^2 = h_i^2 + \delta_i^2$.

The total variance of Y is defined by:
$$\mathrm{tr}(\Sigma_y) = \sum_{i=1}^{p} \sigma_i^2 = \sum_i (h_i^2 + \delta_i^2) = V + \delta.$$
$V = \sum h_i^2$ is called the total communality and $\delta = \sum \delta_i^2$ is the total uniqueness.

The covariance of Yi and Fj is given by:

Cov(Yi, Fj) = λij .

Thus the factor loadings represent covariances.

7.4 Sample Principal Factors

A factor analysis is actually computed on the n × p data matrix Y. Specifically, we want to decompose Y by:
$$Y = F\Lambda + Z,$$
where F is n × q, Λ is q × p, and Z is n × p. The sample reduced correlation matrix $R^*$ is used in place of $\Gamma^*$.

7.5 The Rotation Problem


Chapter 8

Discriminant Analysis

The discriminant model is used to understand how groups differ in multivariate space or to classify observations into groups based on numeric variables. In particular, the outcome variable Y is categorical, representing the groups that classify individuals, and the explanatory variables (X1, X2, . . . , Xq) are numeric.

8.1 The Basic Model

Two issues are discussed in the following sections:

1. Are the group means different?

2. How is a new individual classified?

If group means differ, the researcher often wants to characterize the discriminating variable space.

The two-group model is considered first, i.e., the outcome variable Y is binary.

8.1.1 The 2-sample Problem

The two-group discriminant model is concerned with testing for a mean difference or classifying an individual into one of two groups, G1 or G2. The explanatory variables are assumed to have a multivariate normal distribution, i.e.,

f(x1, x2, . . . , xq|y = i) = n(x;µi,Σ), i = 1, 2.

Let $n_i$ be the number of observations for which y = i, i = 1, 2, i.e., the $(x_i, y_i)$ are randomly determined by sampling. Alternatively, it is possible to fix Y in the sense that $n_i$ observations are sampled for Y = i.

Sample statistics are computed for each group. The sample mean $\bar{X}_i$ estimates $\mu_i$ and the sample covariance matrix, $S_i$, estimates $\Sigma_i$, i = 1, 2. The classical model assumes that $\Sigma_1 = \Sigma_2 = \Sigma$ and hence $S_1$ and $S_2$ both estimate $\Sigma$.

Consider testing
$$H_0: \mu_1 = \mu_2.$$
The test will be the two-sample t-test along the dimension of greatest discrimination.

Let
$$U = a'X = a_1 X_1 + a_2 X_2 + \cdots + a_q X_q.$$

The two-sample t-test along U is defined by
$$t_a = \frac{a'(\bar{X}_1 - \bar{X}_2)}{SE_{a'(\bar{X}_1 - \bar{X}_2)}}.$$
The standard error in the denominator is given by
$$\sqrt{a' S a \left( \frac{1}{n_1} + \frac{1}{n_2} \right)},$$
where
$$S = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}$$
is the pooled within covariance matrix. Therefore,
$$t_a^2 = \left( \frac{n_1 n_2}{n_1 + n_2} \right) \frac{a'(\bar{X}_1 - \bar{X}_2)(\bar{X}_1 - \bar{X}_2)' a}{a' S a}.$$

The objective is to maximize $t_a^2$ over a. This is the generalized eigenvalue problem given by
$$B a = c\, S a,$$
where
$$B = \left( \frac{n_1 n_2}{n_1 + n_2} \right) (\bar{X}_1 - \bar{X}_2)(\bar{X}_1 - \bar{X}_2)'.$$
The solution a is proportional to
$$S^{-1}(\bar{X}_1 - \bar{X}_2).$$
Therefore, the sample discriminant variable is given by
$$U = (\bar{X}_1 - \bar{X}_2)' S^{-1} X.$$
It is the linear combination of the original variables which has the largest ratio of the between-groups to the within-groups variation.

The maximum of $t_a^2$ is attained at
$$c = \left( \frac{n_1 n_2}{n_1 + n_2} \right) (\bar{X}_1 - \bar{X}_2)' S^{-1} (\bar{X}_1 - \bar{X}_2),$$


which is the eigenvalue associated with the eigenvector a. Note that B has rank 1 and thus there is only one positive eigenvalue.

The test statistic is based on c and is called Hotelling's $T^2$ statistic, which is proportional to an F statistic. $(\bar{X}_1 - \bar{X}_2)' S^{-1} (\bar{X}_1 - \bar{X}_2)$ is called the Mahalanobis $D^2$ statistic.

The hypothesis is rejected if
$$\frac{n_1 + n_2 - q - 1}{q(n_1 + n_2 - 2)}\, T^2 > F_{\alpha;\, q,\, n_1 + n_2 - q - 1}.$$
If the hypothesis is rejected, the researcher generally wants to determine which variables led to the rejection. This will be delayed until the general g-sample problem is discussed.

8.1.2 The g-sample Problem

If Y has g categories, then g groups $(G_1, G_2, \ldots, G_g)$ are defined. Let $n_i$ be the number of observations in the ith group and $n = \sum n_i$.

The sample mean vector for the ith group is given by $\bar{X}_i$ and the sample covariance matrix by $S_i$ (i = 1, 2, . . . , g). The grand (overall) mean vector is given by
$$\bar{X} = \frac{\sum n_i \bar{X}_i}{n}.$$
The pooled within-group covariance matrix is defined by
$$W = \frac{1}{n - g} \sum (n_i - 1) S_i.$$
The between-groups covariance matrix is
$$B = \frac{1}{g - 1} \sum n_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})'$$
or possibly
$$B^\star = \frac{1}{g - 1} \sum (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})'.$$

Now consider testing

H0 : µ1 = µ2 = · · · = µg.

X is assumed to be multivariate normal within each group with a common covariance matrix, i.e., $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_g = \Sigma$, but possibly with different means. The assumption of multivariate normality should be assessed based on the multivariate residuals from the one-way multivariate analysis of variance model, as determined in the next chapter.

Assessing the equality of covariance matrices can be done formally, but this test is not powerful. However, if the covariances, i.e., the $\Sigma_i$, are equal, their q-dimensional ellipsoidal representations will have the same orientation and size.


This can be assessed by performing a PCA on each group sample covariance matrix, i.e., the $S_i$. If the corresponding eigenvalues are approximately equal, then the sizes of the ellipsoids are approximately the same. If the corresponding eigenvectors are approximately equal, then the orientations of the ellipsoids are approximately the same.

The test statistic for the above hypothesis is now developed assuming the underlying assumptions are met. Consider the linear combination
$$U = a'X,$$
which maximally discriminates among the g groups in the sense that
$$F_a = \frac{a' B a}{a' W a}$$
is maximized. The numerator is the between-group and the denominator is the within-group mean square along U. This is the generalized eigenvalue problem
$$B a = c\, W a.$$
The vector a which maximizes $F_a$ is the eigenvector corresponding to the largest eigenvalue $c_1$. Thus,
$$U_1 = a_1' X = a_{11} X_1 + a_{12} X_2 + \cdots + a_{1q} X_q$$
is the linear combination which maximally differentiates among the g groups.

Once this dimension is found, assuming g > 2 and q > 1, a second dimension corresponding to the eigenvector $a_2$ associated with the second largest eigenvalue $c_2$ is found:
$$U_2 = a_2' X = a_{21} X_1 + a_{22} X_2 + \cdots + a_{2q} X_q.$$
This process is continued for the s positive eigenvalues
$$c_1 \ge c_2 \ge \cdots \ge c_s > 0,$$
where $s = \min(g - 1, q)$. $U_1, U_2, \ldots, U_s$ are called the sample discriminant variables and they span an s-dimensional space called the discriminant space. If some of the $c_i \approx 0$, not all dimensions of this space will be significant discriminators.

The eigenvectors satisfy
$$a_j' W a_k = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise.} \end{cases}$$
Thus the discriminant variables are not orthogonal according to the identity (I) metric, i.e.,
$$a_j' a_k \ne 0, \quad j \ne k,$$


but rather are orthogonal according to W.

The construction of the discriminant variables is illustrated by Fisher's iris data.

> library(mult)

> data(iris)

> iris[1:3,]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

> iris.disc <- disc(iris[,1:4], iris[,5])

The coefficients of the discriminant variables are given by:

> iris.disc$A

                  Disc1        Disc2
Sepal.Length -0.2087418  0.006531964
Sepal.Width  -0.3862037  0.586610553
Petal.Length  0.5540117 -0.252561540
Petal.Width   0.7073504  0.769453092

Based on the coefficient signs, the first discriminant variable appears to be a contrast between the sepal and the petal dimensions. The second discriminant variable is primarily a width dimension. However, these statements should be moderated by examining the significance of the discriminant dimensions and the correlations between the discriminant and original variables.

Next, we consider the effective dimensionality of the discriminant space. The test statistics are based on certain functions of the eigenvalues of the generalized eigenvector problem:
$$S_h a = d\, S_e a,$$
where $S_h = (g-1)B$ is the hypothesis sum-of-squares and cross-products matrix and $S_e = (n-g)W$ is the error (or residual) sum-of-squares and cross-products matrix. This generalized eigenvalue problem is equivalent to that based on mean squares since
$$d = \frac{a' S_h a}{a' S_e a} = \frac{(g-1)\, a' B a}{(n-g)\, a' W a} = \frac{(g-1)}{(n-g)}\, c.$$
Thus the eigenvectors are equivalent for both problems, and hence the discriminant spaces are identical, and the eigenvalues are proportional, i.e.,
$$d_i = \frac{(g-1)}{(n-g)}\, c_i,$$

which simply represents the different scaling.


Let $\delta_i$ be the theoretical eigenvalues corresponding to the $d_i$. An overall hypothesis that the discriminant space differentiates among the g groups is:
$$H_0: \delta_1 = \delta_2 = \cdots = \delta_s = 0.$$
The test statistic, derived from the likelihood ratio test, is given by:
$$V = \left[(n-1) - \tfrac{1}{2}(q + g)\right] \sum_{i=1}^{s} \log(1 + d_i),$$
which is approximately distributed as a $\chi^2$ with $q(g-1)$ degrees of freedom. The discriminant variables are uncorrelated; thus, we can test the significance of each discriminant dimension. The hypothesis that $\delta_j = 0$ is tested by:
$$V_j = \left[(n-1) - \tfrac{1}{2}(q + g)\right] \log(1 + d_j),$$
which is approximately distributed as a $\chi^2$ with $q + g - 2j$ degrees of freedom.

Now consider testing the hypothesis that the last $s - t$ eigenvalues are zero, i.e.,
$$H_0: \delta_{t+1} = \delta_{t+2} = \cdots = \delta_s = 0.$$
Under the null hypothesis,
$$V_t = \sum_{i=t+1}^{s} V_i = \left[(n-1) - \tfrac{1}{2}(q+g)\right] \sum_{i=t+1}^{s} \log(1 + d_i)$$
has an approximate $\chi^2$-distribution with $(q-t)(g-1-t)$ degrees of freedom. This tests for the residual discrimination after partialing out the first t discriminant variables. Testing is generally done sequentially for t = 0, 1, . . . , s − 1 until the null hypothesis is accepted. Alternately, the test can be conducted sequentially for t = s − 1, s − 2, . . . , 0 until the null hypothesis is rejected. Note that t = 0 corresponds to the test statistic V above. This approach is similar to determining the canonical variate dimensionality.
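As a sketch of how these sequential statistics are formed from the eigenvalues (disc.dim.test below is hypothetical, not the mult package's disc()):

disc.dim.test <- function(d, n, q, g) {
  s  <- length(d)                                 # number of positive eigenvalues
  t  <- 0:(s - 1)
  const <- (n - 1) - (q + g) / 2
  Vt <- const * rev(cumsum(rev(log(1 + d))))      # residual statistic for each t
  df <- (q - t) * (g - 1 - t)
  data.frame(t = t, Vt = Vt, df = df,
             p.value = pchisq(Vt, df, lower.tail = FALSE))
}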

> iris.disc$d

[1] 32.191929 0.285391

> iris.disc$Dp

[1] 0.991212605 0.008787395

> iris.disc$Vt

[1] 546.11530 36.52966

> iris.disc$p.values


[1] 8.870785e-113 5.786050e-08

Note that the first discriminant variable accounts for over 99% of the among-group variation in discriminant space. Despite this, both discriminant dimensions are significant, although group differences are overwhelmingly established along the first discriminant dimension.

If $H_0$ is rejected, it is generally useful to examine the dimensions along which discrimination takes place. Alternatively, it may also be desirable to examine subhypotheses involving the means.

8.2 Interpreting Discriminant Variables

The ith discriminant variable is defined by

Ui = a′iX.

The $U_i$ are also called canonical variables; the space spanned by $U_1, U_2, \ldots, U_t$ $(t \le s)$ is called the discriminant space of dimension t.

The coordinates of the n observations in discriminant space are:
$$U = X A_t,$$
where $A_t$ is the $q \times t$ matrix of normalized eigenvectors ($a_i' W a_j = \delta_{ij}$) associated with the t largest eigenvalues. Frequently, we plot $U_i$ versus $U_j$ for $i < j \le t$ for the smallest dimension t such that
$$\frac{\sum_{i=1}^{t} d_i}{\sum_{i=1}^{s} d_i} \approx 1.$$
The plots show the group separations.

The meaning attached to $U_i$ is derived from the $a_i$. The coefficients $a_{ij}$ cannot be used directly since they are dependent on the scale of the $X_j$. Traditionally, the importance of $X_j$ to $U_i$ is measured by $\mathrm{corr}(U_i, X_j)$, not by $a_{ij}$:
$$\mathrm{corr}(U_i, X_j) = \frac{a_i' W 1_j}{\sqrt{1 \cdot w_{jj}}} = \frac{a_i' (w_{1j}, \ldots, w_{qj})'}{\sqrt{w_{jj}}}.$$


It is easy to develop the correlation matrix between $U' = (U_1, U_2, \ldots, U_s)$ and X. Let $A_{q \times s} = (a_1, a_2, \ldots, a_s)$ have as columns the scaled eigenvectors, i.e., $a_i' W a_j = \delta_{ij}$. Then:
$$\mathrm{cov}(U, X) = A' W I = A' W$$
and
$$\mathrm{corr}(U, X) = A' W D_{1/\sqrt{w_{ii}}},$$
where
$$D_{1/\sqrt{w_{ii}}} = \begin{bmatrix} 1/\sqrt{w_{11}} & 0 & \cdots & 0 \\ 0 & 1/\sqrt{w_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sqrt{w_{qq}} \end{bmatrix}$$
and $w_{ii}$ is the ith diagonal element of W.

> iris.disc$Dcor

                 Disc1      Disc2
Sepal.Length  0.7918878 0.21759312
Sepal.Width  -0.5307590 0.75798931
Petal.Length  0.9849513 0.04603709
Petal.Width   0.9728120 0.22290236

Based on correlations, differentiation among the species along the first discriminant variable is stronger along the petal dimensions, although the sepal dimensions now show a contrast. The second discriminant variable is still associated with the width variables.

An alternative method of interpreting the $a_{ij}$ is sometimes used. Let
$$U_i = a_i' X = a_{i1} X_1 + a_{i2} X_2 + \cdots + a_{iq} X_q = (a_{i1} w_1)\frac{X_1}{w_1} + (a_{i2} w_2)\frac{X_2}{w_2} + \cdots + (a_{iq} w_q)\frac{X_q}{w_q},$$
where $w_i = \sqrt{w_{ii}}$. Thus
$$a_i^{\star\prime} = (a_{i1} w_1,\ a_{i2} w_2,\ \cdots,\ a_{iq} w_q)$$
are the coefficients in terms of the “standardized” variables.

Example Using lda

Discriminant analysis will be illustrated using Fisher's Iris data based on the lda function in MASS.

> library(MASS)

> attach(iris)

> iris.vars <- cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

> iris.species <- as.factor(Species)

> (iris.lda <- lda(iris.vars, iris.species))


Call:
lda(iris.vars, grouping = iris.species)

Prior probabilities of groups:
    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

Proportion of trace:
   LD1    LD2
0.9912 0.0088

> iris.discvar <- predict(iris.lda, dimen=2)$x

> eqscplot(iris.discvar, type="n", xlab = "Disc1", ylab = "Disc2")

> text(iris.discvar, labels = as.character(substring(iris.species, 1, 2)), col = 3 + 2*unclass(iris.species), cex = 0.5)


[Plot of the observations in discriminant space (Disc1 vs. Disc2), labeled by species: se = setosa, ve = versicolor, vi = virginica.]

> plot(iris.lda, dimen=1)

> plot(iris.lda, type = "density", dimen=1)


[Histograms of the first discriminant variable for each group: setosa, versicolor, and virginica.]

8.2.1 Stepwise Discriminant Analysis

The discriminant model contains q predictors, but perhaps not all of these variables discriminate among the g groups. A stepwise procedure can be implemented to select the important discriminators using partial F-tests. The basic idea is to use each predictor as the response in a one-way analysis of covariance in which the grouping variable is the factor and the other predictor variables act as covariates. It is not reasonable to just use each predictor as an outcome variable in a one-way ANOVA since this does not take into account the correlation among the predictors.

The stepwise procedure proceeds as follows:

1. Model: $X_j \sim G$, j = 1, 2, . . . , q. Select the $X_j$ with the highest F > F-to-enter. Assume $X_k$ is selected.

2. Model: $X_j \sim X_k + G$, $j \ne k$. Select the $X_j$ with the highest F > F-to-enter, where the F is associated with the G factor. Assume $X_i$ is selected.

3. Model: $X_k \sim X_i + G$. If F < F-to-remove, where the F is associated with the G factor, then remove $X_k$.

4. Continue entering and removing variables until none are entered or removed.


8.3 Classification

Classification is the problem of assigning an object to one of g groups. The object is assigned to the group to which it is closest by some metric. Let $x' = (x_1, x_2, \ldots, x_q)$ be the q-dimensional observation to be assigned; typically, x is a new observation with unknown class membership. The generalized squared distance of x to the ith group mean, $\bar{X}_i$, is:
$$D^2(i) = (x - \bar{X}_i)' M (x - \bar{X}_i),$$
where M is positive semi-definite (p.s.d.), which ensures $D^2(i) \ge 0$, i = 1, 2, . . . , g, i.e., M is a metric.

Three choices for M are commonly used.

1. M = I results in squared Euclidean (spherical) distances. The Euclidean distance is rarely used since it does not incorporate either the scales of measurement of the q predictors or their correlations into the classification.

2. M diagonal results in ellipsoidal distances with the principal axes parallel to the coordinate axes. This metric does not incorporate the correlations among the predictors and thus is not used unless the predictors are known to be orthogonal.

3. M p.s.d. results in ellipsoidal distances without constraints. This is the basis of the metrics commonly used in discriminant classification.

The following choices for M p.s.d. are based on the pooled within covariance estimate, the individual within covariance estimates, and metrics in the discriminant space:

1. $M = W^{-1}$, where W is the pooled within covariance matrix estimate, is the most commonly used metric. Thus, M remains constant from group to group and it is equivalent to performing a linear discriminant analysis (lda). This approach is appropriate if the covariance matrices, i.e., the $\Sigma_i$, are homogeneous from group to group.

For the g-group problem, the distances are:
$$D_1^2(i) = (x - \bar{X}_i)' W^{-1} (x - \bar{X}_i), \quad i = 1, 2, \ldots, g,$$
where $(n - g) \ge q$. x is assigned to group k, i.e., $G_k$, if
$$D_1^2(k) = \min_{i=1,2,\ldots,g} D_1^2(i).$$

This is also the maximum likelihood classifier.

In the two-group problem, assign x to $G_1$ if
$$D_1^2(1) - D_1^2(2) \le 0,$$
and to $G_2$ otherwise. The set $\{x \mid D_1^2(1) - D_1^2(2) = 0\}$ defines a hyperplane in the q-dimensional predictor space, which partitions the predictor space into a $G_1$ decision region


and a $G_2$ region. More generally, $\binom{g}{2}$ hyperplanes partition the q-dimensional predictor space into g regions.

The classification problem is often expressed within a Bayesian framework if prior probabilities can be assigned to the groups or classes. Let $p(i) = \pi_i$ be the prior probability of selecting class i. An empirical approach is to use $p_i = n_i/n$ if the sampling is done randomly over all groups.

In this case the decision function becomes:
$$D_1^2(k) - 2\log \pi_k = \min_{i=1,2,\ldots,g}\ \{D_1^2(i) - 2\log \pi_i\}.$$
For the two-group problem this becomes:
$$D_1^2(1) - D_1^2(2) \le 2 \log \frac{\pi_1}{\pi_2}.$$
Notice that these expressions reduce to the previous expressions if the prior probabilities are equal.
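A minimal R sketch of this rule is shown below (d1.classify is hypothetical and is not the mult package's function); it pools the within-group covariance matrices and classifies a single observation x.

d1.classify <- function(x, X, grp, priors = table(grp) / length(grp)) {
  grp <- as.factor(grp)
  Xs  <- split(as.data.frame(X), grp)                 # data split by group
  n   <- nrow(X); g <- length(Xs)
  means <- lapply(Xs, colMeans)                       # group mean vectors
  W <- Reduce(`+`, lapply(Xs, function(Xi) (nrow(Xi) - 1) * cov(Xi))) / (n - g)
  d2 <- sapply(names(Xs), function(k)
    mahalanobis(x, means[[k]], W) - 2 * log(priors[[k]]))
  names(which.min(d2))                                # group minimizing D_1^2(i) - 2 log(pi_i)
}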

Using the lda method, the D1-based confusion matrix for the iris data is given by:

> iris.disc$Pooled.Class

            pred1
y            setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49

Only three observations are misclassified using lda. Specifically, two versicolor irises are classified as virginica and one virginica iris is classified as a versicolor.

2. $M = S_i^{-1}$, where $S_i$ is the within covariance matrix estimate of $\Sigma_i$, i.e., the dispersion estimate is internal to the group being computed. This is equivalent to performing a quadratic discriminant analysis (qda). This approach is useful if homogeneity of covariance matrices cannot be assumed.

For the g-group problem, the distances are:
$$D_2^2(i) = (x - \bar{X}_i)' S_i^{-1} (x - \bar{X}_i), \quad i = 1, 2, \ldots, g,$$
where $(n_i - g) \ge q$. x is assigned to group k, i.e., $G_k$, if
$$D_2^2(k) = \min_{i=1,2,\ldots,g} D_2^2(i),$$
as before. The likelihood-ratio classifier, assuming multivariate normality, is given by:
$$D_2^2(k) + \log|S_k| = \min_{i=1,2,\ldots,g} \{D_2^2(i) + \log|S_i|\}.$$

Using the qda method, the D2-based confusion matrix for the iris data is given by:


> iris.disc$Within.Class

            pred2
y            setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          0        50

Three observations are also misclassified using qda. Specifically, three versicolor irises are classified as virginica.

3. $M = A_t A_t'$, where $A_t$ has as columns the t eigenvectors associated with $c_1 \ge c_2 \ge \ldots \ge c_t$ for $t \le s$. This classification is based on the t-dimensional discriminant subspace from the lda analysis. If t = s, $D_1^2(\cdot) = D_3^2(\cdot)$.

The distances in t-dimensional discriminant space are given by:
$$D_3^2(i) = (x - \bar{X}_i)' A_t A_t' (x - \bar{X}_i) = (A_t' x - A_t' \bar{X}_i)'(A_t' x - A_t' \bar{X}_i).$$
Notice that the metric in discriminant space is the simple Euclidean distance. The metric does not depend on the group, but it does depend on the number of eigenvalues t used. This can be determined by testing the dimensionality of the discriminant space as discussed previously.

Using the discriminant approach, the D3-based confusion matrix for the iris data is given by:

> iris.disc$Disc.Class

            pred3
y            setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49

Since the full discriminant space is used, this method is equivalent to lda.


Chapter 9

Multivariate General Linear Models

9.1 Multivariate Regression

9.2 Multivariate Analysis of Variance

The one-way multivariate analysis of variance (MANOVA) is equivalent to performing an lda. However, the X variables now become the Y variables and the classification variable becomes the categorical explanatory variable X. Specifically, we are interested in the following hypothesis:
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_g.$$
The SSCP matrix associated with the hypothesis, $S_h$, and the SSCP error matrix, $S_e$, are computed first. The test statistics are based on certain functions of the eigenvalues of the following generalized eigenvector problem:
$$S_h a = d\, S_e a.$$

Three tests are commonly used:

1. Wilk’s likelihood ratio test is given by

Λ =r∏i=1

1(1 + di)

∼ U(q, g − 1, n− g).

Reject H0 if Λ < Uα(q, g − 1, n − g), i.e., if at least some eigenvalues arelarge.

2. Roy’s maximum root is given by

θ1 =d1

1 + d1∼ θ(r,|(g − 1)− q| − 1

2,

(n− g)− q − 12

)Reject for θ1 > θα, i.e., if the largest eigenvalue is large.


3. Hotelling-Lawley’s trace is given by

r∑i=1

di ∼ U0

(r,|(g − 1)− q| − 1

2,

(n− g)− q − 12

).

Reject if∑di > Uα0 , i.e., if at least some eigenvalues are large.

Generally, it is wise to look at all 3 tests for the equality of means. I preferWilk’s Λ unless d1 ≫ d2 ≈ d3 ≈ · · · ≈ dr ≈ 0. In this later case, Roy’s test ismore powerful and discrimination only takes place along a single dimension.
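In R, these criteria are available through manova(); for example, on the iris data with the four measurements as responses:

fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
              data = iris)
summary(fit, test = "Wilks")             # Wilks's lambda
summary(fit, test = "Roy")               # Roy's maximum root
summary(fit, test = "Hotelling-Lawley")  # Hotelling-Lawley trace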

9.3 Repeated Measures


Chapter 10

Exploratory Projection Pursuit

10.1 Basic Concepts

10.2 Projection Pursuit Indices

10.3 Projection Pursuit Guided Tours


Chapter 11

Cluster Analysis

The essence of clustering is to find groupings of n units such that the units within groups are more similar than units across groups. We will study the following types of clustering:

1. Agglomerative hierarchical methods

2. Divisive hierarchical methods

3. Non-hierarchical or partitioning methods

The methods discussed in this chapter assume that the clustering is done on an n × p multivariate data set. The techniques can be applied to any similarity (e.g., correlation, χ2) or dissimilarity (distance measure) matrix computed between the pairs of objects to be clustered. However, the methods in this chapter are only illustrated on distance measures.

A dissimilarity measure is:

1. symmetric, i.e., d(yi,yj) = d(yj ,yi);

2. non-negative, i.e., d(yi,yj) ≥ 0, and d(yi,yi) = 0.

Dissimilarities may be metric, i.e., satisfy the triangle inequality,

d(yi,yk) ≤ d(yi,yj) + d(yj ,yk)

or ultrametric, i.e.,

d(yi,yj) ≤ max(d(yi,yk), d(yj ,yk)).

Agglomerative methods (see below) result in a binary tree structure, or dendrogram; this representation reproduces the dissimilarities exactly if the distance measure is ultrametric. The Euclidean distance is most commonly used.


11.1 Agglomerative Hierarchical Clustering

Agglomerative clustering algorithms are based on a dissimilarity measure for the n observations or a similarity measure for the p variables. Clustering the observations is far more common except, e.g., in genomic analyses.

Agglomerative clustering starts with an n × n symmetric distance matrix D defined by:
$$D = [d_{ij}],$$
where $d_{ij} = d(y_i, y_j)$.

In hierarchical clustering, a nested sequence of clusterings is found, beginning with a weak clustering (one unit per cluster) and ending with a strong clustering (a single cluster). Let $C_i$ represent a clustering. Then a sequence is found such that
$$C_1 \prec C_2 \prec \cdots \prec C_m,$$
where m equals n or p depending on whether the observations or variables are being clustered. Each clustering partitions the data such that each data point falls in exactly one partition. Corresponding to each clustering $C_i$ is a measure of the strength of that clustering, $\alpha_i$. The $\alpha_i$ values are monotonically related to the clustering as follows:

α1 ≤ α2 ≤ · · · ≤ αm

for distance-type measures, or

α1 ≥ α2 ≥ · · · ≥ αm

for correlation-type measures.

Why are there many clustering strategies? Suppose data points x and y are joined in step 1 to form the cluster [x, y]. How is the distance d([x, y], z) computed? Three possibilities are:

1. the minimum or single-linkage method, in which the dissimilarity matrix is updated by d([x, y], z) = min[d(x, z), d(y, z)];

2. the maximum or complete-linkage method, in which the dissimilarity matrix is updated by d([x, y], z) = max[d(x, z), d(y, z)];

3. the average-linkage method, in which the dissimilarity matrix is updated by d([x, y], z) = [d(x, z) + d(y, z)]/2.

All agglomerative hierarchical methods follow the same basic algorithm:

1. Start with the weak clustering consisting of one unit per cluster.

2. Join the nearest pair of clusters, i.e., join the pair with minimum d(·, ·), and update the distance matrix as described above.

3. Stop when the number of clusters is 1; otherwise go to step 2.


Example: An example is now presented involving selected crime data for several American cities. The small sample size and number of crime variables allow verification by hand calculation.

> city.cr <- matrix(c(13.2, 34.9, 564, 10.5, 19.9, 665, 9.3, 15.2,

+ 173, 4.4, 14, 145, 11.4, 23, 504), 5, 3, byrow = T)

> rownames(city.cr) <- c("Baltimore", "New York", "Philadelphia",

+ "Pittsburgh", "Washington")

> colnames(city.cr) <- c("Murder", "Rape", "Robbery")

> city.cr

             Murder Rape Robbery
Baltimore      13.2 34.9     564
New York       10.5 19.9     665
Philadelphia    9.3 15.2     173
Pittsburgh      4.4 14.0     145
Washington     11.4 23.0     504

The rates are crimes per 100,000, but the mean occurrences vary widely depending on the crime. Thus, the data matrix should be scaled (standardized).

> city.sc <- scale(city.cr)

> (D <- dist(city.sc))

             Baltimore  New York Philadelphia Pittsburgh
New York     2.0139600
Philadelphia 3.1067223 2.1838697
Pittsburgh   4.0506618 2.9509033    1.4888540
Washington   1.5426868 0.8207984    1.7960618  2.8126761

Single Linkage

> h.single <- hclust(D, method = "single")

> plclust(h.single, ann = FALSE, main = "Single Linkage")


[Single-linkage dendrogram of the five cities.]

Complete Linkage

> h.complete <- hclust(D, method = "complete")

> plclust(h.complete, ann = FALSE, main = "Complete Linkage")


[Complete-linkage dendrogram of the five cities.]

Average Linkage

> h.average <- hclust(D, method = "average")

> plclust(h.average, ann = FALSE, main = "Average Linkage")


[Average-linkage dendrogram of the five cities.]

Notice that the tree structures for single, complete, and average linkage are identical, but the merging levels differ.

11.2 Measures of Fit

Let D be the original dissimilarity matrix, i.e., $d_{ij}$ is the distance between unit i and unit j. Further, let $\hat{D}$ be the derived dissimilarity matrix, i.e., $\hat{d}_{ij} = \min \alpha_k$ such that unit i and unit j are joined in the dendrogram.

Two commonly used measures of fit are:

1. measures of distance between $D$ and $\hat{D}$;

2. the cophenetic correlation.

$D - \hat{D}$ defines the residuals of the clustering “model,” i.e., the $d_{ij} - \hat{d}_{ij}$. The $d_{ij}$ can be viewed as the observed values and the $\hat{d}_{ij}$ can be viewed as the fitted values. The distance between $D$ and $\hat{D}$ can be defined by the p-norms, i.e.,
$$\|D - \hat{D}\|_p = \Big(\sum_{i<j} |d_{ij} - \hat{d}_{ij}|^p\Big)^{1/p}.$$
A special case is the Euclidean distance defined by
$$\|D - \hat{D}\| = \sqrt{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2},$$


and a weighted version is given by
$$\|D - \hat{D}\|_w = \sqrt{\sum_{i<j} w_{ij}\, (d_{ij} - \hat{d}_{ij})^2}.$$
The smaller the value of the p-norm, the better the fit.

Several clustering methods can be compared in terms of their fit by using, e.g., the Euclidean norm. Additional information is available if the $d_{ij}$ values are plotted against the $\hat{d}_{ij}$, i.e., the observed vs. the fitted, or if the residuals, $d_{ij} - \hat{d}_{ij}$, are plotted against the $\hat{d}_{ij}$. The idea is to look for patterns to determine where the fit is poor.

The second measure of the fit of the clustering is the cophenetic correlation. It is simply the standard correlation between the $d_{ij}$ and the $\hat{d}_{ij}$, i.e.,
$$r_c = \frac{\sum d_{ij}\hat{d}_{ij} - \big(\sum d_{ij}\big)\big(\sum \hat{d}_{ij}\big)\big/\binom{n}{2}}{\sqrt{\Big(\sum d_{ij}^2 - \big(\sum d_{ij}\big)^2\big/\binom{n}{2}\Big)\Big(\sum \hat{d}_{ij}^2 - \big(\sum \hat{d}_{ij}\big)^2\big/\binom{n}{2}\Big)}}.$$

Example (cont.): The merge component from the returned values of the function hclust() provides the clusters merged at step i, i = 1, 2, . . . , n − 1, and the height component provides the level at which the clusters were merged for each i. This information can be used to construct the derived dissimilarity matrix $\hat{D}$.

Single Linkage

> h.single.history <- cbind(h.single$merge, h.single$height)

> colnames(h.single.history) <- c("leader", "joiner", "distance")

> rownames(h.single.history) <- paste("C", 1:(nrow(city.cr) - 1))

> h.single.history

    leader joiner  distance
C 1     -2     -5 0.8207984
C 2     -3     -4 1.4888540
C 3     -1      1 1.5426868
C 4      2      3 1.7960618

If j in the history table is negative, then the single element −j was merged in this step. If j is positive, then the merge was with cluster j from an earlier step. The distances in the table are the $\alpha_i$ values from above. The derived distances $\hat{D}$ are determined from the last column in the table.

> D.single <- c(1.5426868, 1.7960618, 1.7960618, 1.5426868, 1.7960618,

+ 1.7960618, 0.8207984, 1.488854, 1.7960618, 1.7960618)

The Euclidean norm of the “residuals” is given by:

> sqrt(sum((as.vector(D) - D.single)^2))

[1] 3.088804


The cophenetic correlation is:

> cor(as.vector(D), D.single)

[1] 0.7250909

Complete Linkage

> h.complete.history <- cbind(h.complete$merge, h.complete$height)

> colnames(h.complete.history) <- c("leader", "joiner", "distance")

> rownames(h.complete.history) <- paste("C", 1:(nrow(city.cr) -

+ 1))

> h.complete.history

    leader joiner  distance
C 1     -2     -5 0.8207984
C 2     -3     -4 1.4888540
C 3     -1      1 2.0139600
C 4      2      3 4.0506618

> D.complete <- c(2.01396, 4.0506618, 4.0506618, 2.01396, 4.0506618,

+ 4.0506618, 0.8207984, 1.488854, 4.0506618, 4.0506618)

The Euclidean norm of the “residuals” is given by:

> sqrt(sum((as.vector(D) - D.complete)^2))

[1] 3.524675

The cophenetic correlation is:

> cor(as.vector(D), D.complete)

[1] 0.7763732

Average Linkage

> h.average.history <- cbind(h.average$merge, h.average$height)

> colnames(h.average.history) <- c("leader", "joiner", "distance")

> rownames(h.average.history) <- paste("C", 1:(nrow(city.cr) -

+ 1))

> h.average.history

    leader joiner  distance
C 1     -2     -5 0.8207984
C 2     -3     -4 1.4888540
C 3     -1      1 1.7783234
C 4      2      3 2.8168158

> D.average <- c(1.7783234, 2.8168158, 2.8168158, 1.7783234, 2.8168158,

+ 2.8168158, 0.8207984, 1.488854, 2.8168158, 2.8168158)


The Euclidean norm of the “residuals” is given by:

> sqrt(sum((as.vector(D) - D.average)^2))

[1] 1.782702

The cophenetic correlation is:

> cor(as.vector(D), D.average)

[1] 0.7815354
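Note that the derived distances can also be obtained directly with cophenetic() from the stats package rather than being built by hand from the merge history; for example, for the average-linkage tree:

cor(as.vector(D), as.vector(cophenetic(h.average)))              # cophenetic correlation
sqrt(sum((as.vector(D) - as.vector(cophenetic(h.average)))^2))   # Euclidean norm of the residuals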

11.3 Other Agglomerative Methods

Ward’s method optimizes an objective function, generally the within error sumof squares. This method is only applicable for distance-type measures. Ward’smethod is commonly used since a statistical objective function is optimized.

> h.ward <- hclust(D, method = "ward")

> plclust(h.ward, ann = FALSE, main = "Ward's Method")

[Ward's-method dendrogram of the five cities.]


The centroid method has an intuitive appeal since clusters are merged with the nearest centroids. However, it too is only suitable with distance-type measures. Further, this method is subject to reversals, i.e., the ultrametric is violated, and thus is rarely used.

> h.centroid <- hclust(D, method = "centroid")

> plclust(h.centroid, ann = FALSE, main = "Centroid Method")

[Centroid-method dendrogram of the five cities.]

11.4 Divisive Hierarchical Clustering

11.5 Non-hierarchical Clustering

Non-hierarchical clustering classifies n data units into k clusters, where k is pre-specified. Two principal approaches are the k-means algorithm and Iterative Self Organizing Data Analysis Technique A (ISODATA) by Ball and Hall.

11.5.1 K-means Clustering

Let $y_1, y_2, \ldots, y_n$ be the n observed data points. Let $x_1, x_2, \ldots, x_k$ be initial seed points. The seed points form the nuclei of the clusters $C_1, C_2, \ldots, C_k$. The data point $y_i$ is put into cluster $C_j$ if
$$\|y_i - x_j\| = \min_{a=1,\ldots,k} \|y_i - x_a\|,$$
i.e., if $y_i$ is closest to the jth seed point. At the end of the first step, we have k clusters: $C_1, C_2, \ldots, C_k$. It is possible that some clusters are empty and thus there can be fewer than k clusters. The choice of the initial seed points is critical in determining clusterings not only in the first stage, but also in the final stage. Average linkage is often used to get the initial seed points, but other choices are possible.

For each cluster, e.g., $C_r$, compute the cluster centroid $\bar{y}_r$ and the variance-covariance matrix $S_r$. Actually, only the mean vector is needed for the k-means procedure, but the covariance matrix is needed for the expanded and related ISODATA procedure. The $\bar{y}_r$ become the new seed points and the observations are formed into clusters using the above spherical (Euclidean) distances. This process is iterated until the cluster means do not change.

Example (cont): By cutting the average-method dendrogram, the following partition results:

> cutree(h.average, 3)

   Baltimore     New York Philadelphia   Pittsburgh   Washington
           1            2            3            3            2

The centroids of these clusters will be used as the seed points for the K-means method. These can be computed by:

> initial.points <- tapply(city.sc, list(rep(cutree(h.average,

+ 3), ncol(city.sc)), col(city.sc)), mean)

> dimnames(initial.points) <- list(NULL, dimnames(city.sc)[[2]])

> initial.points

         Murder        Rape   Robbery
[1,]  1.0370576  1.61280999  0.649966
[2,]  0.3587496  0.00597337  0.736600
[3,] -0.8772784 -0.81237837 -1.061583

> k.means <- kmeans(city.sc, initial.points)

> k.means$cluster

   Baltimore     New York Philadelphia   Pittsburgh   Washington
           1            2            3            3            2

> k.means$centers

      Murder        Rape   Robbery
1  1.0370576  1.61280999  0.649966
2  0.3587496  0.00597337  0.736600
3 -0.8772784 -0.81237837 -1.061583


Not surprisingly, the clusters found by K-means are identical to those found by cutting the average-linkage tree into three groups.

The clusters can be represented in PCA space.

> city.sc.pca <- princomp(city.sc)

> city.sc.pred <- predict(city.sc.pca)

> city.sc.centers <- predict(city.sc.pca, k.means$centers)

> plot(city.sc.pred[, 1:2], type = "n", xlab = "Canonical PCA 1",

+ ylab = "Canonical PCA 2")

> text(city.sc.pred[, 1:2], labels = k.means$cluster)

> points(city.sc.centers[, 1:2], pch = 3, cex = 3)

[Plot of the K-means clusters in the space of the first two principal components (Canonical PCA 1 vs. Canonical PCA 2), with the cluster centers marked.]

11.5.2 ISODATA

ISODATA (Iterative Self Organizing Data Analysis Technique A) starts as the K-means procedure does. However, the clusters are either split or lumped based on the distinctness and dispersion of the clusters. Cluster $C_r$ is split if

1. the largest eigenvalue of $S_r > \theta_s$, or

2. the maximum variance of $S_r > \theta_s$, i.e., $\max_{i=1,\cdots,k} s_{ir}^2 > \theta_s$,


where $\theta_s$ is a user-defined splitting parameter. On the other hand, if
$$\|\bar{y}_r - \bar{y}_s\| < \theta_l,$$
then lump $C_r$ and $C_s$. $\theta_l$ is the user-defined lumping parameter. This process is iterated until no further splitting or lumping occurs.

ISODATA is scale dependent since PCA depends on the scales of measurement. Thus the final clustering depends on the units of measurement. Friedman and Rubin suggest splitting according to the eigenvalues of $W^{-1}B$ to remove the scale dependency.

11.5.3 Assessment

The discreteness of the final clusters can be assessed by:
$$d_r = \frac{1}{n_r} \sum_{s=1}^{n_r} \|y_{rs} - \bar{y}_r\|,$$
i.e., $d_r$ is the average Euclidean distance of the $n_r$ points in cluster r from the cluster centroid.

Example (cont): The average Euclidean distances are:

> (d_r <- sqrt(k.means$withinss)/k.means$size)

[1] 0.0000000 0.2901961 0.5263894

> k.means$size

[1] 1 2 2

Note that cluster 1 only has one observation and thus has an average Euclidean distance of 0. The within variation appears to contradict the variation in the PCA plot. However, only the first two dimensions are plotted.

An overall measure of compactness is given by:
$$\bar{d} = \frac{1}{n} \sum_{r=1}^{l} n_r d_r.$$

> k.means$size %*% d_r/sum(k.means$size)

          [,1]
[1,] 0.3266342


Chapter 12

Multidimensional Scaling

Multidimensional scaling (MDS) is the problem of representing n objects geometrically by n points in a low-dimensional Euclidean space such that the interpoint distances correspond to the dissimilarities (or similarities) of the objects. The two approaches commonly used are:

1. metric MDS

2. nonmetric MDS

The principal hypothesis is that the interpoint distances in the low-dimensional space are linearly (classical MDS), non-linearly (Sammon MDS), or monotonically (nonmetric MDS) related to the dissimilarities (or similarities).

Multidimensional scaling is a distance-based method, as is cluster analysis. Either a dissimilarity (D) or similarity (S) matrix is computed from the basic n × p data set, which may or may not be standardized. The $\binom{n}{2}$ dissimilarities are computed between all pairs of data units. The n × n symmetric dissimilarity matrix is:
$$D = \begin{bmatrix} - & d_{12} & \cdots & d_{1n} \\ d_{21} & - & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & - \end{bmatrix} = [d_{jk}].$$
Only the $\binom{n}{2}$ dissimilarities in the upper or lower triangular part of D are required. A p × p similarity matrix can be defined analogously. The idea is to represent the n points in a q-dimensional space, where q is small, e.g., 2 or 3, such that their interpoint distances correspond to the dissimilarities.

More specifically, consider the following case:
$$D = \begin{bmatrix} - & d_{12} & d_{13} \\ & - & d_{23} \\ & & - \end{bmatrix} = \begin{bmatrix} - & 1 & 3 \\ & - & 2 \\ & & - \end{bmatrix},$$
i.e., $d_{12} < d_{23} < d_{13}$.


We have three objects which we want to represent in q-dimensional Euclidean space, where in this case q ≤ 2, such that the interpoint distances in q-space, the $\hat{d}_{ij}$, ‘correspond’ to the observed dissimilarities. The methods described below vary according to how the $\hat{d}_{ij}$ are determined.

12.1 Metric MDS

12.1.1 Classical MDS

The classical form of MDS is also known as principal co-ordinate analysis. In classical MDS, the distances among the n points in the q-dimensional space, computed as $\hat{d}_{ij} = \|x_i - x_j\|$, are chosen to minimize the following stress function:
$$S_{\mathrm{Classical}}(x_1, x_2, \ldots, x_n) = \frac{\sum_{i<j} (d_{ij}^2 - \hat{d}_{ij}^2)}{\sum_{i<j} d_{ij}^2}.$$
This problem can be solved exactly. Starting with D, compute
$$B = [b_{ij}] = \left[-\tfrac{1}{2}\,(d_{ij}^2 - d_{i.}^2 - d_{.j}^2 + d_{..}^2)\right],$$
which is the doubly-centered inner-product matrix (see Everitt). Let $l_1 > l_2 > \cdots > l_q$ be the q largest eigenvalues of B and let $V_q = [v_1, v_2, \ldots, v_q]$ be the corresponding eigenvectors. The coordinates of the points in q-dimensional space are then given by:
$$X_q = V_q L^{1/2},$$
where $L = \mathrm{diag}[l_1, l_2, \ldots, l_q]$.

One rule for determining the dimensionality q is:
$$\min_q \ \frac{\sum_{i=1}^{q} l_i}{\sum_{i=1}^{n-1} l_i} > 0.8.$$
Generally, $q \ll (n-1)$, the maximum dimensionality.
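The computation can be sketched directly from this construction (cmdscale(), used below, does this internally); classical.mds is a hypothetical helper assuming D is a dist object or symmetric distance matrix.

classical.mds <- function(D, q = 2) {
  D <- as.matrix(D)
  n <- nrow(D)
  J <- diag(n) - matrix(1/n, n, n)       # centering matrix
  B <- -0.5 * J %*% D^2 %*% J            # doubly-centered inner-product matrix
  e <- eigen(B, symmetric = TRUE)
  e$vectors[, 1:q] %*% diag(sqrt(e$values[1:q]), q)   # X_q = V_q L^(1/2)
}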

Iris Example The iris data, consisting of petal and sepal measurements on plants from three species, has group structure. Can classical MDS capture it? The built-in function cmdscale() performs classical MDS. The representation is computed for k = 1, 2, 3.

> library(MASS)

> data(iris)

> iris.vars <- iris[, -5]

> iris.species <- as.factor(iris[, 5])

> dist.c <- dist(iris.vars)

> iris.cmds1 <- cmdscale(dist.c, k = 1, eig = TRUE)

> iris.cmds2 <- cmdscale(dist.c, k = 2, eig = TRUE)

> iris.cmds3 <- cmdscale(dist.c, k = 3, eig = TRUE)


The representation in 2 space is now plotted.

> iris.cmds2$points[, 1] <- -iris.cmds2$points[, 1]

> eqscplot(iris.cmds2$points, type = "n", xlab = "", ylab = "")

> text(iris.cmds2$points, labels = as.character(substring(iris.species,

+ 1, 2)), col = 3 + 2 * unclass(iris.species), cex = 0.5)

[Two-dimensional classical MDS configuration of the iris data, labeled by species (se = setosa, ve = versicolor, vi = virginica).]

The 2-dimensional representation shows the same structure as the discriminant fit in two dimensions.

12.1.2 Least Squares MDS

Least squares, or Kruskal-Shepard metric scaling, approximates the distances $d_{ij}$, not the squared distances, by the low-dimensional representation. The stress to be minimized is:
$$S_{\mathrm{LS}}(x_1, x_2, \ldots, x_n) = \sum_{i<j} (d_{ij} - \hat{d}_{ij})^2.$$

12.1.3 Sammon’s MDS

Sammon developed an alternative form of MDS, which is a non-linear mapping of the n points into a q-dimensional space with $q \ll n$. The distances among the n points in the q-dimensional space, computed as $\hat{d}_{ij}$, are chosen to minimize the following stress function:
$$S_{\mathrm{Sammon}}(x_1, x_2, \ldots, x_n) = \sum_{i<j} \frac{(d_{ij} - \hat{d}_{ij})^2}{d_{ij}}.$$

The Sammon function puts more emphasis on reproducing small distances.

Iris Example (cont.) The function sammon() in the MASS package performs Sammon MDS. The representation is computed for k = 1, 2, 3.

> dist.s <- dist(iris.vars[-143, ])

> iris.smds1 <- sammon(dist.s, k = 1)

Initial stress        : 0.03752
stress after 10 iters : 0.02755, magic = 0.500
stress after 20 iters : 0.02727, magic = 0.500
stress after 30 iters : 0.02719, magic = 0.500

> iris.smds2 <- sammon(dist.s, k = 2)

Initial stress        : 0.00678
stress after 10 iters : 0.00404, magic = 0.500
stress after 12 iters : 0.00402

> iris.smds3 <- sammon(dist.s, k = 3)

Initial stress        : 0.00073
stress after 10 iters : 0.00035, magic = 0.500
stress after 20 iters : 0.00034, magic = 0.500

The representation in 2 space is now plotted.

> iris.smds2$points[, 1] <- -iris.smds2$points[, 1]

> eqscplot(iris.smds2$points, type = "n", xlab = "", ylab = "")

> text(iris.smds2$points, labels = as.character(substring(iris.species,

+ 1, 2)), col = 3 + 2 * unclass(iris.species), cex = 0.5)


[Two-dimensional Sammon MDS configuration of the iris data, labeled by species (se = setosa, ve = versicolor, vi = virginica).]

The 2-dimensional representation shows the same structure as the discriminant fit in two dimensions and is nearly identical to the classical MDS representation.

12.2 Nonmetric MDS

Shepard-Kruskal nonmetric scaling is a non-parametric method in that it is based on ranks. More specifically, a monotone regression is fit. This method is the most commonly used MDS method. Thus, a more detailed development is provided.

In nonmetric MDS, we want \tilde{d}_{ij} < \tilde{d}_{kl} whenever d_{ij} < d_{kl}, where the \tilde{d}_{ij} are the distances in the low-dimensional configuration. This is called the monotonicity condition. For similarities, we want \tilde{d}_{ij} < \tilde{d}_{kl} whenever s_{ij} > s_{kl}. We can always satisfy the monotonicity constraint exactly in n − 1 dimensions if there are n objects. However, the usefulness is in expressing the n objects in a low-dimensional space of dimension q. Then the monotonicity constraint is almost never satisfied exactly for large n.

The extent to which the monotonicity constraint is satisfied can be visualized in a Shepard diagram. It consists of plotting the \tilde{d}_{ij} versus the corresponding d_{ij}. If the points are connected, the line segments should go up and to the right. Switch-backs violate the monotonicity condition.

Let’s consider the case in which monotonicity is violated. Suppose

d_{23} < d_{12} < d_{34} < d_{13} < d_{24} < d_{14},


but, for the configuration distances,

\tilde{d}_{23} < \tilde{d}_{34} < \tilde{d}_{12} < \tilde{d}_{13} < \tilde{d}_{14} < \tilde{d}_{24}.

We want to fit a set of values \hat{d}_{ij} = \theta(d_{ij}), with \theta monotone, which satisfy the monotonicity condition, i.e.,

\hat{d}_{23} < \hat{d}_{12} < \hat{d}_{34} < \hat{d}_{13} < \hat{d}_{24} < \hat{d}_{14}.

This is the monotone regression problem: the \hat{d}_{ij} are fit so that the stress function

S_{NM}(x_1, x_2, \ldots, x_n) = \sqrt{\frac{\sum_{i<j} (\tilde{d}_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} \tilde{d}_{ij}^2}}

is minimized, subject to the \hat{d}_{ij} satisfying the monotonicity condition.

There may not exist n points in q-dimensional space which satisfy the monotonicity constraint. We thus move the points so that monotonicity is improved. If \hat{d}_{ij} < \tilde{d}_{ij}, then points i and j should be moved closer together. If \hat{d}_{ij} > \tilde{d}_{ij}, then points i and j should be moved further apart. Based on these conditions, points 1 and 2 should be moved closer together and points 3 and 4 should be moved further apart. Likewise, points 2 and 4 should be moved closer together and points 1 and 4 should be moved further apart.

The procedure is done iteratively.

1. Find an initial configuration of points in q-dimensional space (often classical MDS is used).

2. Compute the Euclidean interpoint distances, i.e., the \tilde{d}_{ij}.

3. Solve the monotone regression, i.e., find the \hat{d}_{ij}.

4. Adjust the points in q-space to better approximate monotonicity.

5. Repeat steps 2 to 4 until the stress is minimized.

If the initial configuration is too far off, convergence may not occur in a reasonable number of iterations. Further, a global minimum may not be found.

Iris Example (cont.) The function isoMDS() in the MASS package performs nonmetric MDS. The representation is computed for k = 1, 2, 3.

> dist.nm <- dist(iris.vars[-143, ])

> iris.nmds1 <- isoMDS(dist.nm, k = 1)

initial value 6.941616
final value 6.407773
converged

> iris.nmds2 <- isoMDS(dist.nm, k = 2)


initial value 3.024856
iter   5 value 2.638471
final value 2.579979
converged

> iris.nmds3 <- isoMDS(dist.nm, k = 3)

initial value 0.992692
iter   5 value 0.734600
final value 0.719514
converged

The representation in 2-space is now plotted.

> iris.nmds2$points[, 1] <- -iris.nmds2$points[, 1]

> eqscplot(iris.nmds2$points, type = "n", xlab = "", ylab = "")

> text(iris.nmds2$points, labels = as.character(substring(iris.species,

+ 1, 2)), col = 3 + 2 * unclass(iris.species), cex = 0.5)

[Figure: 2-dimensional nonmetric MDS configuration of the iris data, with points labeled by species (se = setosa, ve = versicolor, vi = virginica).]

The 2-dimensional representation shows the same structure as the discriminant fit in two dimensions. In addition, the nonmetric MDS representation is nearly identical to the classical and Sammon MDS representations.
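The monotone fit itself can be inspected with the Shepard diagram described above; a minimal sketch using Shepard() from the MASS package for the 2-dimensional nonmetric solution, plotting the configuration distances against the dissimilarities and overlaying the fitted disparities as a step function (output not shown):

> iris.sh <- Shepard(dist.nm, iris.nmds2$points)

> plot(iris.sh, pch = ".", xlab = "Dissimilarity", ylab = "Configuration distance")

> lines(iris.sh$x, iris.sh$yf, type = "S")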


12.3 Assessment and Interpretation of the Fit

12.3.1 Determining Dimensionality

How is q, the dimensionality, determined? In all methods, as q increases, the stress decreases if the initial configurations are chosen reasonably. An intuitive approach is to plot the stress S_q versus q in a scree plot. The dimensionality q is chosen at the elbow point.

The following plot displays the classical measure of stress for k = 1, 2, 3:

> dist.f1 <- dist(iris.cmds1$points)

> dist.f2 <- dist(iris.cmds2$points)

> dist.f3 <- dist(iris.cmds3$points)

> S.cl1 <- sum(dist.c^2 - dist.f1^2)/sum(dist.c^2)

> S.cl2 <- sum(dist.c^2 - dist.f2^2)/sum(dist.c^2)

> S.cl3 <- sum(dist.c^2 - dist.f3^2)/sum(dist.c^2)

> plot(1:3, c(S.cl1, S.cl2, S.cl3), xlab = "Dimension = k", ylab = "Classical Stress")

[Figure: scree plot of the classical stress versus dimension k = 1, 2, 3.]

The scree plot for classical scaling shows that a 2-dimensional representation provides a good fit, but k = 3 shows some improvement.


The following plot displays the Sammon measure of stress for k = 1, 2, 3:

> dist.sf1 <- dist(iris.smds1$points)

> dist.sf2 <- dist(iris.smds2$points)

> dist.sf3 <- dist(iris.smds3$points)

> S.s1 <- sum((dist.s - dist.sf1)^2)/sum(dist.s)

> S.s2 <- sum((dist.s - dist.sf2)^2)/sum(dist.s)

> S.s3 <- sum((dist.s - dist.sf3)^2)/sum(dist.s)

> plot(1:3, c(S.s1, S.s2, S.s3), xlab = "Dimension = k", ylab = "Sammon Stress")

[Figure: scree plot of the Sammon stress versus dimension k = 1, 2, 3.]

The scree plot for Sammon scaling shows that a 2-dimensional representation provides a good fit, whereas k = 3 shows little improvement.

The following plot displays the nonmetric measure of stress for k = 1, 2, 3:

> S.nm1 <- iris.nmds1$stress/100

> S.nm2 <- iris.nmds2$stress/100

> S.nm3 <- iris.nmds3$stress/100

> plot(1:3, c(S.nm1, S.nm2, S.nm3), xlab = "Dimension = k", ylab = "Nonmetric Stress")


[Figure: scree plot of the nonmetric stress versus dimension k = 1, 2, 3.]

The scree plot for nonmetric scaling shows that a 2-dimensional representation provides a good fit, but k = 3 shows some improvement.

12.3.2 Assessing the Fit

A fitted distance matrix, \tilde{D}, can be constructed from the final configuration of points in q-space. We can compute 'residual' norms

‖D − \tilde{D}‖

and other measures of fit as in cluster analysis.
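A minimal sketch of such a residual norm in R, using the Frobenius norm for the 2-dimensional classical solution (dist.c and iris.cmds2 were created above; output not shown):

> D <- as.matrix(dist.c)

> D.fit <- as.matrix(dist(iris.cmds2$points))

> norm(D - D.fit, type = "F")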

12.3.3 Interpretation of the Spatial Representation

The spatial representations of the dissimilarities in q-space can be enhanced by various strategies to facilitate their interpretation. For example (a sketch of the first strategy follows the list),

• use cluster analysis to discover natural groupings;

• rotate the axes to coincide with clusters or to line up with simple structure (projection pursuit);

• transform linearly to emphasize clustering;

• determine axes by optimally correlating with external variables.
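A minimal sketch of the first strategy, clustering the 2-dimensional classical configuration with k-means (centers = 3 is an assumption based on the known species structure) and cross-tabulating the clusters against species (output not shown):

> set.seed(1)

> iris.km <- kmeans(iris.cmds2$points, centers = 3)

> table(iris.km$cluster, iris.species)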


12.4 Joint Space Analysis


Appendix A

Vector and Matrix Algebra

A.1 Matrix Algebra

A matrix is a two-dimensional array represented by:

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix} = [a_{ij}],

where the subscript ij refers to the element in row i and column j.

Example: The distribution of moose in vegetative areas of Montana in 1958.

> A <- matrix(c(98, 39, 22, 24, 15, 15, 42, 22, 17), 3, 3)

> rownames(A) <- c("Bulls", "Cows", "Cows w/ calves")

> colnames(A) <- c("V1", "V2", "V3")

> A

               V1 V2 V3
Bulls          98 24 42
Cows           39 15 22
Cows w/ calves 22 15 17

Basic matrix operations and definitions

1. Matrix addition

C = A + B

is defined by [c_{ij}] = [a_{ij}] + [b_{ij}].


The dimensions of A and B must conform.

Example (cont): The corresponding distribution in 1959.

> B <- matrix(c(55, 43, 11, 19, 53, 40, 44, 38, 20), 3, 3)

> rownames(B) <- c("Bulls", "Cows", "Cows w/ calves")

> colnames(B) <- c("V1", "V2", "V3")

> B

               V1 V2 V3
Bulls          55 19 44
Cows           43 53 38
Cows w/ calves 11 40 20

The total number of moose over 1958 and 1959.

> A + B

                V1 V2 V3
Bulls          153 43 86
Cows            82 68 60
Cows w/ calves  33 55 37

The difference in the numbers of moose between 1958 and 1959.

> A - B

               V1  V2  V3
Bulls          43   5  -2
Cows           -4 -38 -16
Cows w/ calves 11 -25  -3

2. Scalar multiplication

B = cA, c a scalar,

is defined by [b_{ij}] = [c\,a_{ij}].

3. Matrix multiplication

C = AB

is defined by [c_{ij}] = \left[\sum_{k=1}^{m} a_{ik} b_{kj}\right].

The number of columns of A must equal the number of rows of B.

Example: Number of ewes having lambs in 1952 and 1953.


> A <- matrix(c(58, 52, 1, 26, 58, 3, 8, 12, 9), 3, 3)

> rownames(A) <- c("None", "Single", "Twins")

> colnames(A) <- c("None", "Single", "Twins")

> A

       None Single Twins
None     58     26     8
Single   52     58    12
Twins     1      3     9

> b <- c(0, 1, 2)

> C <- A %*% b

> colnames(C) <- c("Total")

> C

       Total
None      42
Single    82
Twins     21

4. Transposition

B = A'

is defined by [b_{ij}] = [a_{ji}].

> t(A)

       None Single Twins
None     58     52     1
Single   26     58     3
Twins     8     12     9

5. Matrix inverse

The n × n identity matrix is defined by:

I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}.

Let A and B be square matrices of order n. If AB = I_n, then B is said to be the inverse of A, and is denoted by A^{-1}. If A^{-1} exists, A is said to be nonsingular; otherwise, it is singular.

Properties:


(a) (A^{-1})' = (A')^{-1};

(b) (A^{-1})^{-1} = A.

> (B <- solve(A))

               None      Single       Twins
None    0.028394485 -0.01226922 -0.00888058
Single -0.026641739  0.03003038 -0.01635896
Twins   0.005725637 -0.00864688  0.11755083

> round(A %*% B, 5)

       None Single Twins
None      1      0     0
Single    0      1     0
Twins     0      0     1

> round(t(solve(A)), 5)

           None   Single    Twins
None    0.02839 -0.02664  0.00573
Single -0.01227  0.03003 -0.00865
Twins  -0.00888 -0.01636  0.11755

> round(solve(t(A)), 5)

           None   Single    Twins
None    0.02839 -0.02664  0.00573
Single -0.01227  0.03003 -0.00865
Twins  -0.00888 -0.01636  0.11755

6. Submatrices

The matrix formed from A by deleting rows and columns is called a submatrix of A.

> (A11 <- A[1:2, 1:2])

       None Single
None     58     26
Single   52     58

7. Determinant

The determinant is a scalar function defined on a square n × n matrix. For the ith row of A (i is arbitrary):

\det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det(A_{ij}),

where A_{ij} is the square submatrix found by deleting the ith row and the jth column.

Properties:

(a) det(AB) = det(A)det(B) for A, B square;

(b) det(A′) = det(A);

(c) det(cA) = c^n det(A) for c a scalar;

(d) det(A) ≠ 0 ⇐⇒ A is nonsingular.

> det(A)

[1] 17116
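The cofactor expansion can be checked directly; a minimal sketch expanding det(A) along the first row of the ewe matrix A and comparing with det() (output not shown):

> cof <- sapply(1:3, function(j) (-1)^(1 + j) * A[1, j] * det(A[-1, -j]))

> sum(cof)    # equals det(A)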

8. Block matrices

An r × c block matrix of A is defined by:

A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1c} \\ A_{21} & A_{22} & \cdots & A_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ A_{r1} & A_{r2} & \cdots & A_{rc} \end{bmatrix} = [A_{ij}],

where A_{ij} is n_i × m_j. The submatrices must conform by rows and columns. A common case is:

A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}.

9. Special matrices

The square matrix A is:

(a) diagonal if a_{ij} = 0 whenever i ≠ j;

(b) upper triangular if aij = 0 whenever i > j;

(c) lower triangular if aij = 0 whenever i < j;

(d) symmetric if A′ = A;

(e) positive definite if x'Ax > 0 for x ≠ 0;

(f) positive semi-definite if x′Ax ≥ 0;

(g) orthonormal if A′A = In;

(h) idempotent if A2 = A.

Example: Matrix operations are illustrated using the explanatory variables for predicting air pollution for each of 60 U.S. metropolitan areas (Henderson and Velleman, 1981).


> library(mult)

> data(airpoll)

> X <- as.matrix(cbind(1, airpoll[, c(1, 2, 4, 6)]))

> y <- airpoll[, 7]

> (n <- nrow(X))

[1] 60

> (p <- ncol(X))

[1] 5

> cbind(X, y)[1:5,]

         1 Rainfall Education Nonwhite SO2      y
akronOH  1       36      11.4      8.8  59  921.9
albanyNY 1       35      11.0      3.5  39  997.9
allenPA  1       44       9.8      0.8  33  962.4
atlantGA 1       47      11.1     27.1  24  982.3
baltimMD 1       43       9.6     24.4 206 1071.0

The y variable is ‘Mortality.’

The following matrix operations in R illustrate matrix transposition and matrix multiplication:

> (xtx <- t(X) %*% X)

               1 Rainfall Education Nonwhite      SO2
1           60.0   2242.0    658.40   712.20   3226.0
Rainfall  2242.0  89658.0  24358.00 28784.10 116552.0
Education  658.4  24358.0   7267.00  7722.32  34659.1
Nonwhite   712.2  28784.1   7722.32 13149.44  43607.5
SO2       3226.0 116552.0  34659.10 43607.50 410534.0

The inverse of a matrix can be computed by the solve function:

> round(xtxi <- solve(xtx), 7)

                   1   Rainfall  Education   Nonwhite        SO2
1          6.1148373 -0.0281618 -0.4533659  0.0040268 -0.0022081
Rainfall  -0.0281618  0.0002895  0.0016439 -0.0001156  0.0000126
Education -0.4533659  0.0016439  0.0352602 -0.0002245  0.0001429
Nonwhite   0.0040268 -0.0001156 -0.0002245  0.0002719 -0.0000087
SO2       -0.0022081  0.0000126  0.0001429 -0.0000087  0.0000051

We next compute the estimates required for a regression analysis.


> b <- xtxi %*% t(X) %*% y

> y_hat <- X %*% b

> res <- y - y_hat

> fivenum(res)

[1] -94.958710 -21.133279 -2.494249 17.701435 92.205055

The estimate of σ is given by:

> (sigma_hat <- sqrt(sum(res^2)/(n - p)))

[1] 37.2685

The computed R2 is:

> (R_2 <- 1 - sum(res^2)/sum((y - mean(y))^2))

[1] 0.6654661

The summary table for the model is now computed.

> b_se <- sigma_hat * sqrt(diag(xtxi))

> t <- b / b_se

> p_val <- 2 * (1 - pt(abs(t), n - p))

> coef <- cbind(b, b_se, t, p_val)

> colnames(coef) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")

> coef

             Estimate  Std. Error   t value     Pr(>|t|)
1         998.4570517 92.15827529 10.834155 2.886580e-15
Rainfall    1.6147726  0.63412694  2.546450 1.371222e-02
Education -15.7058089  6.99816250 -2.244276 2.885850e-02
Nonwhite    3.0595934  0.61453517  4.978712 6.680911e-06
SO2         0.3275942  0.08393961  3.902737 2.617222e-04

Compare this with the summary table from the R function lm:

> s.airpoll <- summary(lm (Mortality ~ Rainfall + Education + Nonwhite + SO2, data=airpoll))

> library(xtable)

> xtable(s.airpoll)

A.2 Vector Algebra

A vector is a one-dimensional array defined by

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.

Basic vector operations


The summary table for the airpoll regression fit by lm (typeset by xtable):

             Estimate Std. Error t value Pr(>|t|)
(Intercept)  998.4571    92.1583   10.83   0.0000
Rainfall       1.6148     0.6341    2.55   0.0137
Education    -15.7058     6.9982   -2.24   0.0289
Nonwhite       3.0596     0.6145    4.98   0.0000
SO2            0.3276     0.0839    3.90   0.0003

1. Inner product

The inner product of x and y, both n × 1, is:

x'y = \sum_{i=1}^{n} x_i y_i = y'x.

2. Outer product

The outer product of x, n× 1, and y, m× 1, is:

xy' = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_m \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_m \\ \vdots & \vdots & \ddots & \vdots \\ x_n y_1 & x_n y_2 & \cdots & x_n y_m \end{bmatrix}.

Let a_1, a_2, \ldots, a_m be vectors of length n. For scalars s_1, s_2, \ldots, s_m,

a = s1a1 + s2a2 + · · ·+ smam

is said to be a linear combination of a1,a2, . . . ,am. Let

V = {a |a = s1a1 + s2a2 + · · ·+ smam}.

V is called the vector space spanned by a_1, a_2, \ldots, a_m. The vectors a_1, a_2, \ldots, a_m are said to be linearly dependent if there exist scalars s_1, s_2, \ldots, s_m, not all zero, such that:

s1a1 + s2a2 + · · ·+ smam = 0.

Otherwise the vectors are said to be linearly independent. The dimension of V is the number of vectors in a maximal linearly independent subset of a_1, a_2, \ldots, a_m.

Let

A = [a_1\ a_2\ \cdots\ a_m],

i.e., the columns of the matrix A are the vectors a_1, a_2, \ldots, a_m. The range of A is:

R(A) = \{y \mid y = Ax \text{ for some } x\},

whereas the null space of A is:

N(A) = \{x \mid Ax = 0\}.


The rank of A is defined by:

rank(A) = dim[R(A)].

Also, rank(A) = rank(A'). The following are equivalent if n = m:

1. A is nonsingular;

2. N(A) = {0};

3. rank(A) = n.

Vector norms introduce the concept of length to vectors. The p-norms are defined by:

‖y‖_p = (|y_1|^p + |y_2|^p + \cdots + |y_n|^p)^{1/p}.

Special cases are:

‖y‖_1 = |y_1| + |y_2| + \cdots + |y_n| (the city-block norm),

‖y‖_2 = (|y_1|^2 + |y_2|^2 + \cdots + |y_n|^2)^{1/2} (the Euclidean norm),

and

‖y‖_∞ = \max_i |y_i| (the maximum norm).
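A minimal sketch computing the three special-case norms for a hypothetical numeric vector y in R (output not shown):

> y <- c(3, -4, 12)

> sum(abs(y))       # city-block norm

> sqrt(sum(y^2))    # Euclidean norm

> max(abs(y))       # maximum norm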

A.3 Matrix Decompositions

QR Decomposition

Let X be n × p. Then there exist matrices Q (n × r) and R (r × p) such that:

X = QR

where r = rank(X), Q has unit orthogonal (orthonormal) columns, i.e., Q'Q = I_r, which span the same space as the columns of X, and R is upper triangular. This is the basic decomposition for solving linear systems of equations. Gram-Schmidt orthogonalization is the most common way of computing the QR decomposition.
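A minimal sketch of the QR decomposition in R, using the base functions qr(), qr.Q(), and qr.R() on the airpoll design matrix X constructed earlier in this appendix (output not shown):

> X.qr <- qr(X)

> Q <- qr.Q(X.qr)

> R <- qr.R(X.qr)

> round(t(Q) %*% Q, 5)               # orthonormal columns: Q'Q = I

> range(Q %*% R - X[, X.qr$pivot])   # reconstruction error, essentially zero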

LU Decomposition

Let A be a square matrix of order n. Then there exist square matrices L and U such that:

A = LU

where L is lower triangular with 1s on the diagonal and U is upper triangular. The LU decomposition is often used to solve a linear system of n equations in n unknowns.

Singular Value Decomposition


Let X be n × p. Then there exists an n × n orthonormal matrix U, a p × p orthonormal matrix V, and an n × p diagonal matrix D with positive diagonal elements d_1 ≥ d_2 ≥ \cdots ≥ d_s, where s = min(n, p), such that:

X = UDV′.

This is called the singular value decomposition (SVD) of X and the d_i are called the singular values. The columns of U are the left singular vectors and the columns of V are the right singular vectors.

> (X <- airpoll[, c(1, 2, 4, 6)])[1:5,]

         Rainfall Education Nonwhite SO2
akronOH        36      11.4      8.8  59
albanyNY       35      11.0      3.5  39
allenPA        44       9.8      0.8  33
atlantGA       47      11.1     27.1  24
baltimMD       43       9.6     24.4 206

> Xd <- svd(X)

> Xd$d

[1] 676.69842 242.10704 59.23479 23.73007

> Xd$u[1:5, ]

            [,1]        [,2]        [,3]        [,4]
[1,] -0.10146829 -0.06616441  0.05787963 -0.04940925
[2,] -0.07231475 -0.08486303  0.13244188 -0.04012457
[3,] -0.06752523 -0.12186414  0.21074986  0.11447733
[4,] -0.06102096 -0.17506520 -0.20036464  0.01958718
[5,] -0.31101951  0.10209739 -0.12554200  0.13396598

> Xd$v

            [,1]       [,2]        [,3]        [,4]
[1,] -0.31204604 -0.8741930  0.25799994  0.26804841
[2,] -0.09102003 -0.2219586  0.13643837 -0.96116299
[3,] -0.11382151 -0.2614156 -0.95631217 -0.06460317
[4,] -0.93882229  0.3437774  0.01695999  0.01192438

> (Xd$u %*% diag(Xd$d) %*% t(Xd$v))[1:5, ]

     [,1] [,2] [,3] [,4]
[1,]   36 11.4  8.8   59
[2,]   35 11.0  3.5   39
[3,]   44  9.8  0.8   33
[4,]   47 11.1 27.1   24
[5,]   43  9.6 24.4  206


Eigenvalue Decomposition

Let P = X'X, i.e., P is the p × p cross-product matrix of the columns of X.

Then by applying the SVD to X:

P = VDU'UDV' = VD^2V' = VLV',

where L is a diagonal matrix with diagonal elements l_j = d_j^2. This is the eigenvalue decomposition of P. The l_j are the eigenvalues of P and the columns of V (the v_j) are the eigenvectors. Notice that the eigenvectors of X'X are the right singular vectors of X and the eigenvalues are the squares of the corresponding singular values. The matrix of eigenvectors is orthogonal (i.e., v_i'v_j = 0 for i ≠ j) or orthonormal if the eigenvectors are scaled to have unit length (assumed here).

Since V is orthonormal, this can be written as:

PV = VL.

Eigenvalues and eigenvectors are defined for general P, not only for P which are positive semi-definite (i.e., X'X is positive semi-definite). The representation more commonly seen is:

P v_j = l_j v_j.

> X <- as.matrix(X)

> XtX <- t(X) %*% X

> Xe <- eigen(XtX)

> Xe$val

[1] 457920.7467 58615.8171 3508.7601 563.1161

Note that the eigenvalues are the squares of the singular values.

> sqrt(Xe$val)

[1] 676.69842 242.10704 59.23479 23.73007

Also, the eigenvectors are the right singular vectors (up to sign).

> Xe$vec

            [,1]       [,2]        [,3]        [,4]
[1,] -0.31204604 -0.8741930 -0.25799994  0.26804841
[2,] -0.09102003 -0.2219586 -0.13643837 -0.96116299
[3,] -0.11382151 -0.2614156  0.95631217 -0.06460317
[4,] -0.93882229  0.3437774 -0.01695999  0.01192438

The eigenvalue decomposition can be generalized. Let P1 be positive semi-definite and let P2 be positive definite. Then:

P_1 v_j = l_j P_2 v_j


is called the generalized eigenvalue problem. This can be solved by:

(P_2^{-1} P_1) v_j = l_j v_j,

but it is inefficient since P_2^{-1} P_1 is, in general, not symmetric.

Square Root Decomposition

Let P be positive semi-definite. Then there exists a symmetric matrix P^{1/2} such that:

P = P^{1/2} P^{1/2}.

This follows from the eigenvalue decomposition:

P = VDV'
  = VD^{1/2}V'VD^{1/2}V'
  = P^{1/2}P^{1/2}.

That is, P^{1/2} = VD^{1/2}V'. Likewise, P^{-1/2} = VD^{-1/2}V'. The square root decomposition can be used to solve the generalized eigenvalue problem.
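A minimal sketch of the square root decomposition in R, applied to the cross-product matrix XtX computed earlier in this appendix (output not shown):

> P.e <- eigen(XtX)

> P.half <- P.e$vectors %*% diag(sqrt(P.e$values)) %*% t(P.e$vectors)

> range(P.half %*% P.half - XtX)   # reconstruction error, essentially zero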

Cholesky Decomposition

Let P be positive semi-definite. Then there exists an upper triangular matrix U such that:

P = U′U.

This is the Cholesky decomposition of P. It can be used to solve the generalized eigenvalue problem. Note that:

P_1 v_j = l_j P_2 v_j = l_j U'U v_j,

where U′U is the Cholesky decomposition of P2. Equivalently:

(U^{-1})' P_1 U^{-1} U v_j = l_j U v_j.

Let \tilde{P} = (U^{-1})' P_1 U^{-1} and w_j = U v_j. Then the problem becomes:

\tilde{P} w_j = l_j w_j.

Once this eigenproblem is solved, backsolve by:

v_j = U^{-1} w_j.

Note that the eigenvalues of the transformed problem remain the same.
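A minimal sketch of this approach in R, using hypothetical positive definite matrices built from the airpoll variables purely for illustration (chol() returns U with P = U'U, and backsolve() inverts the triangular factor):

> P1 <- cov(airpoll[, c(1, 2, 4, 6)])

> P2 <- diag(diag(P1))                   # a simple positive definite choice

> U <- chol(P2)                          # P2 = U'U

> Uinv <- backsolve(U, diag(nrow(U)))    # U^{-1}

> P.t <- t(Uinv) %*% P1 %*% Uinv         # transformed symmetric problem

> e <- eigen(P.t)

> l <- e$values                          # generalized eigenvalues l_j

> V <- Uinv %*% e$vectors                # back-transformed eigenvectors v_j = U^{-1} w_j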


Appendix B

Distribution Theory

B.1 Random Variables and Probability Distributions

Variables were discussed in Chapter 1. Since individuals in a population are selected randomly (although this assumption is often not satisfied), the value of the variable is not known in advance. Likewise, the probability Y takes on a specific value, i.e., P(Y = y), is also unknown. What is known?

Often we can assume that we know the form of the distribution of Y. Denote the distribution of Y by f(y; θ), where f is known, but θ is unknown. This determines a family with each specific value of θ determining a member of the family. The function f is called a probability density function if Y is continuous and a probability distribution if Y is discrete, and θ is the parameter of the distribution.

The above f is a univariate distribution for a single random variable Y. The more general case is for Y' = [y_1\ y_2\ \cdots\ y_p]. The multivariate probability density (distribution) is given by f(y; θ).

The initial steps of modeling involve specifying the model and fitting the model. This needs to be translated to the above context. The characteristics of the variables often provide the information for selecting the form of f. For example, if Y is a vector of continuous random variables, the multivariate normal is a reasonable starting place. Once a decision has been made, the data can be used to assess this assumption. It may be necessary to revise the basic model.

Once a model is specified (i.e., f is selected), the next problem is to estimate θ, the vector of unknown parameters. Least squares and maximum likelihood estimation are the principal methods used in this book. Robust versions of these estimators are also used if the true underlying probability distribution deviates "somewhat" from the hypothesized family.

The next section introduces the important normal family. Most of the models developed in this book are based on the multivariate normal or the conditional normal distribution. The following section discusses several other continuous distributions, most of which are derived from the normal. These distributions form the basis for statistical inference or for assessing distributional assumptions.

Figure B.1: The Univariate Normal Density Function

B.2 The Normal Family

The univariate normal probability density function is given by:

f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2}, \quad -\infty < y < \infty.

The density is symmetric about \mu, the population mean. The mode occurs at \mu, where f attains a maximum value of \frac{1}{\sqrt{2\pi}\,\sigma}. \mu \pm \sigma are the points of inflection, where \sigma is the population standard deviation. Graphically, f has the bell-shaped form shown in Figure B.1.

Let

Z = \frac{Y - \mu}{\sigma}.

Z is a random variable which is normally distributed with a mean of 0 and a standard deviation of 1. Thus, the probability distribution of Z is

f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2},

which is called the standard normal distribution. Z is a linear transformation of Y which changes the location and scale of Y. It is often called the Z-score transform.

The bivariate generalization is given by

f(y_1, y_2; \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left\{\left(\frac{y_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{y_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{y_1-\mu_1}{\sigma_1}\right)\left(\frac{y_2-\mu_2}{\sigma_2}\right)\right\}\right], \quad (B.1)

where \mu_i and \sigma_i are the mean and standard deviation of Y_i (i = 1, 2) and \rho is the correlation coefficient. The surface represents a bell-shaped mound depending on \rho and \sigma_1/\sigma_2. The standardized form is given by

f(z_1, z_2) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left\{z_1^2 + z_2^2 - 2\rho z_1 z_2\right\}\right].


Figure B.2: Equiprobability Concentration Ellipses

Figure B.3: Probability Ellipses

The isodensity contours are found by varying c, the value of the quadratic form in the exponent of Equation B.1. That is,

\left(\frac{y_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{y_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{y_1-\mu_1}{\sigma_1}\right)\left(\frac{y_2-\mu_2}{\sigma_2}\right) = c.

Concentric ellipses for several values of c are shown in Figure B.2. Figure B.3 shows how the ellipse depends on \rho and \sigma_1/\sigma_2. If \rho = \pm 1, the ellipse becomes a straight line.

The multivariate normal probability distribution is given by

f(y; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} e^{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)},

where \mu' = [\mu_1\ \mu_2\ \cdots\ \mu_p] is the population mean vector and

\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_p^2 \end{bmatrix}

is the covariance or dispersion matrix. The multivariate normal probability distribution is denoted by

Y ∼ N(µ,Σ).

The quadratic form

(y − µ)′Σ−1(y − µ) = c

defines a p-dimensional equiprobability ellipsoid.
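A minimal sketch evaluating this density directly from the formula, using solve() for \Sigma^{-1} and det() for |\Sigma|; the function name dmvn and the test values are hypothetical:

> dmvn <- function(y, mu, Sigma) {

+     p <- length(mu)

+     d <- y - mu

+     drop(exp(-0.5 * t(d) %*% solve(Sigma) %*% d) /

+         ((2 * pi)^(p/2) * sqrt(det(Sigma))))

+ }

> dmvn(c(0, 0), mu = c(0, 0), Sigma = diag(2))   # equals 1/(2 * pi)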


Properties of the multivariate normal distribution:

In the following, let Y \sim N(\mu, \Sigma). Partition Y' = [Y_1'\ Y_2'], where Y_1 has dimension p - q and Y_2 dimension q. Similarly partition

\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.

1. V = AY ∼ N(Aµ,AΣA′).

The form of the mean and covariance matrix does not depend on the normality assumption. That is,

E(V ) = E(AY ) = AE(Y ) = Aµ

Var(V ) = E[V − E(V )][V − E(V )]′

= AE[Y − E(Y )][Y − E(Y )]′A′

= AΣA′

for any random variable Y .

A special case is the linear combination

V = a'Y = \sum_{i=1}^{p} a_i Y_i.

Then,

V \sim N(a'\mu, a'\Sigma a),

i.e., if Y is multivariate normal, then any linear combination of Y is normal.

2. Y1 ∼ N(µ1,Σ11).

More generally, the marginal distribution of Y is multivariate normal for any subset of the Y_j, with means, variances, and covariances found by taking the corresponding components of \mu and \Sigma.

3. Y_1 \mid Y_2 = y_2 \sim N(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}).

B = \Sigma_{12}\Sigma_{22}^{-1} is called the matrix of regression coefficients of Y_1 on y_2.

E(Y_1 \mid y_2) = \mu_1 + B(y_2 - \mu_2)

is called the regression function. Note that E(Y_1 \mid y_2) is linear in y_2.

\Sigma_{11 \cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} is called the partial covariance matrix. Note that \Sigma_{11 \cdot 2} does not depend on y_2. This is an important property and forms an important aspect of regression theory.
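A minimal sketch of these quantities in R, using a hypothetical 3 x 3 covariance matrix partitioned with p - q = 1 and q = 2 (the numbers are for illustration only):

> Sigma <- matrix(c(4, 2, 1,

+                   2, 3, 1,

+                   1, 1, 2), 3, 3)

> S12 <- Sigma[1, 2:3, drop = FALSE]

> S22 <- Sigma[2:3, 2:3]

> (B <- S12 %*% solve(S22))                               # regression coefficients of Y1 on y2

> (S11.2 <- Sigma[1, 1] - S12 %*% solve(S22) %*% t(S12))  # partial covariance (here a scalar)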


The bivariate case (p = 2; q = 1) corresponds to simple linear regression.

E(Y_1 \mid y_2) = \mu_1 + \frac{\sigma_{12}}{\sigma_2^2}(y_2 - \mu_2) = \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y_2 - \mu_2)

and

Var(Y_1 \mid y_2) = \sigma_1^2 - \frac{\sigma_{12}^2}{\sigma_2^2} = \sigma_1^2 - \sigma_1^2\rho^2 = \sigma_1^2(1 - \rho^2).

When q = p - 1, the conditional model reduces to multiple regression. In this case, the regression coefficients \beta' = \sigma_{12}'\Sigma_{22}^{-1} form a vector. The regression function is

E(Y_1 \mid y_2) = \mu_1 + \beta'(y_2 - \mu_2).

B.3 Other Continuous Distributions

Let

X^2 = \sum_{i=1}^{p} Z_i^2,

where Z_1, Z_2, \ldots, Z_p are p independent standard normal random variables. X^2 is said to have a central chi-squared distribution with p degrees of freedom, i.e., X^2 \sim \chi^2_p.

Now let Y \sim N(\mu, \Sigma). Then

(Y - \mu)'\Sigma^{-1}(Y - \mu) \sim \chi^2_p.

The chi-square has the reproductive property, i.e., the sum of independent chi-square random variables is a chi-square.

The central chi-square has a single parameter p. This distribution is a special case of the gamma family. The probability density function of the gamma is given by

f(y; \lambda, \eta) = \frac{1}{\Gamma(\eta)\lambda^{\eta}} y^{\eta-1} e^{-y/\lambda}, \quad y > 0,

where \lambda > 0 and \eta > 0. The chi-square is a special case of the gamma in which \eta = p/2 and \lambda = 2.

For the general case, \lambda is called the scale parameter and \eta is called the shape parameter. The chi-square and gamma distributions are used extensively to assess distributional assumptions. The chi-square is also used to make statistical inferences, particularly for categorical variables.
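This relationship is easy to check numerically; a minimal sketch comparing the \chi^2_4 density with the gamma density having \eta = 2 and \lambda = 2 (output not shown):

> x <- seq(0.5, 10, by = 0.5)

> all.equal(dchisq(x, df = 4), dgamma(x, shape = 2, scale = 2))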

The t-distribution with p degrees of freedom is defined by

t = \frac{Z}{\sqrt{X^2/p}},

where Z \sim N(0, 1), X^2 \sim \chi^2_p, and Z and X^2 are independent. The t-distribution is used to make inferences about coefficients in conditional models.


The central F-distribution is defined by

F = \frac{X_1^2/r}{X_2^2/s},

where X_1^2 and X_2^2 are independent random variables with X_1^2 \sim \chi^2_r and X_2^2 \sim \chi^2_s. The F-distribution is used to make statistical inferences, particularly in analysis of variance models.

B.4 Maximum Likelihood and Related Estimators


Appendix C

Statistical Software

The S system introduced many of the ideas that are now commonplace in statistical computing environments. These include a command line interface, a statistical language, foreign function interfaces, and a rich function library.

The emergence of graphical user interfaces has now changed the way in which data analysis is done. For example, Xgobi extends S into the realm of interactive statistical graphics. Although function programming is straightforward in S, developing a graphical interface and statistical graphics models are not. Consequently, researchers are likely to be limited to applying statistical functions to data without the benefits of dynamically-linked plots and interactive models. While these latter features may be unnecessary for certain applications, their use is essential in multivariate analyses.

C.1 Lisp-Stat

Lisp-Stat is a new statistical programming environment which runs under MacOS, X11, and Windows. It contains features for quickly prototyping advanced applications involving statistical computing and dynamic graphics, including interface elements. Specifically, Lisp-Stat contains:

• a prototype-based object language

• multiple inheritance

• a rich object library

• foreign function interfaces

• functional data support

• a vectorized arithmetic system

• integrated data analysis, graphics, and programming facilities


Figure C.1: The Model-View-Controller Paradigm

• facilities for acquiring, representing, and processing knowledge

This book uses a data analysis and graphics system called StatObjects, which is based on the Lisp-Stat environment. The graphical interface is integral to the environment, particularly the user interactions with the session log and data browsers.

C.2 StatObjects

StatObjects follows the model-view-controller paradigm pioneered in the Smalltalk development environment and popularized by the Macintosh. Every statistical object created by a constructor function, e.g., a principal component or discriminant instance, is actually an extended dataset (model) which has dependency relationships to its higher-level datasets. A dataset is said to be extended in the sense of Thisted [?], i.e., it is a data-analytic artifact which has pointers to the original variables, model-derived variables, state variables, numerical summaries, and dependency views.

A view (or subview) is an object which visualizes a dataset. Generally, datasets have multiple views: iconic, data browser, textual, and graphical. The canonical or standard representation of a dataset is its iconic view. Dataset icons are actually subviews of the session log, which shows the hierarchical structure of related datasets and provides an analysis history mechanism. The controller for this subview is a popup menu for sending messages to the dataset object.

Datasets generally have one textual view, which summarizes the numerical information about the dataset, and many graphical views. Graphical-view constructors actually create a new dataset object which in turn is visualized. Many views are available or are being developed. These include various types of quantile, conditioning, and multivariate plots.

Graphical views have two types of controllers (input devices): methods built into the view itself or object instances of controller prototypes. The latter controllers are reusable components which are available to any graphical view object. Although views and controllers have explicit links to each other and to their dataset object, datasets do not have explicit links to their views and controllers (see Figure C.1). However, implicit links must exist, since views must be given the opportunity to reflect changes made to their underlying dataset object. Protocols determine how views are updated.

The above discussion is a brief summary of the underlying statistical objects in StatObjects. A more complete description is given in Galfalvy.
