Multivariate Ordination Analyses: Principal Component Analysis

Dilys Vela
Tatiana Boza

phylodiversity.net/azanne/csfar/images/6/6a/Presentation.pdf


Multivariate Analyses

A multivariate data set includes more than one variable recorded from a number of replicate sampling or experimental units, sometimes referred to as objects.

If these objects are organisms, the variables might be morphological or physiological measurements.

If the objects are ecological sampling units, the variables might be physicochemical measurements or species abundances.

What are ordination analyses?

Ordination is arranging items along a scale (axis) or multiple axes. The purpose of ordination is to summarize complex relationships graphically, extracting one or a few dominant patterns from an infinite number of possible patterns.

The placement of variables along an axis is possible because the ordination is based on the correlations among the variables.

What do ordination analyses help us to see?

Select the most important variables from multiple variables imagined or hypothesized.

Reveal unforeseen patterns and suggest unforeseen processes.

What type of question can we answer with ordination analysis?

In ecology, to seek and describe pattern of process. 

In community ecology, to describe the strongest patterns in species composition.

In systematics, to recognize and to define species boundaries.

Multivariate Analysis

[Diagram: classification of multivariate methods]

  • Ordination Analysis, divided into Direct Gradient Analysis and Indirect Gradient Analysis
  • Classification (or Clustering Analysis)

Methods named in the diagram: Linear Regression (few species), Correspondence Analysis (CA) (many species), Detrended CA (DCA), Canonical CA (CCA), Redundancy Analysis (RDA), Principal Coordinate Analysis (PCoA), Non‐metric Multidimensional Scaling (NMDS), Principal Components Analysis (PCA).

Branch labels in the diagram: Distance Values, Raw Data Available.

Principal Components Analysis

Principal component analysis (PCA) is a statistical technique that has been specifically developed to address data reduction.

In general terms, the major aim of PCA is to reduce the complexity of the interrelationships among a potentially large number of observed variables to a relatively small number of linear combinations of them, which are referred to as principal components.

Principal components analysis finds a set of orthogonal standardized linear combinations which together explain all of the variation in the original data.
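The data-reduction idea can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the authors' workflow: the function name `pca` and the synthetic data set (five observed variables generated from two hidden factors plus a little noise) are assumptions made for the example.

```python
import numpy as np

def pca(data):
    """PCA by eigendecomposition of the covariance matrix.

    data: array of shape (n_objects, n_variables).
    Returns eigenvalues (descending), eigenvectors (one component
    per column) and the scores of the objects on the components.
    """
    centered = data - data.mean(axis=0)        # center each variable
    cov = np.cov(centered, rowvar=False)       # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                # objects in PC coordinates
    return eigvals, eigvecs, scores

# Synthetic data (an assumption): 5 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

eigvals, eigvecs, scores = pca(data)
explained = eigvals / eigvals.sum()            # share of variance per component
print(np.round(explained, 3))
print(np.round(explained[:2].sum(), 3))        # share kept by the first two
```

In this synthetic case the first two components carry most of the total variance, which is the sense in which a few linear combinations can stand in for many observed variables. The eigenvalues still sum to the total variance of the original variables, so all components together lose nothing.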

What are the assumptions of PCA?

Assumes relationships among variables.

The cloud of points in p‐dimensional space has linear dimensions that can be effectively summarized by the principal axes.

If the structure in the data is NONLINEAR (the cloud of points twists and curves its way through p‐dimensional space), the principal axes will not be an efficient and informative summary of the data.

Considerations before running a PCA

Normal Distributions

Data Outliers

Transformations

Standardization

Data Matrix

Normal Distributions

• When using PCA, data normality is not essential. However, the method is based on the correlation or covariance matrix, which is strongly affected by non‐normally distributed data and the presence of outliers.

Data outliers

• Extreme values as well as outliers can have a severe influence on PCA, since it is based on the correlation or covariance matrix (Pison et al., 2003).

• Outliers should thus be removed prior to the statistical analysis, or statistical methods able to handle outliers should be employed, and the influence of extreme values needs to be reduced (e.g., via a suitable transformation).
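A small numerical sketch can make the sensitivity to outliers concrete. The data below are synthetic and the size of the outlier (25 on both variables) is an arbitrary assumption chosen for illustration; the point is only that a single extreme observation can change the correlation matrix that PCA would decompose.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = rng.normal(size=50)                  # generated independently of x

r_clean = np.corrcoef(x, y)[0, 1]        # correlation without the outlier

# Append one extreme observation far away from the rest of the cloud.
x_out = np.append(x, 25.0)
y_out = np.append(y, 25.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(round(r_clean, 2), round(r_outlier, 2))
```

With the outlier included, the two unrelated variables appear strongly correlated, so the first principal component would largely describe that one observation rather than the bulk of the data.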

Transformations

Transformations change the scale of measurement of the data, usually to meet the normality assumption of parametric analyses and the homogeneity of variance assumption of most of these analyses.

Transformations are particularly important for multivariate procedures based on eigenanalysis (e.g. principal components analysis) because covariances and correlations measure linear relationships between variables.

Transformations that improve linearity will increase the efficiency with which the eigenanalysis extracts the eigenvectors.
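As a hedged illustration of why linearity matters, the sketch below builds a deliberately curved (exponential) relationship and compares the correlation before and after a log transformation. The exponential form and the choice of the logarithm are assumptions made for the example; the appropriate transformation for real data depends on the variables.

```python
import numpy as np

x = np.linspace(1.0, 10.0, 50)
y = np.exp(0.8 * x)                       # monotonic but strongly curved

r_raw = np.corrcoef(x, y)[0, 1]           # linear correlation on the raw scale
r_log = np.corrcoef(x, np.log(y))[0, 1]   # log(y) = 0.8 * x is exactly linear

print(round(r_raw, 3), round(r_log, 3))
```

The relationship is perfectly deterministic in both cases, yet the raw-scale correlation understates it because a correlation only measures the linear part of a relationship. After the transformation the same information is fully visible to an eigenanalysis.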

Standardization

The first stage in rotating the data cloud is to standardize the data by subtracting the mean and dividing by the standard deviation.

It may be argued that we should not divide by the standard deviation. By standardizing, we are giving all species the same variation, i.e. a standard deviation of 1.
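The standardization step described here can be written as a short helper. The function name `standardize` and the small matrix of values are illustrative assumptions, not data from the presentation.

```python
import numpy as np

def standardize(data):
    """Center each column on zero and scale it to unit standard deviation."""
    mean = data.mean(axis=0)
    sd = data.std(axis=0, ddof=1)        # sample standard deviation
    return (data - mean) / sd

# Illustrative values only: two variables on very different scales.
data = np.array([[2.0, 100.0],
                 [4.0, 300.0],
                 [6.0, 200.0],
                 [8.0, 400.0]])

z = standardize(data)
print(z.mean(axis=0))                    # approximately 0 for every column
print(z.std(axis=0, ddof=1))             # 1 for every column
```

After this step every variable contributes the same variance to the analysis, which is exactly the property the slide questions: differences in variability between species are deliberately discarded.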

Data Matrix

We actually can have it both ways: a PCA without dividing by the standard deviation is an analysis of the covariance matrix. A PCA in which you do indeed divide by the standard deviation is an analysis of the correlation matrix.

When using species/variables measured in different units, you must use a correlation matrix.
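The difference between the two matrices can be sketched with two synthetic variables recorded in different units. The variable names (`length_m`, `mass_g`), their distributions and the helper `first_axis` are assumptions for illustration only.

```python
import numpy as np

def first_axis(matrix):
    """Eigenvector belonging to the largest eigenvalue of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    return eigvecs[:, np.argmax(eigvals)]

rng = np.random.default_rng(2)
length_m = rng.normal(1.5, 0.2, size=100)                        # metres
mass_g = 2000.0 * length_m + rng.normal(0.0, 300.0, size=100)    # grams
data = np.column_stack([length_m, mass_g])

S = np.cov(data, rowvar=False)        # covariance matrix, raw units
R = np.corrcoef(data, rowvar=False)   # correlation matrix

# The covariance matrix of standardized data is the correlation matrix.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
S_of_z = np.cov(z, rowvar=False)

pc1_S = first_axis(S)   # dominated by the variable with the larger variance
pc1_R = first_axis(R)   # both variables weighted equally in absolute value
print(np.round(pc1_S, 4), np.round(pc1_R, 4))
```

On the covariance matrix the first axis is almost entirely the mass variable, simply because grams produce much larger numbers than metres. On the correlation matrix both variables contribute equally, which is why mixed units call for the correlation matrix.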

Look at Descriptors

  • Homogeneous nature? All the same kind? Same units? Same order of magnitude? → S matrix (Covariance)
  • Heterogeneous nature? Different kinds? Different units? Different orders of magnitude? → R matrix (Correlation)

Advantages and Disadvantages

Correlation Matrix

Advantages:
  • The results of analyses for different sets of random variables are more directly comparable.

Disadvantages:
  • There are considerable differences in the standard deviations, caused mainly by differences in scale.
  • None of the correlations is particularly large in absolute value.
  • PCs have moderate‐sized coefficients for several of the variables.
  • PCs give coefficients for standardized variables and are therefore less easy to interpret directly.

Covariance Matrix

Advantages:
  • PCs for the covariance matrix are each dominated by a single variable.
  • The variances and total variance are more meaningful indices for measuring variability in data sets that are symmetric.

Disadvantages:
  • The sensitivity of the PCs to the units of measurement used for each element of the variables.
  • If there are large differences between the variances of the elements of the variables, then those variables whose variances are largest will tend to dominate the first few PCs.

Eigenvalues & Eigenvectors

The eigenvectors are the loadings of the principal components spanning the new PCA coordinate system.

The amount of variability contained in each principal component is expressed by the eigenvalues, which are simply the variances of the scores.
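The statement that eigenvalues are the variances of the scores can be checked numerically. The sketch below uses synthetic correlated data (an assumption for illustration) and projects the centered observations onto the eigenvectors of their covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic correlated data: 300 objects, 3 variables.
data = rng.normal(size=(300, 3)) @ np.array([[2.0, 0.5, 0.0],
                                             [0.0, 1.0, 0.3],
                                             [0.0, 0.0, 0.5]])
centered = data - data.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

scores = centered @ eigvecs                 # coordinates on the new axes
score_var = scores.var(axis=0, ddof=1)      # variance of each score column
score_cov = np.cov(scores, rowvar=False)    # diagonal: scores are uncorrelated

print(np.round(eigvals, 4))
print(np.round(score_var, 4))
```

The two printed vectors agree, and the covariance matrix of the scores is diagonal: the new axes are orthogonal and the scores on different components are uncorrelated.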

PCA searches for the direction in the multivariate space that contains the maximum variability. This is the direction of the first principal component (PC1). The second principal component (PC2) has to be orthogonal (perpendicular) to PC1 and will contain the maximum amount of the remaining data variability. Subsequent principal components are found by the same principle.

Biplots

A biplot is a visualization tool to present results of PCA. The PCA biplot is called the scaling process.

The loadings (arrows) represent the elements. The lengths of the arrows in the plot are directly proportional to the variability included in the two components (PC1 and PC2) displayed, and the angle between any two arrows is a measure of the correlation between those variables.
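The geometry of the arrows can be sketched without drawing anything. The example assumes one common convention, a PCA of the correlation matrix with each eigenvector scaled by the square root of its eigenvalue; other biplot scalings exist, and the data are synthetic. The helper `cosine` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(200, 3))
data[:, 1] += 0.8 * data[:, 0]            # make variables 0 and 1 correlated

R = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# One arrow per variable: eigenvector entries scaled by sqrt(eigenvalue).
arrows = eigvecs * np.sqrt(eigvals)       # rows = variables, columns = PCs

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cos_full = cosine(arrows[0], arrows[1])          # using all components
cos_2d = cosine(arrows[0, :2], arrows[1, :2])    # what a PC1-PC2 plot shows
length_2d = (arrows[:, :2] ** 2).sum(axis=1)     # squared arrow lengths in 2-D

print(round(cos_full, 3), round(float(R[0, 1]), 3), round(cos_2d, 3))
print(np.round(length_2d, 3))
```

With all components retained, the cosine of the angle between two arrows equals the correlation between the two variables exactly. In the two-dimensional plot it is an approximation, and the squared length of each arrow is the share of that variable's variance displayed by PC1 and PC2, so short arrows signal variables that are poorly represented in the plot.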

Misconceptions

PCA cannot cope with missing values (but neither can most other statistical methods).

It does not require normality.

It is not a hypothesis test.

It is not a hypothesis test.

There are no clear distinctions between response variables and explanatory variables.
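The point about missing values can be illustrated with a small sketch. Dropping every object that has a gap (listwise deletion) is shown only as one simple option and is an assumption of this example, not a recommendation from the presentation; the values in the matrix are invented for illustration.

```python
import numpy as np

# Illustrative matrix: rows are objects, columns are variables, NaN marks a gap.
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, np.nan, 1.0],
                 [3.0, 1.0, 2.0],
                 [4.0, 3.0, np.nan],
                 [5.0, 5.0, 4.0]])

# Keep only the rows without any missing value.
complete = data[~np.isnan(data).any(axis=1)]

cov_raw = np.cov(data, rowvar=False)            # NaN propagates into the matrix
cov_complete = np.cov(complete, rowvar=False)   # usable input for a PCA

print(np.isnan(cov_raw).any(), np.isnan(cov_complete).any())
```

A covariance matrix that contains NaN cannot be decomposed into meaningful eigenvalues, so the gaps have to be handled before the analysis. Listwise deletion is easy but discards whole objects, which can be costly when specimens are scarce.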

When should PCA be used?

In community ecology, PCA is useful for summarizing variables whose relationships are approximately linear or at least monotonic.

e.g. A PCA of many soil properties might be used to extract a few components that summarize main dimensions of soil variation.

PCA is generally NOT useful for ordinating community data. Why? Because relationships among species are highly nonlinear.

Community trends along environmental gradients appear as “horseshoes” in PCA ordinations.

None of the PC axes effectively summarizes the trend in species composition along the gradient.

[Figure: PCA ordination, Axis 1 vs. Axis 2. Panel label: Beta Diversity 2R - Covariance]

The “Horseshoe” Effect

Curvature of the gradient and the degree of infolding of the extremes increase with beta diversity.

PCA ordinations are not useful summaries of community data except when beta diversity is very low.

Using correlation generally does better than covariance. This is because standardization by species improves the correlation between Euclidean distance and environmental distance.
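The horseshoe can be reproduced with a simulated community. Everything in this sketch is an assumption made for illustration: a single gradient, evenly spaced sites, species with Gaussian (unimodal) response curves of equal height and width, and a narrow tolerance to mimic high beta diversity.

```python
import numpy as np

sites = np.linspace(0.0, 1.0, 41)       # site positions along the gradient
optima = np.linspace(-0.2, 1.2, 29)     # species optima (hypothetical)
tolerance = 0.1                         # narrow niches -> high beta diversity

# Abundance of each species at each site: Gaussian response curves.
abundance = np.exp(-(sites[:, None] - optima[None, :]) ** 2
                   / (2.0 * tolerance ** 2))

centered = abundance - abundance.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
scores = centered @ eigvecs[:, :2]      # site scores on the first two axes

explained = eigvals[:2].sum() / eigvals.sum()

# Product of the scores of the two end sites, one value per axis.
# Negative: the axis separates the two ends of the gradient.
# Positive: the axis folds the two ends together (the arch).
end_products = scores[0] * scores[-1]

print(round(float(explained), 2), np.sign(end_products))
```

Although the data contain a single gradient, the first two axes together leave a substantial part of the variance unexplained, and one of them places both ends of the gradient on the same side. Plotting the two columns of `scores` against each other would show the sites bent into an arch rather than laid out along one axis.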

What if there’s more than one underlying ecological gradient?

When there are two or more underlying gradients with high beta diversity, a “horseshoe” is usually not detectable.

Interpretation problems are more severe.


Data Set

Morphological and anatomical variation of Calophyllum L. (Calophyllaceae) in South America.

D. Vela

Kielmeyeroideae

Calophylleae
  • Calophyllum
  • Neotatea
  • Marila
  • Mahurea
  • Clusiella
  • Kielmeyera
  • Caraipa
  • Haploclathra
  • Poeciloneuron
  • Mesua
  • Kayea
  • Mammea

Endodesmieae
  • Endodesmia
  • Lebrunia

[Photographs: Caraipa, Kayea, Calophyllum]

Stevens, 2006

Wurdack & Davis (2009)

Distribution of Calophyllaceae

[Map labels: 144 species; 259 species; 10 species; 176 species]

Stevens, 2006
http://www.mobot.org/MOBOT/research/APweb/


www.wikimedia.org

[Figure labels: Vein; Resin canal]

http://www.botany.hawaii.edu/faculty/carr/images/cal_ino.jpg

http://pakuwon.wordpress.com/2009/02/13/nyamplung‐calophyllum‐inophyllum/

http://www.flickr.com/photos/mauroguanandi

Calophyllum brasiliense

• There is infraspecific variation in tepal number between individuals of the same species, and between flowers from the same inflorescence.

Stevens (1974,1980)

Calophyllum brasiliense

http://www.nationaalherbarium.nl/sungaiwain/

Calophyllum pisiferum

Calophyllum lanigerum

1. Main objective
1.A To distinguish species limits of Calophyllum in South America.

2. Specific objectives
2.A To analyze morphological and anatomical variation.

Data collection for morphological observations

Herbarium and personal collections.

Collection sort: qualitative characteristics (Systematic Association Committee for Descriptive Biological Terminology, cited by Stearn 2006).

Measurement: ruler and a digital caliper.

Excel data matrix: specimen collections in rows and variables in columns.

Leaf characters
  • Petiole length mm (PTL)
  • Leaf length cm (LL)
  • Leaf length at widest part cm (LWWP)
  • Leaf width cm (LW)
  • Apex length mm (PL)
  • Midrib width at abaxial side mm (MW)
  • Vein angle degree (VA)
  • Venation density (VD)

Flower characters
  • Pedicel length mm (PDL)
  • Perianth width mm (PW)
  • Perianth length mm (PRL)
  • Anther length mm (AL)
  • Anther width mm (AW)
  • Stamen length mm (STL)
  • Filament length mm (FL)
  • Style length mm (STYL)
  • Gynoecium length mm (GL)
  • Ovary length mm (OL)
  • Stigma width mm (SL)

Fruit characters
  • External fruit length mm (FrLEx)
  • External fruit width mm (FrWEx)
  • Internal fruit length mm (FrLIn)
  • Internal fruit width mm (FrWIn)
  • Stigma remained mm (StygR)
  • Basal discoloration mm (BsDis)
  • Stone mm (Stn)
  • Corky mm (CRK)
