38
Multivariate Statistics in Ecology and Quantitative Genetics Correspondence analysis Dirk Metzler & Martin Hutzenthaler http://evol.bio.lmu.de/_statgen Summer semester 2011

Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Multivariate Statistics in Ecology andQuantitative Genetics

Correspondence analysis

Dirk Metzler & Martin Hutzenthaler

http://evol.bio.lmu.de/_statgen

Summer semester 2011

Page 2: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

1 Correspondence analysisMotivationSettingMathematical backgroundExample: Mexican plant dataSite conditional biplot and species conditional biplotExample: An artificial example

Page 3: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Motivation

Contents

1 Correspondence analysisMotivationSettingMathematical backgroundExample: Mexican plant dataSite conditional biplot and species conditional biplotExample: An artificial example

Page 4: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Motivation

The dependence of species on environmental variablesis often not linear

(often not even increasing/decreasing)

Example:The reproduction rate of bacteria depends on temperature.Low and high temperatures are not good or even lethal.

Page 5: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Motivation

The dependence of species on environmental variablesis often not linear

(often not even increasing/decreasing)

Example:The reproduction rate of bacteria depends on temperature.Low and high temperatures are not good or even lethal.

Page 6: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Motivation

The following figure shows an artificial example of abundanciesof some species along the environmental variable ’temperature’.

10 15 20 25 30

020

4060

8010

0

Temperature

Abu

ndan

cy

● ●

● ●

●●

●●

● ●●

●●

In this artificial example, the niche around 20◦ is prefered.

Page 7: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Motivation

The following figure shows an artificial example of abundanciesof some species along the environmental variable ’temperature’.

10 15 20 25 30

020

4060

8010

0

Temperature

Abu

ndan

cy

● ●

● ●

●●

●●

● ●●

●●

In this artificial example, the niche around 20◦ is prefered.

Page 8: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Motivation

A good and simple model for niches is the so-calledGaussian response model

Zi = C exp(−(Xi − µ)2

2t2

)where C, µ, t are parameters.

This model issimple: only 3 parametersgood: the Gaussian response model is chosen (among allunimodal distributions) because of the normalapproximation (central limit theorem) which loosly speakingsays that if many independent (or at most weaklydependent) effects contribute to a quantity and no effect isdominating, then the quantity is distributed normally.

Page 9: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Motivation

A good and simple model for niches is the so-calledGaussian response model

Zi = C exp(−(Xi − µ)2

2t2

)where C, µ, t are parameters.

This model issimple: only 3 parametersgood: the Gaussian response model is chosen (among allunimodal distributions) because of the normalapproximation (central limit theorem) which loosly speakingsays that if many independent (or at most weaklydependent) effects contribute to a quantity and no effect isdominating, then the quantity is distributed normally.

Page 10: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Motivation

Fitted Gaussian response model

10 15 20 25 30

020

4060

8010

0

Temperature

Abu

ndan

cy

● ●

● ●

●●

●●

● ●●

●●

Optimum µ = 20, maximum C = 100, tolerance t = 2.

Page 11: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Setting

Contents

1 Correspondence analysisMotivationSettingMathematical backgroundExample: Mexican plant dataSite conditional biplot and species conditional biplotExample: An artificial example

Page 12: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Setting

Correspondence analysis

Given: Data frame/matrix YY [i , ·] are the observations of site i

Y [·, j ] are the observations at species j

Goal: Find associations of species and sitesAssumption: There is a niche dependence of the species on the

environmental variables

The setting is formulated here in terms of species and sites.If you have measured quantities (variables) of some objects,then replace ’site’ by ’object’ and ’species’ by ’variable’.

Page 13: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Setting

Correspondence analysis

Given: Data frame/matrix YY [i , ·] are the observations of site i

Y [·, j ] are the observations at species j

Goal: Find associations of species and sites

Assumption: There is a niche dependence of the species on theenvironmental variables

The setting is formulated here in terms of species and sites.If you have measured quantities (variables) of some objects,then replace ’site’ by ’object’ and ’species’ by ’variable’.

Page 14: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Setting

Correspondence analysis

Given: Data frame/matrix YY [i , ·] are the observations of site i

Y [·, j ] are the observations at species j

Goal: Find associations of species and sitesAssumption: There is a niche dependence of the species on the

environmental variables

The setting is formulated here in terms of species and sites.If you have measured quantities (variables) of some objects,then replace ’site’ by ’object’ and ’species’ by ’variable’.

Page 15: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Setting

Correspondence analysis

Given: Data frame/matrix YY [i , ·] are the observations of site i

Y [·, j ] are the observations at species j

Goal: Find associations of species and sitesAssumption: There is a niche dependence of the species on the

environmental variables

The setting is formulated here in terms of species and sites.If you have measured quantities (variables) of some objects,then replace ’site’ by ’object’ and ’species’ by ’variable’.

Page 16: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Mathematical background

Contents

1 Correspondence analysisMotivationSettingMathematical backgroundExample: Mexican plant dataSite conditional biplot and species conditional biplotExample: An artificial example

Page 17: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Mathematical background

Let Y have M rows and N columns.The Gaussian regression from the motivation subsection wouldlead to finding species scores (u1,u2, · · · ,uN) and site scores(l1, l2, · · · , lM) such that

Y [i , k ] ≈ Ck exp(− (li − uk)

2t2k

)In fact, correspondence analysis does not use this approach buta weighted PCA approach which results in a similarrepresentation.

Page 18: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Mathematical background

The approach of correspondence analysis is based on theChi-square statistic which is used for testing the nullhypothesisthat the species do not depend on the sites. In that case wewould have

Y [i , k ]n

≈ Y [i ,+]

n· Y [+, k ]

nwhere

Y [i ,+] =∑

k Y [i , k ] is the row sum,Y [+, k ] =

∑i Y [i , k ] is the column sum and

n = Y [+,+] =∑

i

∑k Y [i , k ] is the total sum.

The Chi-square test statistic is given by

X 2 =∑∑ (Observed − Expected)2

Expected

=∑

i

∑k

(Y [i , k ]/n − Y [i ,+]Y [+, k ]/n2

)2

Y [i ,+]Y [+, k ]/n2

Page 19: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Mathematical background

The approach of correspondence analysis is based on theChi-square statistic which is used for testing the nullhypothesisthat the species do not depend on the sites. In that case wewould have

Y [i , k ]n

≈ Y [i ,+]

n· Y [+, k ]

nwhere

Y [i ,+] =∑

k Y [i , k ] is the row sum,Y [+, k ] =

∑i Y [i , k ] is the column sum and

n = Y [+,+] =∑

i

∑k Y [i , k ] is the total sum.

The Chi-square test statistic is given by

X 2 =∑∑ (Observed − Expected)2

Expected

=∑

i

∑k

(Y [i , k ]/n − Y [i ,+]Y [+, k ]/n2

)2

Y [i ,+]Y [+, k ]/n2

Page 20: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Mathematical background

Instead of frequencies we now consider probabilities

p[i , k ] := Y [i , k ]/n

and define a matrix Q with entries

Q[i , k ] :=p[i , k ]− p[i ,+] · p[+, k ]√

p[i ,+]p[+, k ]

Now all further steps are just as in PCA with thecentred/normalized matrix Y replaced by the association matrixQ. Again we get a distance biplot and a correlation biplot.

Correspondence analysis assessesthe association between species and sites

(or objects and variables)

Page 21: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Mathematical background

Instead of frequencies we now consider probabilities

p[i , k ] := Y [i , k ]/n

and define a matrix Q with entries

Q[i , k ] :=p[i , k ]− p[i ,+] · p[+, k ]√

p[i ,+]p[+, k ]

Now all further steps are just as in PCA with thecentred/normalized matrix Y replaced by the association matrixQ. Again we get a distance biplot and a correlation biplot.

Correspondence analysis assessesthe association between species and sites

(or objects and variables)

Page 22: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: Mexican plant data

Contents

1 Correspondence analysisMotivationSettingMathematical backgroundExample: Mexican plant dataSite conditional biplot and species conditional biplotExample: An artificial example

Page 23: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: Mexican plant data

Is there a difference in species community amoung fourgroups of pastures?Data of vegetation on pastures in Mexico from the dryseason of 1992.see also

Zuur, Ieno, Smith (2007) Analysing Ecological Data.Springer

Page 24: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: Mexican plant data

Meaning of the Variables

Variable names with small letters are shortcuts for plant families.ALTITUDE Altitude above sea level in metersFIELDSLOPE Field slope in degrees

AGE Time since forest clearing; index from 1-8representing ages from 6 to 40 years

CATTLEINTENSITY number of cattle per hectarPLAGUE Nominal: 0=no plague, 1=plague (herbivore insects)

in the year before samplingMAXVEGHEIGHT in centimeter

BLOCK Nominal: 1=grama pastures in Balzapote, 2=starpastures in Balzapote, 3=grama pastures in LaPalma, 2=star pastures in La Palma

(many more variables in original data set)

Page 25: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: Mexican plant data

mplants<-read.table("MexicanPlants.txt",h=T,sep="\t")

species<-mplants[,c("ac","as","co","com","cy","eu",

"grcyn","grresto","la","le","ma","ru","so","vi","vit")]

library(vegan)

mplants_CA<-cca(species)

plot(mplants_CA, scaling=1, cex=2, main="Scaling=1")

round(mplants_CA$CA$tot.chi, 2)

# 0.33

round(mplants_CA$CA$eig[1:2], 2)

# CA1 CA2

# 0.13 0.06

cat("The first two axis explain",

round(sum(mplants_CA$CA$eig[1:2])

/mplants_CA$CA$tot.chi*100), "%", "\n")

# The first two axis explain 57 %

Page 26: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: Mexican plant data

−2 −1 0 1 2 3

−2

−1

01

Scaling = 1

CA1

CA

2

ac

as

co

com

cy

eu

grcyn

grresto

la

le

ma

ru

so

vi

vit

14

1920

21

22

1315

16

171836

7

11

12

4

5

8

9

10

Page 27: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: Mexican plant data

> plot(mplants_CA, scaling = 2, main = "Scaling = 2")

Page 28: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: Mexican plant data

−2 −1 0 1 2 3

−3

−2

−1

01

Scaling = 2

CA1

CA

2

ac

as

co

com

cy

eu

grcyn

grresto

la

lema

ru

so

vi

vit

14

1920

21

22

1315

16

1718

3

6

7

11

12

4

5

8

9

10

Page 29: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Site conditional biplot and species conditional biplot

Contents

1 Correspondence analysisMotivationSettingMathematical backgroundExample: Mexican plant dataSite conditional biplot and species conditional biplotExample: An artificial example

Page 30: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Site conditional biplot and species conditional biplot

The position of a species represents the optimum value in termsof the Gausian response model (niche) along the first andsecond axes. For this reason, species scores are representedas labels or points.

Site conditional biplot (scaling=1)The sites are the centroids of the species, that is, sites areplotted close to the species which occur at those sites.Distances between sites are two-dimensionalapproximations of their Chi-square distances. So sitesclose to each other are similar in terms of the Chi-squaredistance.

Page 31: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Site conditional biplot and species conditional biplot

The position of a species represents the optimum value in termsof the Gausian response model (niche) along the first andsecond axes. For this reason, species scores are representedas labels or points.

Site conditional biplot (scaling=1)The sites are the centroids of the species, that is, sites areplotted close to the species which occur at those sites.Distances between sites are two-dimensionalapproximations of their Chi-square distances. So sitesclose to each other are similar in terms of the Chi-squaredistance.

Page 32: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Site conditional biplot and species conditional biplot

Species conditional biplot (scaling=2)The species are the centroids of the sites, that is, speciesare plotted close to the sites where they occur.Distances between species are two-dimensionalapproximations of their Chi-square distances. So speciesclose to each other are similar in terms of the Chi-squaredistance.

Page 33: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Site conditional biplot and species conditional biplot

There is also a joint plot of species and site scores (scaling=3).In this plot distances between sites and distances betweenspecies can be interpreted as the approximations of therespective Chi-square distances. However the relative positionsof sites and frequencies cannot be interpreted. So this biplot isto be used with care if used at all.

Page 34: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Site conditional biplot and species conditional biplot

Note:The total inertia (or total variance) in correspondenceanalysis is defined as the Chi-square statistic of thesite-by-species table divided by the total number ofobservations.Points further away from the origin in a biplot are the mostinteresting as these points make a relatively highcontribution to the Chi-square statistic. So the further awayfrom the origin a site is plotted, the more different it is fromthe average site.

Page 35: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: An artificial example

Contents

1 Correspondence analysisMotivationSettingMathematical backgroundExample: Mexican plant dataSite conditional biplot and species conditional biplotExample: An artificial example

Page 36: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: An artificial example

Let us look at the following artificial example which is simpleenough so that we can infer site-species correspondence ’byeye’.

Y <- matrix(c(1,0,0,1,1,0,1,1,1),nrow=3); Y

# [,1] [,2] [,3]

# [1,] 1 1 1

# [2,] 0 1 1

# [3,] 0 0 1

myca <- cca(Y)

plot(myca,scaling=1)

plot(myca,scaling=2)

p <- Y/sum(Y)

pr <- apply(p,1,sum)

pc <- apply(p,2,sum)

expec <- as.matrix(pr) %*% t( as.matrix(pc) )

sum( (p-expec)^2/expec ) # = myca$tot.chi = 0.3611

sum(myca$CA$eig/myca$tot.chi) # 1

Page 37: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: An artificial example

−2.0 −1.5 −1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

Scaling=1

CA1

CA

2spe1

spe2

spe3

sit1

sit2

sit3

Page 38: Multivariate Statistics in Ecology and Quantitative ...evol.bio.lmu.de › _statgen › Multivariate › 11SS › ca.pdf · Correspondence analysis Site conditional biplot and species

Correspondence analysis Example: An artificial example

−1.0 −0.5 0.0 0.5 1.0 1.5 2.0

−1.

0−

0.5

0.0

0.5

1.0

Scaling=2

CA1

CA

2

spe1

spe2

spe3

sit1

sit2

sit3