Heterogeneous Data Challenge Combining complex Dataweb.stanford.edu/group/mmds/slides2010/Holmes.pdf · Heterogeneous Data Challenge Combining complex Data Susan Holmes susan/ Bio-X

Heterogeneous Data Challenge
Combining complex Data

Susan Holmes
http://www-stat.stanford.edu/~susan/

Bio-X and Statistics, Stanford University

NSF grant #0241246 and NIH-R01GM086884-2


Anna Karenina, Chapter 1, Leo Tolstoy

"Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way."

. . . . . .

Heterogeneity

- Status: response/explanatory.
- Hidden (latent)/measured.
- Type:
  - Continuous
  - Binary, categorical
  - Graphs/Trees
  - Images
  - Maps/Spatial information
  - Rankings
- Amounts of dependency: independent/time series/spatial.

. . . . . .

Goals in Modern Biology: Systems Approach

Look at the data / all the data: data integration.

[Figure: examples of heterogeneous data to be integrated, including a panel labeled "Tumor Cells", a plot with a horizontal axis from 0 to 20000, and a matrix with entries in {-1, 0, 1}.]

. . . . . .

Taking Categorical Data and Making it into a Continuum

Horseshoe Example: Joint with Persi Diaconis and Sharad Goel. Data from 2005 U.S. House of Representatives roll call votes. We further restricted our analysis to the 401 Representatives that voted on at least 90% of the roll calls (220 Republicans, 180 Democrats and 1 Independent), leading to a 401 × 669 matrix of voting data.

The Data

      V1  V2  V3  V4  V5  V6  V7  V8  V9  V10
  1   -1  -1   1  -1   0   1   1   1   1   1  ...
  2   -1  -1   1  -1   0   1   1   1   1   1  ...
  3    1   1  -1   1  -1   1   1  -1  -1  -1  ...
  4    1   1  -1   1  -1   1   1  -1  -1  -1  ...
  5    1   1  -1   1  -1   1   1  -1  -1  -1  ...
  6   -1  -1   1  -1   0   1   1   1   1   1  ...
  7   -1  -1   1  -1  -1   1   1   1   1   1  ...
  8   -1  -1   1  -1   0   1   1   1   1   1  ...
  9    1   1  -1   1  -1   1   1  -1  -1  -1  ...
 10   -1  -1   1  -1   0   1   1   0   0   0  ...

[Figure: 3-dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes.]

Color has been added to indicate the party affiliation of each representative.
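As a rough illustration of how a vote matrix coded -1/0/1 can be turned into a continuous map, the sketch below runs classical (Torgerson) MDS on toy data. The synthetic "legislators", the mean absolute difference used as dissimilarity, and the helper name `classical_mds` are all assumptions made for this example; the dissimilarity used in the actual analysis may differ.

```python
import numpy as np

def classical_mds(dist, k=3):
    """Classical (Torgerson) MDS: embed n points in k dimensions
    from an n x n matrix of pairwise distances."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J               # double-centered squared distances
    evals, evecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]          # indices of the k largest
    lam = np.clip(evals[order], 0.0, None)       # guard against tiny negative values
    return evecs[:, order] * np.sqrt(lam)

rng = np.random.default_rng(0)
# Hypothetical toy data: 40 "legislators" and 60 "roll calls".
ideology = np.sort(rng.uniform(-1, 1, size=40))
cutpoints = rng.uniform(-1, 1, size=60)
votes = np.where(ideology[:, None] > cutpoints[None, :], 1, -1)

# Assumed dissimilarity: mean absolute difference between vote vectors.
dist = np.abs(votes[:, None, :] - votes[None, :, :]).mean(axis=2)
coords = classical_mds(dist, k=3)
```

In this toy setting the first MDS coordinate orders the rows by the latent `ideology` variable, which is the kind of continuum the slide title refers to.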

. . . . . .

[Figure: scatter plot of National Journal score (vertical axis, 0 to 100) against eigenmap rank (horizontal axis, 0 to 400).]

Comparison of the MDS-derived rank for Representatives with the National Journal's liberal score.

. . . . . .

Iterative Structuration (Tukey, 1977)

A distance-> projection

. . . . . .

Phylogenetic Trees

. . . . . .

. . . . . .

Distances between Trees

- Nearest Neighbor Interchange (NNI): rotation moves.

[Figure: trees on the leaves 0, 1, 2, 3, 4 illustrating NNI rotation moves.]

- Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV). The boundaries between regions represent an area of uncertainty about the exact branching order. In biological terminology this is called an "unresolved" tree.
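A single NNI move can be pictured as follows: an internal edge separates four subtrees, and the move swaps one subtree from each side of that edge. The sketch below lists the two alternative groupings; the nested-tuple representation and the function name `nni_neighbors` are assumptions chosen for illustration only.

```python
def nni_neighbors(a, b, c, d):
    """Given the four subtrees around the internal edge of ((a, b), (c, d)),
    return the two topologies reachable by one nearest neighbor interchange."""
    return [((a, c), (b, d)), ((a, d), (b, c))]

# Toy example with four leaves around one internal edge.
start = (("1", "2"), ("3", "4"))
(a, b), (c, d) = start
neighbors = nni_neighbors(a, b, c, d)
```

Each internal edge therefore contributes two neighboring topologies, which is what makes NNI a natural notion of adjacency between trees.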


. . . . . .

Empirical Evidence on Mixing on Bethe Lattice

E. Mossel noticed that one of the extreme points of tree space with regard to predicting the root was the Bethe lattice:

[Figure: Bethe lattice tree with labels 1 to 65.]

. . . . . .

Seeing the Mutation Rate Gradient

We generated 9 sets of trees with mutation rates set from α = 0.01 to α = 0.09 and we generated the data according to the Bethe lattice tree. Here are the results in the first plane of the MDS:

[Figure: first MDS plane, horizontal axis from −0.4 to 0.4 and vertical axis from −0.06 to 0.06; each point is plotted as a digit from 1 to 9 identifying its set of trees, with additional points marked "O".]

. . . . . .

Nontechnical description of Multi-table methods

Variance, Inertia, Co-Inertia

The study of variability of one continuous variable is done through the use of the variance. We generalize it in several directions through the idea of inertia.

As in physics, we define inertia as a weighted sum of squared distances of weighted points.

This enables us to use abundance data in a contingency table and compute its inertia, which in this case will be the weighted sum of the squares of distances between observed and expected frequencies, such as is used in computing the chi-square statistic.

Another generalization of variance-inertia is the useful phylogenetic diversity index (computing the sum of the squares of distances between a subset of taxa through the tree).

We also have such generalizations that cover variability of points on a graph, taken from standard spatial statistics.
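To make the contingency-table case concrete, here is a minimal sketch on a made-up 2 × 2 table. It computes the inertia from relative frequencies as a weighted sum of squared deviations between observed and expected values; multiplying by the table total gives Pearson's chi-square statistic. The counts and variable names are hypothetical.

```python
import numpy as np

# Hypothetical abundance table (rows: sites, columns: taxa).
counts = np.array([[20.0, 10.0],
                   [5.0, 15.0]])

n = counts.sum()
P = counts / n                       # observed relative frequencies
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
expected = np.outer(r, c)            # frequencies expected under independence

# Weighted sum of squared deviations, with weights 1 / expected.
inertia = float(((P - expected) ** 2 / expected).sum())
chi2 = n * inertia                   # Pearson chi-square on the raw counts
```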

Co-Inertia

When studying two variables measured at the same locations, for instance pH and humidity, the standard quantification of covariation is the covariance. A simple generalization of this, when the variability is more complicated to measure as above, is done through co-inertia analysis (CIA).

Co-inertia analysis is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples or the same time points. That is, the rows or columns of the matrix have to be weighted similarly and thus must be matchable.

. . . . . .

RV coefficient

The global measure of similarity of two data tables, as opposed to two vectors, can be done by a generalization of covariance provided by an inner product between tables that gives the RV coefficient, a number between 0 and 1, like a correlation coefficient, but for tables.
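The following sketch computes an RV coefficient between two tables that share the same rows, using uniform row weights and identity column metrics (both are simplifying assumptions, as is the function name `rv_coefficient`). It illustrates that a rotated copy of a table has RV equal to 1, while an unrelated table gives a value between 0 and 1.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two tables with matching rows,
    computed from the cross-product operators of the centered tables."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Wx = Xc @ Xc.T
    Wy = Yc @ Yc.T
    return np.trace(Wx @ Wy) / np.sqrt(np.trace(Wx @ Wx) * np.trace(Wy @ Wy))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal matrix

rv_same = rv_coefficient(X, X @ R)               # same configuration, rotated
rv_other = rv_coefficient(X, rng.normal(size=(30, 5)))
```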

. . . . . .

Multiple table methods

In PCA we compute the variance-covariance matrix; in multiple table methods we can take a cube of tables and compute the RV coefficient of their characterizing operators. We then diagonalize this and find the best weighted "ensemble". This is called the "compromise" and all the individual tables can be projected onto it.

. . . . . .

Data Matrix: Geometrical Approach

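i. The data are p variables measured on n observations.
ii. X with n rows (the observations) and p columns (the variables).
iii. $D_n$ is an n × n matrix of weights on the "observations", which is most often diagonal.
iv. Symmetric positive definite matrix Q, often

\[
Q = \begin{pmatrix}
\frac{1}{\sigma_1^2} & 0 & 0 & \cdots & 0 \\
0 & \frac{1}{\sigma_2^2} & 0 & \cdots & 0 \\
0 & 0 & \frac{1}{\sigma_3^2} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \frac{1}{\sigma_p^2}
\end{pmatrix}.
\]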

. . . . . .

Euclidean Spaces

These three matrices form the essential "triplet" (X, Q, D) defining a multivariate data analysis. Q and D define geometries or inner products in $\mathbb{R}^p$ and $\mathbb{R}^n$, respectively, through

\[
x^t Q y = \langle x, y \rangle_Q, \quad x, y \in \mathbb{R}^p,
\]
\[
x^t D y = \langle x, y \rangle_D, \quad x, y \in \mathbb{R}^n.
\]

This simple type of inner product has been generalized to kernels in Elizabeth Purdom's thesis (2008).
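A small numerical sketch of the two geometries, under the assumption that Q is the diagonal matrix of inverse variances and D gives uniform weights 1/n: the Q-distance between two observations equals the ordinary Euclidean distance after standardizing the variables, and the D-norm of a centered column is its variance. The data and the helper `inner` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy table with variables on very different scales.
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
n, p = X.shape

D = np.eye(n) / n                        # uniform weights on observations
Q = np.diag(1.0 / X.var(axis=0))         # inverse variances on variables

def inner(x, y, M):
    """Inner product <x, y>_M = x^t M y."""
    return float(x @ M @ y)

x, y = X[0], X[1]
dist_Q = np.sqrt(inner(x - y, x - y, Q))             # distance in the Q geometry
dist_std = np.linalg.norm((x - y) / X.std(axis=0))   # distance after standardizing

col = X[:, 0] - X[:, 0].mean()
var_D = inner(col, col, D)                           # squared D-norm of a centered column
```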

. . . . . .

An Algebraic Approach

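- Q can be seen as a linear function from $\mathbb{R}^p$ to $\mathbb{R}^{p*} = \mathcal{L}(\mathbb{R}^p)$, the space of scalar linear functions on $\mathbb{R}^p$.
- D can be seen as a linear function from $\mathbb{R}^n$ to $\mathbb{R}^{n*} = \mathcal{L}(\mathbb{R}^n)$.

\[
\begin{array}{ccc}
\mathbb{R}^{p*} & \xrightarrow{\;X\;} & \mathbb{R}^{n} \\
Q \uparrow \ \downarrow V & & D \downarrow \ \uparrow W \\
\mathbb{R}^{p} & \xleftarrow{\;X^t\;} & \mathbb{R}^{n*}
\end{array}
\]

Duality diagram

i. Eigendecomposition of $X^t D X Q = VQ$.
ii. Eigendecomposition of $X Q X^t D = WD$.
iii. Transition formulae.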

. . . . . .

Notes

(1) Suppose we have data and inner products defined by Q and D:

\[
(x, y) \in \mathbb{R}^p \times \mathbb{R}^p \longmapsto x^t Q y = \langle x, y \rangle_Q \in \mathbb{R},
\]
\[
(x, y) \in \mathbb{R}^n \times \mathbb{R}^n \longmapsto x^t D y = \langle x, y \rangle_D \in \mathbb{R}.
\]

\[
\|x\|_Q^2 = \langle x, x \rangle_Q = \sum_{j=1}^{p} q_j (x_{\cdot j})^2, \qquad
\|x\|_D^2 = \langle x, x \rangle_D = \sum_{i=1}^{n} p_i (x_{i \cdot})^2.
\]

(2) We say an operator O is B-symmetric if $\langle x, Oy \rangle_B = \langle Ox, y \rangle_B$, or equivalently $BO = O^t B$. The duality diagram is equivalent to a triplet (X, Q, D) such that X is n × p. Escoufier (1977) defined $X Q X^t D = WD$ and $X^t D X Q = VQ$ as the characteristic operators of the diagram.

(3) $V = X^t D X$ will be the variance-covariance matrix if X is centered with regard to D ($X^t D 1_n = 0$).
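Note (3) can be checked numerically. The sketch below uses non-uniform observation weights (a hypothetical choice), centers the table with respect to D, and compares $X^t D X$ with NumPy's weighted covariance.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 12, 3
X_raw = rng.normal(size=(n, p))

# Hypothetical non-uniform observation weights summing to one.
w = rng.uniform(0.5, 1.5, size=n)
w = w / w.sum()
D = np.diag(w)
ones = np.ones(n)

# Center with respect to D: subtract the D-weighted mean of each column.
X = X_raw - np.outer(ones, w @ X_raw)

V = X.T @ D @ X
reference = np.cov(X_raw, rowvar=False, aweights=w, ddof=0)
```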

. . . . . .

Transposable Data

There is an important symmetry between the rows and columns of X in the diagram, and one can imagine situations where the role of observation or variable is not uniquely defined. For instance, in microarray studies the genes can be considered either as variables or observations. This makes sense in many contemporary situations which evade the more classical notion of n observations seen as a random sample of a population. It is certainly not the case that the 9,000 species are a random sample of bacteria, since these probes try to be an exhaustive set.

. . . . . .

Two Dual Geometries

. . . . . .

Properties of the Diagram

Rank of the diagram: $X$, $X^t$, $VQ$ and $WD$ all have the same rank. For Q and D symmetric matrices, $VQ$ and $WD$ are diagonalisable and have the same eigenvalues,

\[
\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots \geq \lambda_r \geq 0 \geq \cdots \geq 0.
\]

Eigendecomposition of the diagram: $VQ$ is Q-symmetric, thus we can find Z such that

\[
VQZ = Z\Lambda, \quad Z^t Q Z = I_p, \quad \text{where } \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p). \tag{1}
\]
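These properties can be illustrated on simulated data. The sketch below assumes a diagonal Q of inverse variances and uniform weights in D; it obtains Z by symmetrizing VQ with $Q^{1/2}$, which is one convenient route among several, and compares the spectra of VQ and WD.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                 # centered for uniform weights
D = np.eye(n) / n
q = 1.0 / X.var(axis=0)
Q = np.diag(q)

V = X.T @ D @ X
W = X @ Q @ X.T
VQ = V @ Q                             # p x p characteristic operator
WD = W @ D                             # n x n characteristic operator

# Symmetrize: S = Q^{1/2} V Q^{1/2} has the same eigenvalues as VQ.
Q_half = np.diag(np.sqrt(q))
Q_half_inv = np.diag(1.0 / np.sqrt(q))
lam, U = np.linalg.eigh(Q_half @ V @ Q_half)
lam, U = lam[::-1], U[:, ::-1]         # decreasing order
Z = Q_half_inv @ U                     # Q-orthonormal eigenvectors of VQ

eig_WD = np.sort(np.linalg.eigvals(WD).real)[::-1]
```

The first p eigenvalues of WD coincide with those of VQ and the remaining n − p are zero up to rounding.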

. . . . . .

Features

1. Inertia: Trace(VQ) = Trace(WD) (inertia in the sense of Huygens' inertia formula, for instance; Huygens, C. (1657)),

\[
\sum_{i=1}^{n} p_i \, d^2(x_i, a),
\]

the inertia with regard to a point a of a cloud of $p_i$-weighted points.

For PCA with $Q = I_p$, $D = \frac{1}{n} I_n$, and centered variables, the inertia is the sum of the variances of all the variables. If the variables are standardized (Q is the diagonal matrix of inverse variances), then the inertia is the number of variables p. For correspondence analysis the inertia is the chi-squared statistic.
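The two PCA statements can be verified on a toy table. The sketch assumes uniform weights $p_i = 1/n$ and takes the point a to be the weighted centroid; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 25, 3
X = rng.normal(size=(n, p)) * np.array([1.0, 2.0, 5.0])

weights = np.full(n, 1.0 / n)          # p_i = 1/n
center = weights @ X                   # the point a: weighted centroid
Xc = X - center

# Inertia as a weighted sum of squared distances to the centroid (Q = I).
inertia_points = float(np.sum(weights * np.sum(Xc ** 2, axis=1)))
sum_of_variances = float(X.var(axis=0).sum())

# Standardized case: Q is the diagonal matrix of inverse variances.
D = np.diag(weights)
Q = np.diag(1.0 / X.var(axis=0))
inertia_standardized = float(np.trace(Xc.T @ D @ Xc @ Q))
```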

. . . . . .

Comparing Two Diagrams: the RV coefficient

Many problems can be rephrased in terms of comparison of two "duality diagrams" or, put more simply, two characterizing operators, built from two "triplets", usually with one of the triplets being a response or having constraints imposed on it. Most often what is done is to compare two such diagrams, and try to get one to match the other in some optimal way.

To compare two symmetric operators, there is either a vector covariance as inner product, $\operatorname{covV}(O_1, O_2) = \operatorname{Tr}(O_1 O_2) = \langle O_1, O_2 \rangle$, or a vector correlation (Escoufier, 1977)

\[
RV(O_1, O_2) = \frac{\operatorname{Tr}(O_1 O_2)}{\sqrt{\operatorname{Tr}(O_1^t O_1)\,\operatorname{Tr}(O_2^t O_2)}}.
\]

If we were to compare the two triplets $(X_{n \times 1}, 1, \frac{1}{n} I_n)$ and $(Y_{n \times 1}, 1, \frac{1}{n} I_n)$ we would have $RV = \rho^2$.
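The single-variable case is easy to check numerically. In this sketch the two centered vectors are simulated, and the helper `rv` is a direct transcription of the formula above rather than a library function.

```python
import numpy as np

def rv(O1, O2):
    """Vector correlation between two operators."""
    return np.trace(O1 @ O2) / np.sqrt(np.trace(O1.T @ O1) * np.trace(O2.T @ O2))

rng = np.random.default_rng(6)
n = 40
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)
x = x - x.mean()                       # centered, as the triplets assume
y = y - y.mean()

D = np.eye(n) / n
Ox = np.outer(x, x) @ D                # X Q X^t D with Q = 1
Oy = np.outer(y, y) @ D

rho = np.corrcoef(x, y)[0, 1]
rv_xy = rv(Ox, Oy)
```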

PCA: Special case

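PCA can be seen as finding the matrix Y which maximizes the RV coefficient between characterizing operators, that is, between $(X_{n \times p}, Q, D)$ and $(Y_{n \times q}, I, D)$, under the constraint that Y be of rank q < p.

\[
RV\left(X Q X^t D,\; Y Y^t D\right) =
\frac{\operatorname{Tr}\left(X Q X^t D \, Y Y^t D\right)}
{\sqrt{\operatorname{Tr}\left((X Q X^t D)^2\right)\,\operatorname{Tr}\left((Y Y^t D)^2\right)}}.
\]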

. . . . . .

This maximum is attained where Y is chosen as the first q eigenvectors of $X Q X^t D$ normed so that $Y^t D Y = \Lambda_q$. The maximum RV is

\[
RV_{\max} = \sqrt{\frac{\sum_{i=1}^{q} \lambda_i^2}{\sum_{i=1}^{p} \lambda_i^2}}.
\]

Of course, classical PCA has $D = \frac{1}{n} I$, $Q = I$, but the extra flexibility is often useful. We define the distance between triplets (X, Q, D) and (Z, Q, M), where Z is also n × p, as the distance deduced from the RV inner product between operators $X Q X^t D$ and $Z M Z^t D$.
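As a sanity check in the classical setting ($D = \frac{1}{n} I$, $Q = I$), the sketch below builds Y from the first q eigenvectors with the stated normalization, computes the RV coefficient directly from its definition, and compares it with the eigenvalue expression $\sqrt{\sum_{i \le q} \lambda_i^2 / \sum_{i \le p} \lambda_i^2}$. The simulated table and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, q = 30, 5, 2
X = rng.normal(size=(n, p)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
X = X - X.mean(axis=0)

D = np.eye(n) / n
O1 = X @ X.T @ D                           # X Q X^t D with Q = I

# With D = I/n the operator is symmetric, so eigh applies.
lam, U = np.linalg.eigh(X @ X.T / n)
lam, U = lam[::-1], U[:, ::-1]             # decreasing order

# D-normed eigenvectors are sqrt(n) * U; scale so that Y^t D Y = Lambda_q.
Y = np.sqrt(n) * U[:, :q] * np.sqrt(lam[:q])
O2 = Y @ Y.T @ D

rv_pca = np.trace(O1 @ O2) / np.sqrt(np.trace(O1 @ O1) * np.trace(O2 @ O2))
rv_formula = np.sqrt(np.sum(lam[:q] ** 2) / np.sum(lam[:p] ** 2))
```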

. . . . . .

One Diagram to replace Two Diagrams

Canonical correlation analysis was introduced by Hotelling to find the common structure in two sets of variables $X_1$ and $X_2$ measured on the same observations. This is equivalent to merging the two matrices columnwise to form a large matrix with n rows and $p_1 + p_2$ columns and taking as the weighting of the variables the matrix defined by the two diagonal blocks $(X_1^t D X_1)^{-1}$ and $(X_2^t D X_2)^{-1}$:

\[
Q = \begin{pmatrix}
(X_1^t D X_1)^{-1} & 0 \\
0 & (X_2^t D X_2)^{-1}
\end{pmatrix}.
\]

. . . . . .

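\[
\begin{array}{ccc}
\mathbb{R}^{p_1 *} & \xrightarrow{\;X_1\;} & \mathbb{R}^{n} \\
I_{p_1} \uparrow \ \downarrow V_1 & & D \downarrow \ \uparrow W_1 \\
\mathbb{R}^{p_1} & \xleftarrow{\;X_1^t\;} & \mathbb{R}^{n*}
\end{array}
\qquad
\begin{array}{ccc}
\mathbb{R}^{p_2 *} & \xrightarrow{\;X_2\;} & \mathbb{R}^{n} \\
I_{p_2} \uparrow \ \downarrow V_2 & & D \downarrow \ \uparrow W_2 \\
\mathbb{R}^{p_2} & \xleftarrow{\;X_2^t\;} & \mathbb{R}^{n*}
\end{array}
\]

\[
\begin{array}{ccc}
\mathbb{R}^{(p_1 + p_2) *} & \xrightarrow{\;[X_1; X_2]\;} & \mathbb{R}^{n} \\
Q \uparrow \ \downarrow V & & D \downarrow \ \uparrow W \\
\mathbb{R}^{p_1 + p_2} & \xleftarrow{\;[X_1; X_2]^t\;} & \mathbb{R}^{n*}
\end{array}
\]

This analysis gives the same eigenvectors as the analysis of the triple $(X_2^t D X_1, (X_1^t D X_1)^{-1}, (X_2^t D X_2)^{-1})$, also known as the canonical correlation analysis of $X_1$ and $X_2$.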
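The link between the merged diagram and canonical correlations can be seen through eigenvalues. With the block-diagonal Q, the operator VQ has identity diagonal blocks, so its eigenvalues are $1 \pm \rho_i$ (plus ones when $p_1 \neq p_2$), where the $\rho_i$ are the canonical correlations. The sketch below checks this on simulated data; computing the $\rho_i$ from QR factors is a standard numerical route chosen here for convenience.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p1, p2 = 200, 2, 3
latent = rng.normal(size=(n, 1))              # shared signal (toy assumption)
X1 = latent + rng.normal(size=(n, p1))
X2 = latent + rng.normal(size=(n, p2))
X1 = X1 - X1.mean(axis=0)
X2 = X2 - X2.mean(axis=0)

D = np.eye(n) / n
S11 = X1.T @ D @ X1
S22 = X2.T @ D @ X2
Q = np.block([[np.linalg.inv(S11), np.zeros((p1, p2))],
              [np.zeros((p2, p1)), np.linalg.inv(S22)]])

X = np.hstack([X1, X2])                       # merged table [X1; X2]
VQ = X.T @ D @ X @ Q
eig_merged = np.sort(np.linalg.eigvals(VQ).real)[::-1]

# Canonical correlations: singular values of Q1^t Q2 for orthonormal bases.
Q1, _ = np.linalg.qr(X1)
Q2, _ = np.linalg.qr(X2)
rho = np.linalg.svd(Q1.T @ Q2, compute_uv=False)

expected = np.sort(np.concatenate([1 + rho, 1 - rho, np.ones(p2 - p1)]))[::-1]
```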

PCA with regards to Instrumental Variables

C. R. Rao, 1964: Explain one matrix by another (one matrix is a response, the other explanatory). It is the extension of PCA and regression. If Z is the explanatory table and X is the response, we take the projector

\[
P_Z = Z (Z^t D Z)^{-1} Z^t D, \qquad \hat{X} = P_Z X \ \text{(the predicted values)}.
\]

Take the triplet $(\hat{X}, Q, D)$ and do the PCA. See [6] for more details.
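A minimal sketch of these two steps on simulated tables, assuming uniform weights in D and Q = I: the projector gives the fitted values, and the PCA of the fitted table has at most as many nonzero eigenvalues as there are explanatory variables.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, k = 50, 4, 2
Z = rng.normal(size=(n, k))                            # explanatory table
X = Z @ rng.normal(size=(k, p)) + 0.5 * rng.normal(size=(n, p))
Z = Z - Z.mean(axis=0)
X = X - X.mean(axis=0)

D = np.eye(n) / n
Q = np.eye(p)

# D-orthogonal projector onto the column space of Z.
P_Z = Z @ np.linalg.solve(Z.T @ D @ Z, Z.T @ D)
X_hat = P_Z @ X                                        # predicted values

# PCA of the triplet (X_hat, Q, D); symmetric here because Q = I.
V_hat = X_hat.T @ D @ X_hat @ Q
eigvals = np.sort(np.linalg.eigvalsh(V_hat))[::-1]
```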

. . . . . .

Integrating Spatial Information into the triplet

If we let the explanatory table Z contain the spatial information, we are integrating the spatial information into the multivariate analysis. Another solution, explained in Dray and Jombart's paper, is to study the co-inertia of X and WX, the spatially lagged version of X.

. . . . . .

Spatial Multivariate Output

. . . . . .

Jean Thioulouse uses the generalized notion of co-inertia to analyze these complex data:

[Figure: two data cubes, Ephemeroptera species (13) and environmental variables (10), each recorded at 6 sites in Spring, Summer, Autumn and Winter.]

An example data set consists of two data cubes. The first one contains 10 environmental variables that have been measured four times (in Spring, Summer, Autumn and Winter) along six sampling sites. The second one contains the numbers of Ephemeroptera belonging to 13 species, collected in the same conditions.

Complex Output

. . . . . .

Benefitting from the tools and schools of Statisticians...

Thanks to the R community, in particular Chessel, Jombart, Dray, Thioulouse (ade4 group in Lyon) and Emmanuel Paradis.

Collaborators: Persi Diaconis, Sharad Goel, John Chakerian, Adam Guetz, Adam Kapelner, Elizabeth Purdom, Omar De la Cruz, Nelson Ray, Yves Escoufier.

Data: Alfred Spormann, Peter Lee, Francesca Setiadi.

Funding from NIH/NIGMS R01 and NSF-DMS.

. . . . . .

[1] L. Billera, S. Holmes, and K. Vogtmann. The geometry of tree space. Adv. Appl. Maths, 771--801, 2001.

[2] J. Chakerian and S. Holmes. Computational tools for evaluating phylogenetic and hierarchical clustering trees. Technical Report 1006.1015, arXiv, 2010.

[3] P. Diaconis, S. Goel, and S. Holmes. Horseshoes in multidimensional scaling and kernel methods. Annals of Applied Statistics, 2007.

[4] S. Dray and T. Jombart. Revisiting Guerry's data: Introducing spatial constraints in multivariate analysis. Annals of Applied Statistics, 2010.

[5] Y. Escoufier. Operators related to a data matrix. In J. R. Barra and coll., editors, Recent Developments in Statistics, pages 125--131. North Holland, 1977.

[6] S. Holmes. Multivariate analysis: The French way. Festschrift for David Freedman, IMS, 2006.

[7] K. Mardia, J. Kent, and J. Bibby. Multivariate Analysis. Academic Press, NY, 1979.

[8] E. Mossel. Phase transitions in phylogeny. Trans. Amer. Math. Soc., 356(6):2379--2404 (electronic), 2004.

[9] C. R. Rao. The use and interpretation of principal component analysis in applied research. Sankhya A, 26:329--359, 1964.

[10] J. Thioulouse. Simultaneous analysis of a sequence of paired ecological tables: A comparison of several methods. Annals of Applied Statistics, 2010. To appear.

. . . . . .
