Upload
trinhkhanh
View
221
Download
0
Embed Size (px)
Citation preview
Heterogeneous Data ChallengeCombining complex Data
Susan Holmeshttp://www-stat.stanford.edu/˜susan/
Bio-X and Statistics, Stanford University
NSF grant #0241246 and NIH-R01GM086884-2
ABabcdfghiejkl. . . . . .
Anna Karenina, Chapter 1, Leo Tolstoy
`Homogeneous data are all alike;
all heterogeneous data are heterogeneous
in their own way.'
. . . . . .
HeterogeneityI Status : response/ explanatory.I Hidden (latent)/measured.I Type :
I ContinuousI Binary, categoricalI Graphs/ TreesI ImagesI Maps/ Spatial InformationI Rankings
I Amounts of dependency: independent/timeseries/spatial.
. . . . . .
Goals in Modern Biology: Systems ApproachLook at the data/ all the data: data integration
. . . . . .
Goals in Modern Biology: Systems ApproachLook at the data/ all the data: data integration
Tumor Cells
0 5000 10000 15000 20000
05
00
01
00
00
15
00
02
00
00
05
e-0
40
.00
10 1 1 0 -1 1 0 0 0 -1
0 1 1 0 0 0 0 0 0 1
0 1 -1 0 -1 0 0 0 0 -1
0 1 1 0 0 -1 1 0 1 1
0 1 1 0 0 0 0 1 0 1( (
. . . . . .
Taking Categorical Data and Making it into aContinuum
Horseshoe Example:Joint with Persi Diaconis and SharadGoel. Data from 2005 U.S. House of Representativesroll call votes. We further restricted our analysis tothe 401 Representatives that voted on at least 90% of theroll calls (220 Republicans, 180 Democrats and 1Independent) leading to a 401× 669 matrix of voting data.
The DataV1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 -1 -1 1 -1 0 1 1 1 1 1 ...2 -1 -1 1 -1 0 1 1 1 1 1 ...3 1 1 -1 1 -1 1 1 -1 -1 -1 ...4 1 1 -1 1 -1 1 1 -1 -1 -1 ...5 1 1 -1 1 -1 1 1 -1 -1 -1 ...6 -1 -1 1 -1 0 1 1 1 1 1 ...7 -1 -1 1 -1 -1 1 1 1 1 1 ...8 -1 -1 1 -1 0 1 1 1 1 1 ...9 1 1 -1 1 -1 1 1 -1 -1 -1 ...10 -1 -1 1 -1 0 1 1 0 0 0 .... . . . . .
!0.1!0.05
00.05
0.1!0.2
!0.1
0
0.1
0.2
!0.2
!0.15
!0.1
!0.05
0
0.05
0.1
0.15
3-Dimensional MDS mapping of legislators based on the2005 U.S. House of Representatives roll call votes.
. . . . . .
!0.1!0.05
00.05
0.1!0.2
!0.1
0
0.1
0.2
!0.2
!0.15
!0.1
!0.05
0
0.05
0.1
0.15
3-Dimensional MDS mapping of legislators based on the2005 U.S. House of Representatives roll call votes.
Color has been added to indicate the party affiliation ofeach representative.
. . . . . .
0 50 100 150 200 250 300 350 4000
10
20
30
40
50
60
70
80
90
100
Eigenmap Rank
Natio
nal J
ourn
al S
core
Comparison of the MDS derived rank for Representativeswith the National Journal's liberal score
. . . . . .
Iterative Structuration (Tukey, 1977)
A distance-> projection
. . . . . .
Phylogenetic Trees
. . . . . .
. . . . . .
Distances between Trees
I Nearest Neighbor Interchange (NNI).
Rotation Moves
4
0
2 31
0
4321
0
41 32
I Fill-in of NNI moves: Billera, Holmes, Vogtmann(BHV).The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved'tree.
. . . . . .
Distances between Trees
I Nearest Neighbor Interchange (NNI). Rotation Moves
4
0
2 31
0
4321
0
41 32
I Fill-in of NNI moves: Billera, Holmes, Vogtmann(BHV).The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved'tree.
. . . . . .
Distances between Trees
I Nearest Neighbor Interchange (NNI). Rotation Moves
4
0
2 31
0
4321
0
41 32
I Fill-in of NNI moves: Billera, Holmes, Vogtmann(BHV).The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved'tree.
. . . . . .
Distances between Trees
I Nearest Neighbor Interchange (NNI). Rotation Moves
4
0
2 31
0
4321
0
41 32
I Fill-in of NNI moves: Billera, Holmes, Vogtmann(BHV).The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved'tree.
. . . . . .
Distances between Trees
I Nearest Neighbor Interchange (NNI). Rotation Moves
4
0
2 31
0
4321
0
41 32
I Fill-in of NNI moves: Billera, Holmes, Vogtmann(BHV).The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved'tree.
. . . . . .
Empirical Evidence on Mixing on Bethe LatticeE. Mossel noticed that one of the extreme points oftree space with regards to predicting the root was theBethe Lattice:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465
. . . . . .
Seeing the Mutation Rate GradientWe generated 9 sets of trees with mutation rates setfrom α = 0.01 to α = 0.09 and we generated the dataaccording to the Bethe lattice tree.Here are the results in the first plane of the MDS:
−0.4 −0.2 0.0 0.2 0.4
−0.
06−
0.04
−0.
020.
000.
020.
040.
06
OO
12
34
5
6
7
8
9
1
1
1
1
1
1
11
1
1
2
22
222
2
2
22
33
3
3
33
333
3
444
4
4
444
4
4
55
55
5
5
5
5
5
5
6
6666
6
6
6
6
6
7
7
7
777
77
778
8
8
88
8
8
8
88
9
999
9
9
9
9
9
9
1111
111
11
1
1111
1
1
111
1
1
1111
1
11
11
111
111
1
1111
1111
11
111
1
1
1
111
1
1
1
11
1
11
11
1
1
11
11
1
1
11
1
11
1
1
1111
11
1
1
1
11
111
11
111
2
2
22
22
222
2
2
2
222
2
222
2
2
2
222
2
2
2
2
2
2
2
22
22
2
2
22
22
2
2
2
2
2
2
2
2
2
2
22
2
2
2
2
2
2
2
2
2
2
22
2
22
22
2
2
2
2
2
2
2
2
2
2
2
2
2
22
2
2
22
2
2
2
2
2
222
2
2
3
33
3
3
33
3
3
33
33
3
3
3
3
3
3
3
3
3333
3
3
3
33
3
3
333
3
33
3
3
3
3
33
3
3
33
33
3
3
3
3
3
3
3
33
33
33
3
3
3
33
3
3
33
3
3
3
3
3
33
3
3
3
3
33
3
3
3
33
3
3
3
3
3
3
333
3
44
4
4
44
4
4444
4
4
4
44
4
4
44
4
4
4
4
44444
4
4
4
4
44
44
4
4
4
44
4
4
44
4
4
44
44
4
4
4
4
4
4
4
4
4
44
4
4
4
4
4
4
4
44
4
4
4
4
4
4
4
4
44444
44
4
4444
44
4
4
4
44
4
5
5
5
5
5
555
5
5
5
55
5
5
5
55
5
5
5
5
55
5
5
5
5
5
5
5
5
5
5
5
5
5
5
55
55 5
5
5
5
5
5
5
5
55
5
5
5
5
55
5
5
5 55
5
55
5
555
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
55
5
5
55
5
5
5
5
5
55
55
6 66
6
6
6 66
66
66
6
6
6
6
6
6
666
6
666
6
6
6
6
66
6
6
6
6
6
6
66
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
66
6
6
6
6
6
66
6
6
6
6
6
6
6
6
6
66
6
6
6
66
6666
66
6
6
6
66
6
6
6
66
6 7
7
7
77
7
7
7
7
7
7
7
7
7777
7
7
777
7
7
7
7
7
7
7
77
7
7
77777
7
7
7
7
7
7
7
77
7
7
77
7
7
7
7
77
77
7
777
77
77
7
7
7
7
7777
7
7
7
7
7
7
7
7
77
7
77
7
7
7
77
777
7
7
77
8
8 8
8
8
88
8
8
88
8
88
8
8
88
8
8
8
8
8
8
8
8
8
8
88
8
8
8
8
8
888
88
8
88
8
8
8
88
8
888
8
8
88
8
8
88
8
8
8
8
88
8
88
8
8
88
8
8
8
8
8
88888
8
8
8
8
8
88
8
8
8
8888
8
88
99
9
9
9
99
9
9
9
99
99
9
99
9
9
9
9
99
9
9
9
9
9
9
99
9
9
9
99
9 9
9
9
9
9
9
9
9
999
99
9
99
9
9
9
9
9
9
9
99
99
9
9999
9
9
9
9
9
9
9
9
9
9
9
9
9
9
99
9
9
9
9
99
9
99
9
9
9
9
9
9
O
. . . . . .
Nontechnical description of Multi-table methodsVariance, Inertia, Co-Inertia The study of variability ofone continuous variable is done through the use of thevariance. We generalize it in several directions throughthe idea of inertia.As in physics, we define inertia as a weighted sum ofdistances of weighted points.This enables us to use abundance data in a contingencytable and compute its inertia which in this case will bethe weighted sum of the squares of distances betweenobserved and expected frequencies, such as is used incomputing the chisquare statistic.Another generalization of variance-inertia is the usefulPhylogenetic diversity index. (computing the sum of thesquares of distances between a subset of taxa throughthe tree).We also have such generalizations that cover variabilityof points on a graph taken from standard spatialstatistics. . . . . . .
Co-Inertia
When studying two variables measured at the samelocations, for instance PH and humidity the standardquantification of covariation is the covariance. A simplegeneralization to this when the variability is morecomplicated to measure as above is done throughCo-Inertia analysis (CIA).Co-inertia analysis (CIA) is a multivariate method thatidentifies trends or co-relationships in multiple datasetswhich contain the same samples or the same timepoints. That is the rows or columns of the matrix haveto be weighted similarly and thus must be matchable.
. . . . . .
RV coefficient
The global measure of similarity of two data tables asopposed to two vectors can be done by a generalizationof covariance provided by an inner product between tablesthat gives the RV coefficient, a number between 0 and 1,like a correlation coefficient, but for tables.
. . . . . .
Multiple table methods
In PCA we compute the variance-covariance matrix, inmultiple table methods we can take a cube of tables andcompute the RV coefficient of their characterizingoperators.We then diagonalize this and find the best weighted`ensemble'.This is called the `compromise' and all the individualtables can be projected onto it.
. . . . . .
Data Matrix: Geometrical Approach
i. The data are p variables measured on n observations.ii. X with n rows (the observations) and p columns
(the variables).iii. Dn is an n× n matrix of weights on the
``observations'', which is most often diagonal.iv Symmetric definite positive matrix Q,often
Q =
1σ21
0 0 0 ...
0 1σ22
0 0 ...
0 0 1σ23
0 ...
... ... ... 0 1σ2p
.
. . . . . .
Euclidean Spaces
These three matrices form the essential ``triplet"(X,Q,D) defining a multivariate data analysis.Q and D define geometries or inner products in Rp and Rn,respectively, through
xtQy =< x, y >Q x, y ∈ Rp
xtDy =< x, y >D x, y ∈ Rn.
This simple type of inner product has been generalizedto Kernels in Elizabeth Purdom's thesis (2008).
. . . . . .
An Algebraic Approach
I Q can be seen as a linear function from Rp toRp∗ = L(Rp), the space of scalar linear functions onRp.
I D can be seen as a linear function from Rn toRn∗ = L(Rn).
IRp∗ −−−−→
XRn
Qx yV D
y xW
Rp ←−−−−Xt
Rn∗
. . . . . .
An Algebraic Approach
Rp∗ −−−−→X
Rn
Qx yV D
y xW
Rp ←−−−−Xt
Rn∗
Duality diagrami. Eigendecomposition of XtDXQ = VQii. Eigendecomposition of XQXtD = WDiii. Transition Formulae.
. . . . . .
Notes
(1) Suppose we have data and inner products defined by Qand D :
(x, y) ∈ Rp × Rp 7−→ xtQy = < x, y >Q∈ R
(x, y) ∈ Rn × Rn 7−→ xtDy = < x, y >D∈ R.
||x||2Q =< x,x >Q=p∑
j=1
qj(x.j)2 ||x||2D =< x,x >D=
p∑j=1
pi(xi.)2
(2) We say an operator O is B-symmetric if< x,Oy >B=< Ox, y >B, or equivalently BO = OtB.The duality diagram is equivalent to (X,Q,D) such that Xis n× p .Escoufier (1977) defined as XQXtD = WD and XtDXQ = VQas the characteristic operators of the diagram.
. . . . . .
(3) V = XtDX will be the variance-covariance matrix, if Xis centered with regards to D (X′D1n = 0).
. . . . . .
Transposable Data
There is an important symmetry between the rows andcolumns of X in the diagram, and one can imaginesituations where the role of observation or variable isnot uniquely defined. For instance in microarray studiesthe genes can be considered either as variables orobservations. This makes sense in many contemporarysituations which evade the more classical notion of nobservations seen as a random sample of a population.It is certainly not the case that the 9,000 species are arandom sample of bacteria since these probes try to bean exhaustive set.
. . . . . .
Two Dual Geometries
. . . . . .
Properties of the Diagram
Rank of the diagram: X,Xt,VQ and WD all have the samerank.For Q and D symmetric matrices, VQ and WD arediagonalisable and have the same eigenvalues.
λ1 ≥ λ2 ≥ λ3 ≥ . . . ≥ λr ≥ 0 ≥ · · · ≥ 0.
Eigendecomposition of the diagram: VQ is Q symmetric,thus we can find Z such that
VQZ = ZΛ,ZtQZ = Ip, where Λ = diag(λ1, λ2, . . . , λp). (1)
. . . . . .
Features1. Inertia : Trace(VQ) = Trace(WD)(inertia in the sense of Huyghens inertia formula forinstance). Huygens, C. (1657),
n∑i=1
pid2(xi, a)
Inertia with regards to a point a of a cloud ofpi-weighted points.PCA with Q = Ip, D = 1
nIn, and the variables are centered,the inertia is the sum of the variances of all thevariables.If the variables are standardized (Q is the diagonal matrixof inverse variances), then the inertia is the number ofvariables p.For correspondence analysis the inertia is theChi-squared statistic.
. . . . . .
Comparing Two Diagrams: the RV coefficientMany problems can be rephrased in terms of comparisonof two ``duality diagrams" or put more simply, twocharacterizing operators, built from two ``triplets",usually with one of the triplets being a response orhaving constraints imposed on it. Most often what isdone is to compare two such diagrams, and try to getone to match the other in some optimal way.To compare two symmetric operators, there is either avector covariance as inner productcovV(O1,O2) = Tr(O1O2) =< O1,O2 > or a vectorcorrelation (Escoufier, 1977)
RV(O1,O2) =Tr(O1O2)√
Tr(Ot1O1)tr(Ot
2O2).
If we were to compare the two triplets(Xn×1, 1, 1
nIn)
and(Yn×1, 1, 1
nIn)
we would have RV = ρ2.. . . . . .
PCA: Special case
PCA can be seen as finding the matrix Y whichmaximizes the RV coefficient between characterizingoperators, that is, between
(Xn×p,Q,D
)and
(Yn×q,I,D
),
under the constraint that Y be of rank q < p .
RV(XQXtD, YYtD
)=
Tr(XQXtDYYtD
)√
Tr(XQXtD
)2 Tr(YYtD
)2.
. . . . . .
This maximum is attained where Y is chosen as thefirst q eigenvectors of XQXtD normed so that YtDY = Λq.The maximum RV is
RVmax =∑q
i=1 λ2i∑p
i=1 λ2i.
Of course, classical PCA has D = 1nI, Q = I, but the
extra flexibility is often useful. We define the distancebetween triplets (X,Q,D) and (Z,Q,M) where Z is alson× p, as the distance deduced from the RV inner productbetween operators XQXtD and ZMZtD.
. . . . . .
One Diagram to replace Two Diagrams
Canonical correlation analysis was introduced byHotelling to find the common structure in two sets ofvariables X1 and X2 measured on the same observations.This is equivalent to merging the two matricescolumnwise to form a large matrix with n rows andp1 + p2 columns and taking as the weighting of thevariables the matrix defined by the two diagonal blocks(Xt
1DX1)−1 and (Xt2DX2)−1
Q =
(Xt
1DX1)−1 0
0 (Xt2DX2)−1
. . . . . .
Rp1∗ −−−−→X1
Rn
Ip1
x yV1 Dy xW1
Rp1 ←−−−−Xt1
Rn∗
Rp2∗ −−−−→X2
Rn
Ip2
x yV2 Dy xW2
Rp2 ←−−−−Xt2
Rn∗
Rp1+p2∗ −−−−→[X1;X2]
Rn
Qx yV D
y xW
Rp1+p2 ←−−−−−[X1;X2]t
Rn∗
This analysis gives the same eigenvectors as the analysisof the triple(Xt
2DX1, (Xt1DX1)−1, (Xt
2DX2)−1), also known as thecanonical correlation analysis of X1 and X2.. . . . . .
PCA with regards to Instrumental Variables
CR Rao, 1964: Explain one matrix by another (one matrixis a response, the other explanatory). It is the extensionof PCA and regression. If Z is the explanatory table andX is the response, we take the projector:
PZ = Z(Z′DZ)−1Z′D, X̂ = PZX are the predicted values
Take the triplet (X̂,Q,D) and do the PCA.See [6] for more details.
. . . . . .
Integrating Spatial Information into the triplet
If we make Z the explanatory table contain the spatialinformation, we are integrating the spatial informationinto the multivariate analysis.Another solution explained in Dray and Jombart's paper isto study the coinertia of X and WX, the spatially laggedversion of X.
. . . . . .
Spatial Multivariate Output
. . . . . .
Jean Thioulouse uses the generalized notion ofco-inertia to analyze these complex data:
Ephemeropteraspecies (13)
Environmentalvariables (10)
Site
s (6
)
Site
s (6
)
SpringSummer
AutumnWinter
SpringSummer
AutumnWinter
An example data set consists of two data cubes. Thefirst one contains 10 environmental variables that havebeen measured four times (in Spring, Summer, Autumnand Winter) along six sampling sites. The second onecontains the numbers of Ephemeroptera belonging to 13species, collected in the same conditions.. . . . . .
Complex Output
. . . . . .
Benefitting from the tools and schools ofStatisticians.......
Thanks to the R community, in particular Chessel,Jombart, Dray, Thioulouse(ade4 group in Lyon) andEmmanuel Paradis.
Collaborators:Persi Diaconis, Sharad Goel, John Chakarian, AdamGuetz, Adam Kapelner, Elisabeth Purdom, Omar delaCruz,Nelson Ray, Yves Escoufier.
Data:Alfred Spormann, Peter Lee, Francesca Setiadi.Funding from NIH/ NIGMS R01 and NSF-DMS.
. . . . . .
L. Billera, S. Holmes, and K. Vogtmann.The geometry of tree space.Adv. Appl. Maths, 771--801, 2001.
J. Chakerian and S. Holmes.Computational tools for evaluating phylogenetic andhierarchical clustering treescomputational tools forevaluating phylogenetic and hierarchical clusteringtrees.Technical Report 1006.1015, arXiv, 2010.
P. Diaconis, S. Goel, and S. Holmes.Horseshoes in multidimensional scaling and kernelmethods.Annals of Applied Statistics, 2007.
S. Dray and T. Jombart.Revisiting guerry’s data: Introducing spatialconstraints in multivariate analysis.Annals of Applied Statistics, 2010.
. . . . . .
Y. Escoufier.Operators related to a data matrix.In J.R. Barra and coll., editors, Recent developmentsin Statistics., pages 125--131. North Holland,, 1977.
Susan Holmes.Multivariate analysis: The french way.Feschrift for David Freedman, IMS, 2006.K. Mardia, J. Kent, and J. Bibby.Multiariate Analysis.Academic Press, NY., 1979.E. Mossel.Phase transitions in phylogeny.Trans. Amer. Math. Soc., 356(6):2379--2404(electronic), 2004.C. R. Rao.The use and interpretation of principal componentanalysis in applied research.
. . . . . .
Sankhya A, 26:329--359., 1964.
J. Thioulouse.Simultaneous analysis of a sequence of pairedecological tables: A comparison of several methods.Annals of Applied Statistics, 2010.to appear.
. . . . . .