Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Usingdistancestoaddressthechallengesofheterogeneousdata
SusanHolmeshttp://www-stat.stanford.edu/˜susan/
Bio-X andStatistics, StanfordUniversity
July29, 2015
ABabcdfghiejkl. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Themesseswedealwith
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Homogeneous data are all alike;all heterogeneous data are
heterogeneous in their own way.
.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
GoalsinModernBiology: SystemsApproachLookatthedata/allthedata: dataintegration
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
GoalsinModernBiology: SystemsApproachLookatthedata/allthedata: dataintegration
Tumor Cells
0 5000 10000 15000 20000
05000
10000
15000
20000
05e-0
40.0
010 1 1 0 -1 1 0 0 0 -1
0 1 1 0 0 0 0 0 0 1
0 1 -1 0 -1 0 0 0 0 -1
0 1 1 0 0 -1 1 0 1 1
0 1 1 0 0 0 0 1 0 1( (
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Whatdostatisticiansdo?
▶ Designnewexperimentstotestscientifichypotheses.▶ Visualizeandsummarizedatainwaysthataccountfor
uncertainties.▶ Lookformeaningfuldifferencesorstructureinhigh
dimensionalnoisydata.▶ Predicttheclassofnewobservationsgivenpreviously
observedones.▶ Predictthevalueofaresponsevariablegivenawhole
setofotherexplanatoryvariables.▶ Combinedifferentsourcesofdatatounderstandcomplex
interactions.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Today'schallenge
▶ Dataarenotuniformlydistributedfromsomemanifold.
▶ Dataarenotanidenticallydistributedrandomsample.
▶ Dataarenotindependent.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Datacanoftenbeseenaspointsinastatespace
Rp
x
x1
2x
x
x
x
x
2
.
.
.
.
.
.
p
i
1
3
.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DistancesinStatistics
▶ EuclideanDistances, spatialdistances.▶ WeightedEuclideandistances: Mahalanobisdistancefor
discriminantanalysis.▶ Chisquaredistancesforcontingencytablesanddiscrete
data.▶ Jaccarddistancesforpresenceabsenceisoneof50
distancesusedinEcology.▶ EarthMover'sdistanceontreesorgraphs.▶ Biologicallymeaningfuldistances(DNA,haplotype,
Proteins).
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Whatdostatisticiansusedistancesfor?
▶ SummariesthroughFréchetMeansandMediansandpseudovariances.
▶ CenterofCloudofObjects Tk (equalweights): Find T0
thatminimizeseither∑K
k=1 d2(T0,Tk) thisisthe (L2)definitionoftheFréchetmeanobject,
▶ or∑K
k=1 d(T0,Tk) (L1 orGeometricMedian).▶ Pseudovariance= 1
K−1
∑Kk=1 d2(T0,Tk) = s2. Dimension
reductionandvisualization.NearestNeighborMethods.Clustering.Makenetworkedgesfromclosepoints. Predictionbyminimizingweightedresidualdistances.Cross-products: correlations, autocorrelations.Generalizationsofanalysisofvariance.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Whatdostatisticiansusedistancesfor?
▶ SummariesthroughFréchetMeansandMediansandpseudovariances.
▶ Dimensionreductionandvisualization.▶ NearestNeighborMethods.▶ Clustering.▶ Makenetworkedgesfromclosepoints.▶ Predictionbyminimizingweightedresidualdistances.▶ Cross-products: correlations, autocorrelations.▶ Generalizationsofanalysisofvariance.
Findingtherightdistanceusuallysolvesthestatisticalproblem.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part I
The Geometries of Data
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Firstexample: cellsegmentationJointworkwithAdamKapelnerandPP Lee.Stainedbiopsyslides. Multispectralimaging(8levels/wavelengths).StainedLymphNode Aimtoidentifycell.
Pointssimilarinfeaturespaceareofthesametype.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Problem: Stainingisheterogeneous
Bothimagesarefromthesameimageset. ThestainedcellsarecancercellsstainedwithFastRedred.Someregionsofthetissuestainliketheimageontheleftandotherregionsstainastheleft.ThisshowsthelevelofheterogeneityThesearetwo``subclasses''ofthesamephenotype(theleftisnamedsubclass``A,''theright, subclass``B'').
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Problem: StainingisheterogeneousExtremevariabilityintheimagecolors/intensity/contrast.Pixelsfromasamecellnotindependentandidenticallydistributedacrossthedifferentslidesoracrossdifferentcelltypes.
Simplenearestneighborapproach:-Take8dimensionalpixelspoints.-Assigningthepointtotheclosestneighbor
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Problem: StainingisheterogeneousExtremevariabilityintheimagecolors/intensity/contrast.Pixelsfromasamecellnotindependentandidenticallydistributedacrossthedifferentslidesoracrossdifferentcelltypes. ?
Simplenearestneighborapproach:-Take8dimensionalpixelspoints.-Assigningthepointtotheclosestneighbor
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
0 5 10
−2
02
46
8
Orange
Red
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
0 5 10
−2
02
46
8
Orange
Red
●
(3.2,2)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
0 5 10
−2
02
46
8
Orange
Red
●
(3.2,2)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
0 5 10
−2
02
46
8
Orange
Red
●
(3.2,2)
D12(p,m1)=19.7 D2
2(p,m2)=16
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
MultivariateNormalData
MahalanobisTransformation.Severaldifferentclusterswithdifferentvariance-covariancematricesanddifferentmeans.(µ1,Σ1) (µ2,Σ2)
D21(x, µ1) = (x− µ1)
TΣ−11 (x− µ1)
D22(x, µ2) = (x− µ2)
TΣ−12 (x− µ2)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
CorrespondingDataTransformation
H = I− 1Dn1T, S = X′HDnHX
zi. = S− 12 (xi. − x)
Thisissometimescalled`datasphering'.
0 5 10
−20
24
68
Sphered O
Speh
ered
Red
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
OutputDataTumor
Tumor Cells
0 5000 10000 15000 20000
050
0010
000
1500
020
000
05e
−04
0.00
1
NumberofTumorcells: 27,822 . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
WecanaddinformationthroughchoiceofdistancesSampledatacanoftenbeseen Variablesare`vectors'aspointsinastatespace. indatapointspaceRp Rn
x
x1
2x
x
x
x
x
2
.
.
.
.
.
.
p
i
1
3
. x
x 1
2x
x
x
x
x
2
.
.
.
. .
.
n
j
1
3
.
x4 .
x 3.
xtQy =< x, y >Q xtDy =< x, y >DDuality: Transposabledata. . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DataAnalysis: GeometricalApproachi. Thedataare p variablesmeasuredon n observations.ii. X with n rows(theobservations)and p columns(the
variables).iii. D isan n× n matrixofweightsonthe``observations'',
whichismostoftendiagonalbutnotalways.iv Symmetricdefinitepositivematrix Q, weightson
. variables, often Q =
1σ21
0 0 0 ...
0 1σ22
0 0 ...
0 0. . . 0 ...
... ... ... 0 1σ2p
.
x
x1
2x
x
x
x
x
2
.
.
.
.
.
.
p
i
1
3
.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
EuclideanSpaceanddimensionreduction
Thesethreematricesformtheessential``triplet" (X,Q,D)definingamultivariatedataanalysis.Q and D definegeometriesorinnerproductsin Rp and Rn,respectively, through
xtQy =< x, y >Q x, y ∈ Rp
xtDy =< x, y >D x, y ∈ Rn.
Thiscanbeextendedtomoreinnerproductsgivingwhatisknownas Kernel methods.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
PrincipalComponentAnalysis: DimensionReduction
PCA seekstoreplacetheoriginal(centered)matrix X byamatrixoflowerrank, thiscanbesolvedusingthesingularvaluedecompositionof X:
X = USV′, with U′DU = In and V′QV = Ip and S diagonal
XX′ = US2U′, with U′DU = In and S2 = Λ
PCA isalinearnonparametricmultivariatemethodfordimensionreduction. D and Q aretherelevantmetricsonthedualrowandcolumnspacesof n samplesand p variables.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
A CommutativeDiagramApproach
CaillezandPages, 1976. Escoufier, 1977.Statisticianssearchforapproximationswithcertainproperties, forthecaseofPCA forinstance, werephrasetheproblemasfollows:
▶ Q canbeseenasalinearfunctionfrom Rp toRp∗ = L(Rp), thespaceofscalarlinearfunctionson Rp.
▶ D canbeseenasalinearfunctionfrom Rn toRn∗ = L(Rn).
▶
V = XtDX
Rp∗ −−−−→X
Rn
Qx yV D
y xW
Rp ←−−−−Xt
Rn∗
W = XQXt
Thisdualitygives`transposable'data.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
PropertiesoftheDiagram
Rankofthediagram:X,Xt,VQ and WD allhavethesamerank.For Q and D symmetricmatrices, VQ and WD arediagonalisableandhavethesameeigenvalues.
λ1 ≥ λ2 ≥ λ3 ≥ . . . ≥ λr ≥ 0 ≥ · · · ≥ 0.
Eigendecompositionofthediagram: VQ is Q symmetric, thuswecanfind Z suchthat
VQZ = ZΛ,ZtQZ = Ip, where Λ = diag(λ1, λ2, . . . , λp). (1)
ModernextensionstothisapproachincludeKernelmethodsinMachineLearning.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
PredictingandSummarizingthroughdistances
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ComparingTwoDiagrams: theRV coefficientManyproblemscanberephrasedintermsofcomparisonoftwo``dualitydiagrams"orputmoresimply, twocharacterizingoperators, builtfromtwo``triplets", usuallywithoneofthetripletsbeingaresponseorhavingconstraintsimposedonit.Mostoftenwhatisdoneistocomparetwosuchdiagrams,andtrytogetonetomatchtheotherinsomeoptimalway.(O = WD)Tocomparetwosymmetricoperators, thereiseitheravectorcovarianceasinnerproductcovV(O1,O2) = Tr(Ot
1O2) =< O1,O2 > oravectorcorrelation(Escoufier, 1977)
RV(O1,O2) =Tr(Ot
1O2)√Tr(Ot
1O1)tr(Ot2O2)
.
Ifweweretocomparethetwotriplets(Xn×1, 1,
1nIn
)and(
Yn×1, 1,1nIn
)wewouldhave RV = ρ2.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part II
Dimension Reduction: theEuclidean embedding workhorse:
MDS
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
MetricMultidimensionalScalingSchoenberg(1935)
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
FromCoordinatestoDistancesandBack
Ifwestartedwithoriginaldatain Rp thatarenotcentered:Y, applythecenteringmatrix
X = HY, with H = (I− 1
n11′), and 1′ = (1, 1, 1 . . . , 1)
Call B = XX′, if D(2) isthematrixofsquareddistancesbetweenrowsofX intheeuclideancoordinates, wecanshowthat
−1
2HD(2)H = B
Schoenberg'sresult: exactEuclideandistance If B ispositivesemi-definitethen D canbeseenasadistancebetweenpointsinaEuclideanspace.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ReverseengineeringanEuclideanembedding
Wecangobackwardsfromamatrix D to X bytakingtheeigendecompositionof B = −1
2HD(2)H inmuchthesameway
thatPCA providesthebestrank r approximationfordatabytakingthesingularvaluedecompositionof X, ortheeigendecompositionof XX′.
X(r) = US(r)V′ with S(r) =
s1 0 0 0 ...0 s2 0 0 ...0 0 ... ... ...0 0 ... sr ...... ... ... 0 0
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
MultidimensionalScaling(MDS)
Simpleclassicalmultidimensionalscaling.▶ SquareD elementwise D(2) = D2.▶ Compute −1
2 HD2H = B.▶ Diagonalize B tofindtheprincipalcoordinates SV′.▶ Chooseanumberofdimensionsbyinspectingthe
eigenvalue'sscreeplot.Theadvantageisthattheoriginaldistancesdon'thavetobeEuclidean.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
TakingCategoricalDataandMakingitintoaContinuum
HorseshoeExample:JointwithPersiDiaconisandSharadGoel(AnnalsofAppliedStats, 2005). Datafrom2005U.S.HouseofRepresentativesrollcallvotes. Wefurtherrestrictedouranalysistothe401Representativesthatvotedonatleast90% oftherollcalls(220Republicans, 180Democratsand1Independent)leadingtoa 401× 669 matrixofvotingdata.
TheDataV1 V2 V3 V4 V5 V6 V7 V8 V9 V10
R1 -1 -1 1 -1 0 1 1 1 1 1 ...
R2 -1 -1 1 -1 0 1 1 1 1 1 ...
R3 1 1 -1 1 -1 1 1 -1 -1 -1 ...
R4 1 1 -1 1 -1 1 1 -1 -1 -1 ...
R5 1 1 -1 1 -1 1 1 -1 -1 -1 ...
R6 -1 -1 1 -1 0 1 1 1 1 1 ...
R7 -1 -1 1 -1 -1 1 1 1 1 1 ...
R8 -1 -1 1 -1 0 1 1 1 1 1 ...
R9 1 1 -1 1 -1 1 1 -1 -1 -1 ...
R10 -1 -1 1 -1 0 1 1 0 0 0 .... .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
L1 distance
Wedefineadistancebetweenlegislatorsas
d(li, lj) =1
669
669∑k=1
|vik − vjk|.
Roughly, d(li, lj) isthepercentageofrollcallsonwhichlegislators li and lj disagreed.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
!0.1!0.05
00.05
0.1!0.2
!0.1
0
0.1
0.2
!0.2
!0.15
!0.1
!0.05
0
0.05
0.1
0.15
3-DimensionalMDS mappingoflegislatorsbasedonthe2005U.S.HouseofRepresentativesrollcallvotes. Weused
dissimilarityindices1-exp(−λd(R1,R2))
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
!0.1!0.05
00.05
0.1!0.2
!0.1
0
0.1
0.2
!0.2
!0.15
!0.1
!0.05
0
0.05
0.1
0.15
3-DimensionalMDS mappingoflegislatorsbasedonthe2005U.S.HouseofRepresentativesrollcallvotes. Colorhasbeenaddedtoindicatethepartyaffiliationofeachrepresentative.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
0 50 100 150 200 250 300 350 4000
10
20
30
40
50
60
70
80
90
100
MDS Rank
Natio
nal J
ourn
al S
core
ComparisonoftheMDS derivedrankforRepresentativeswiththeNationalJournal'sliberalscore
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AnApplication: VisualizingGeodesicDistancesbetweenTrees
▶ NearestNeighborInterchange(NNI). RotationMoves
4
0
2 31
0
4321
0
41 32
▶ Fill-inofNNI moves: Billera, Holmes, Vogtmann(2001)(BHV).Theboundariesbetweenregionsrepresentanareaofuncertaintyabouttheexactbranchingorder. Inbiologicalterminologythisiscalledan`unresolved'tree.Moredetailshere
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AnApplication: VisualizingGeodesicDistancesbetweenTrees
▶ NearestNeighborInterchange(NNI). RotationMoves
4
0
2 31
0
4321
0
41 32
▶ Fill-inofNNI moves: Billera, Holmes, Vogtmann(2001)(BHV).Theboundariesbetweenregionsrepresentanareaofuncertaintyabouttheexactbranchingorder. Inbiologicalterminologythisiscalledan`unresolved'tree.Moredetailshere
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AnApplication: VisualizingGeodesicDistancesbetweenTrees
▶ NearestNeighborInterchange(NNI). RotationMoves
4
0
2 31
0
4321
0
41 32
▶ Fill-inofNNI moves: Billera, Holmes, Vogtmann(2001)(BHV).Theboundariesbetweenregionsrepresentanareaofuncertaintyabouttheexactbranchingorder. Inbiologicalterminologythisiscalledan`unresolved'tree.Moredetailshere
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AnApplication: VisualizingGeodesicDistancesbetweenTrees
▶ NearestNeighborInterchange(NNI). RotationMoves
4
0
2 31
0
4321
0
41 32
▶ Fill-inofNNI moves: Billera, Holmes, Vogtmann(2001)(BHV).Theboundariesbetweenregionsrepresentanareaofuncertaintyabouttheexactbranchingorder. Inbiologicalterminologythisiscalledan`unresolved'tree.Moredetailshere
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
A ConePath
A pathbetweentwotrees T and T′ alwaysexists. Sinceallorthantsconnectattheorigin, anytwotrees T and T′ canbeconnectedbyatwo-segmentpath, thisiscalledthecone-path.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
c
a
c
b ba
Theorem(Billera, Holmes, Vogtmann(BHV,2001)):TreespacewithBHV metricisaCAT(0)space, thatis, ithasnon-positivecurvature.Thisimpliestherearegeodesicbetweenanytwotrees(Gromov).Note: ThisspaceoftreesisnotanEuclideanspace.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
c
a
c
b baThesizeofthe``pseudo-variance''canbeestimatedfrom∑
pid(T0,Ti)2.PropertiesoftheFréchetmeanofasetoftreeshasbeen(Bhattacharyaetal.2010, Miller, Mattingley, Owen, Marron, al.2013).
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
PhylogeneticTreesMalariaDataasseenusing ape
Pre1
Pme2
Plo6
Pga11
Pma3
Pbe5
Pfr7
Pkn8
Pcy9
Pvi10
Pfa4
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
SamplingDistributionforTrees
Data 1
23
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Data 1
23
Treespace Tn
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Data1
23
4
True Sampling Distribution
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Data1
23
4
Bootstrap Sampling Distribution (non parametric)
n
^
*
**
*
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
BootstrapofMalariaData
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
HierarchicalClusteringTrees
HEA2
5_EF
FE_3
MEL
39_E
FFE_
2HE
A31_
EFFE
_2M
EL67
_EFF
E_4
HEA5
5_EF
FE_4
HEA5
9_EF
FE_5
HEA2
6_EF
FE_1
MEL
51_E
FFE_
5M
EL36
_EFF
E_1
MEL
53_E
FFE_
3HE
A31_
NAI_
2HE
A55_
NAI_
4M
EL67
_NAI
_4M
EL53
_NAI
_3HE
A25_
NAI_
3M
EL51
_NAI
_5HE
A59_
NAI_
5HE
A26_
NAI_
1M
EL36
_NAI
_1M
EL39
_NAI
_2M
EL51
_MEM
_5HE
A26_
MEM
_1M
EL67
_MEM
_4HE
A31_
MEM
_2HE
A55_
MEM
_4HE
A25_
MEM
_3HE
A59_
MEM
_5M
EL53
_NAI
_3M
EL36
_MEM
_1M
EL39
_MEM
_2
Human AF5q31 protein (AF5q31) intracellular hyaluronan−bindiselectin L (lymphocyte adhesioHuman cDNA FLJ10470 fis, cloneHuman mRNA for KIAA0303 gene, KIAA0303 proteinSTAT induced STAT inhibitor 3KIAA0752 proteinGRB2−related adaptor proteindelta (Drosophila)−like 1stanninproteoglycan link proteinIncyte ESTamyloid beta (A4) precursor prHuman genomic DNA, chromosome follicular lymphoma variant trHuman, clone IMAGE:3875338, mRHuman, Similar to phosphodiestHuman sodium/myo−inositol cotrESTs, Weakly similar to MUC2_HHuman CpG island DNA genomic MPOU domain, class 2, transcripHuman zinc finger protein ZNF2Human 54 kDa progesterone receprotein tyrosine phosphatase, platelet/endothelial cell adheHuman cDNA FLJ20849 fis, cloneHuman mRNA for KIAA0972 proteiKIAA0290 proteinHuman clone 295, 5cM region sueukaryotic translation initiatferritin, heavy polypeptide 1Human cDNA: FLJ22008 fis, clonHuman insulin−like growth factgranzyme K (serine protease, gHuman Epstein−Barr virus inducPAS−serine/threonine kinaselymphotoxin beta (TNF superfamHuman mRNA for nel−related prochemokine (C−C motif) receptorPAS−serine/threonine kinaseHuman mRNA for alpha−actinin, Human mRNA encoding the c−myc Human RATS1 mRNA, complete cdshyaluronoglucosaminidase 2Human DNA for muscle nicotinicHuman epithelial V−like antigeinterferon gamma receptor 2 (iHomo sapiens clone 24775 mRNA syntaphilinHuman mRNA for endosialin protHuman zinc finger protein PLAGshort−chain dehydrogenase/reduHuman, short−chain dehydrogena
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0 20000 40000 60000 80000 120000
Eigenvalues of MDS for bootstrapped trees
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
−40 −20 0 20 40
−60
−40
−20
020
40
Bootstrapped trees
o1
2
3
4
5
6
7
8
9
1011
12
1314
15
1617
18
19
20
21
22
23
24
25 26
2728
29
30
31
32
33
34
35
36
373839
40
41
42
43
4445
46
47 4849
50
51
52
53
54
55
5657
5859
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
7576
77
78
7980
81
8283
84
85
86
8788
89
90
91
92
93
94
95
9697
98
99
100o
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part III
Combine and Compare Trees,Graphs and Contingent Count
Data
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
LayersofDatainthe MicrobiomeJoshuaLederberg:`theecologicalcommunityofcommensal,symbiotic, andpathogenicmicroorganismsthatliterallyshareourbodyspaceandhavebeenallbutignoredasdeterminantsofhealthanddisease'Microbiome Completecollectionofgenescontainedinthe
genomesofmicrobeslivinginagivenenvironment.
Numbers Humansshelter100trillionmicrobes(1014), (wearemadeof10 ×1012 cells).
Metagenome Compositionofallgenespresentinanenvironment(soil, gut, seawater), regardlessofspecies.
Transciptome ThesearethemRNA transcriptsinthecell, itreflectsthegenesthatarebeingactivelyexpressedatanygiventime.
Metabolome Themetabolites(smallmolecules)nucleicorfattyacids, sugars,... presentinthesampleeitherendogenousorexogenous(medication, pollution).
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
.
Source: YK LeeandSK MazmanianScience, 2010.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Bacteriaetc... andUs
Thehumanmicrobiomeorhumanmicrobiotaistheassemblageofmicroorganismsthatresideonthesurfaceandindeeplayersofskin, inthesalivaandoralmucosa, intheconjunctiva, andinthegastrointestinaltracts.
▶ Theyincludebacteria, fungi, andarchaea.▶ Someoftheseorganismsperformtasksthatareuseful
forthehumanhost. (liveinsymbiosis)▶ Majorityhavenoknownbeneficialorharmfuleffect.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
HumanMicrobiome: Whatarethedata?
DNA TheGenomicmaterialpresent(16sRNA-geneespecially, butalsoshotgun).
RNA Whatgenesarebeingturnedon(geneexpression), transcriptomics.
MassSpec Specificsignaturesofchemicalcompoundspresent(LC/MS,GC/MS).
Clinical Multivariateinformationaboutpatients'clinicalstatus, medication, weight.
Environmental Location, nutrition, drugs, chemicals,temperature, time.
DomainKnowledge Metabolicnetworks, phylogenetictrees,geneontologies.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
HeterogeneousDataObjects
Objectorientedinputanddatamanipulationwith phyloseq
(McMurdieandHolmes, 2013, PlosONE)ObjectorienteddatainR:
Taxonomy Table taxonomyTableslots: .Data
OTU Abundanceclass: otuTableslots: .Data, speciesAreRows
Sample VariablessampleDataslots: .Data,names,row.names,.S3Class
Phylogenetic Treeclass: phyloslots: see ape package
matrix matrixdata.frame
phyloseqslots:otuTablesampleDatataxTabtre
Experiment-level data object:
Component data objects:
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part IV
Heteroscedasticity: Mixturesand to Normalize them
Source: xkcd.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Pointsaremeasuredwithunequalvariance
x
x1
2x
x
x
x
x
2
.
.
.
.
.
.
p
i
1
3
.
xn ..
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Some real data (Caporoso et al, 2011)
> GlobalPatterns
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data()Sample Data: [ 26 samples by 7 sample variables ]
tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ]
> sample_sums(GlobalPatterns)
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
864077 1135457 697509 1543451 2076476 718943 433894 186297
.....
NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
1478965 1652754 58688 493126 279704 937466 1211071 1216137
> summary(sample_sums(GlobalPatterns))
Min. 1st Qu. Median Mean 3rd Qu. Max.
58690 567100 1107000 1085000 1527000 2357000
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Pointsaremeasuredwithunequalvariance
x
x1
2x
x
x
x
x
2
.
.
.
.
.
.
p
i
1
3
.
xn ..
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Equalizationofvariances
Inthisbinomialexamplethevarianceoftheproportionestimateis Var(Xn ) =
pqn =
qnE(
Xn ), afunctionofthemean.
Thisisacommonoccurrenceandonethatistraditionallydealtwithinstatisticsbyapplyingvariance-stabilizingtransformations.However, inordertofindtherighttransformation, weneedagoodmodelfortheerror.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
VarianceStabilization
Prefertodealwitherrorsacrosssampleswhichareindependentandidenticallydistributed.Inparticularhomoscedasticity(equalvariances)acrossallthenoiselevels.Thisisnotthecasewhenwehaveunequalsamplesizesandvariationsintheaccuracyacrossinstruments.A standardwayofdealingwithheteroscedasticnoiseistotrytodecomposethesourcesofheterogeneityandapplytransformationsthatmakethenoisevariancealmostconstant.Thesearecalled variancestabilizingtransformations.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
MixtureModelingworksMiracles
▶ Beta-Binomial(deepSNV).▶ ZeroinflatedPoissonorGaussian.▶ Gamma-Poisson.
MixturesareubiquitousbecauseofamathematicaltheoremDeFinnetti'sTheorem
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
WolfgangHuber. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Correcttransformationsareavailable
McMurdieandHolmes(2014)``WasteNot, WantNot: Whyrarefyingmicrobiomedataisinadmissible'', PLOSComputationalBiology, Methods.WeproposetomodelthereadcountsIftechnicalreplicateshavesamenumberofreads: sj,Poissonvariationwithmean µ = sjui.Taxa i incidenceproportion ui.Numberofreadsforthesample j andtaxa i wouldbe
Kij ∼ Poisson (sjui)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
A distanceontheknowntreeMonge-Kantorovichearthmover'sdistanceonthetree.Usedtocomparetwosamplesorbodysitesforinstance.Incorporatetaxaabundancesandphylogenetictree
Epulopiscium
Clostridium
Adlercreutzia
Lachnospira
Alistipes
Roseburia
Coprococcus
Clostridium
Blautia
Coprococcus
Dehalobacterium
Clostridium
Clostridium
Clostridium
Coprobacillus
Coprococcus
Clostridium
Clostridium
Moryella
Abundance
1
25
625
Class
Actinobacteria (class)
Bacilli
Bacteroidia
Clostridia
Erysipelotrichi
Gammaproteobacteria
Mollicutes
Verrucomicrobiae
YS2
Dualitydiagrammethodsthatcanuseanydependencystructure.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
UnifracDistance(LozuponeandKnight, 2005)
isadistancebetweengroupsoforganismsthatarerelatedtoeachotherbyatree.SupposewehavetheOTUspresentinsample1(blue)andinsample2(red).Question: Dothetwosamplesdifferphylogenetically?ItisdefinedastheratioofthesumofthelengthsofthebranchesleadingtomembersofgroupA ormembersofgroupB butnotbothtothetotalbranchlengthofthetree.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
WeightedUnifracdistance A modificationofUniFrac,weightedUniFracisdefinedin(Lozuponeetal., 2007)as
n∑i=1
bi × |AiAT− Bi
BT|
▶ n = numberofbranchesinthetree
▶ bi =lengthoftheithbranch
▶ Ai =numberofdescendantsofithbranchingroupA
▶ AT =totalnumberofsequencesingroupA
[7].[6]. . .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Costelloetal. 2010
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Rao'sDistance
Westartwithadistancebetweenindividuals.Theheterogeneityofapopulation(Hi )istheaveragedistancebetweenmembersofthatpopulation.Theheterogeneitybetweentwopopulations(Hij)istheaveragedistancebetweenamemberofpopulation i andamemberofpopulation j.Thedistancebetweentwopopulationsis
Dij = Hij −1
2(Hi + Hj)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DecompositionofDiversity
Ifwehavepopulations 1, . . . , k withfrequencies π1, . . . , πk,thenthediversityofallthepopulationstogetheris
H0 =
k∑i=1
πiHi +∑i
∑j
πiπjDij = H(w) + D(b)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DoublePrincipalCoordinateAnalysisPavoine, DufourandChessel(2004), Purdom(2010)andFukuyamaetal. (2011). .Supposewehavenspeciesinplocationsanda(euclidean)matrix ∆ givingthesquaresofthepairwisedistancesbetweenthespecies. Thenwecan
▶ Usethedistancesbetweenspeciestofindanembeddingin n− 1 -dimensionalspacesuchthattheeuclideandistancesbetweenthespeciesisthesameasthedistancesbetweenthespeciesdefinedin ∆.
▶ Placeeachoftheplocationsatthebarycenterofitsspeciesprofile. TheeuclideandistancesbetweenthelocationswillbethesameasthesquarerootoftheRaodissimilaritybetweenthem.
▶ UsePCA tofindalower-dimensionalrepresentationofthelocations.
Givethespeciesandcommunitiescoordinatessuchthattheinertiadecomposesthesamewaythediversitydoes.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
FukuyamaandHolmes, 2012.Method Originaldescription Newformula PropertiesDPCoA square root of Rao's distance
basedonthesquarerootofthepatristicdistances
[∑
i bi(Ai/AT − Bi/BT)2]1/2 Mostsensitivetooutliers, leastsensitive to noise, upweightsdeep differences, gives OTUlocations
wUniFrac∑
i bi |Ai/AT − Bi/BT|∑
i bi |Ai/AT − Bi/BT| Less sensitive to outliers/moresensitivetonoisethanDPCoA
UniFrac fractionofbranchesleadingtoexactlyonegroup
∑i bi1{
Ai/AT−Bi/BTAi/AT+Bi/BT
≥ 1} Sensitive to noise, upweightsshallowdifferencesonthetree
Summaryofthemethodsunderconsideration. ``Outliers"referstohighlyabundantOTUs, andnoisereferstonoiseindetectinglow-abundanceOTUs(seethetextformoredetail).
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AntibioticTimeCourseData
Measurementsofabout2500differentbacterialOTUsfromstoolsamplesofthreepatients(D,E,F)Eachpatientsampled ∼ 50timesduringthecourseoftreatmentwithciprofloxacin(anantibiotic).TimescategorizedasPreCp, 1stCp, 1stWPC (weekpostcipro), Interim, 2ndCp, 2ndWPC,andPostCp.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
UniFrac
Axis 1: 14.7%
Axi
s 2:
10.
3%
−0.2
−0.1
0.0
0.1
0.2
●
●
●●●
●
●
●●
●●
●
●
●●
●●●
● ●
●●
●●●●●
●●●●
●●
●●
●
●
●
●●
●●
●●
●
●
●
●● ●●
●
●●●●
−0.2 −0.1 0.0 0.1 0.2 0.3 0.4
weighted UniFrac
Axis 1: 47.6%A
xis
2: 1
2.3%
−0.2
−0.1
0.0
0.1
0.2
●
●
●●● ● ●
●●
●●
●●
●●●
●
●
● ●
●
● ●●
●
●
●●
●● ●●●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●
●
● ●●
●
●
●
−0.4−0.3−0.2−0.1 0.0 0.1 0.2 0.3
weighted UF on presence/absence
Axis 1: 32.7%
Axi
s 2:
15.
1%
−0.05
0.00
0.05
0.10
●
●
●●
●●
●
●
●
●
●
●
●
● ●●
●●
●
●
●
●●
●
●
●
●
●
●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
−0.10−0.050.000.050.100.150.20
subject
● D
E
F
ComparingtheUniFracvariants. Fromlefttoright:PCoA/MDS withunweightedUniFrac, withweightedUniFrac,andwithweightedUniFracperformedonpresence/absencedataextractedfromtheabundancedatausedintheothertwoplots
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
(a) MDS of OTUs
Axis 1: 6.2%
Axi
s 2:
3.7
%
−1.5
−1.0
−0.5
0.0
0.5
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●
●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●
−1.0 −0.5 0.0 0.5
(c) DPCoA OTU plot
CS1
CS
2
−1.0
−0.5
0.0
0.5
1.0
●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●
●
●
●●
●
●●●●●
●●●●●●●●●●●●●●●
●●●●●
●●●●●●●●●●●●●●
●●●●●
●
●
●
●●
●
●●
●
●
●
●●●●●●●●
●
●●●●●●●●●
●
●●
●
●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●
●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●
●
●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●
−1.5 −1.0 −0.5 0.0 0.5 1.0
phylum
● 4C0d−2
● Actinobacteria
● Bacteroidetes
● Candidate division TM7
● Cyanobacteria
● Firmicutes
● Fusobacteria
● Lentisphaerae
● Proteobacteria
● Synergistetes
● Verrucomicrobia
(b) DPCoA community plot
Axis 1: 40.9%
Axi
s 2:
13.
3%−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
●
●●●
●
● ●
●
●
●
● ●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
● ●
●●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
−1.0 −0.5 0.0 0.5
subject
● D
E
F
(a)PCoA/MDS oftheOTUsbasedonthepatristicdistance, (b)communityand(c)speciespointsforDPCoA afterremovingtwooutlyingspecies.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AntibioticStress
Wenextwanttovisualizetheeffectoftheantibiotic.OrdinationsofthecommunitiesduetoDPCoA andUniFracwithinformationaboutthewhetherthecommunitywasstressedornotstressed(precipro, interim, andpostciprowereconsidered``notstressed'', whilefirstcipro, firstweekpostcipro, secondcipro, andsecondweekpostciprowereconsidered``stressed'').WeseethatforUniFrac, thefirstaxisseemstoseparatethestressedcommunitiesfromthenotstressedcommunities.DPCoA alsoseemstoseparatetheoutthestressedcommunitiesalongthefirstaxis(inthedirectionassociatedwithBacteroidetes), althoughonlyforsubjectsD andE.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Axis1
Axi
s2
−0.2
−0.1
0.0
0.1
0.2
●
●
● ●
●
●
● ●
●●
●●●●
●●●
● ●
●
●
●
●
●
● ●●
●
●●●
●
●
●
●
●●●
●●
●
●
●
●●●
●
●●
●
●
●
●● ●●
● ●●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●●●
●
●●
●●
●
●
●
●
●
●●●
●
●
●
●
D 1
D 2
E 1 E 2
F 1F 2
−0.2 −0.1 0.0 0.1 0.2 0.3 0.4
Antibiotic stress
● 1: not stressed
2: stressed
Subject
● D
● E
● F
PCoA/MDS withunweightedUniFrac. Thelabelsrepresentsubjectplusantibioticcondition.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Axis1
Axi
s2
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
● ●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●●
●
●
●
●
●
●● ●
●
●●●
●
●
●●
● ●
●
●
●●
●
●
● ●●
●
●●
●
●●
●●
●●
● ●
●
●
●
●●
●●●
●●
●●
●
●
●●D 1D 2
E 1
E 2
F 1F 2
−1.0 −0.5 0.0 0.5
CommunitypointsasrepresentedbyDPCoA.Thelabelsrepresentsubjectplusantibioticcondition.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ConclusionsforAntibioticStress
SinceUniFracemphasizesshallowdifferencesonthetreeandsincePCoA/MDS withUniFracseemstoseparatethesubjectsfromeachotherbetterthantheothertwomethods, wecanconcludethatthedifferencesbetweensubjectsaremainlyshallowones. However, DPCoA alsoseparatesthesubjectsandthestressedversusnon-stressedcommunities, andexaminingthecommunityandOTU ordinationscantellusaboutthedifferencesinthecompositionsofthesecommunities.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Distancesenablestatisticiansto....
▶ Summarizedatawithmedians, meansandprincipaldirections.
▶ Encodesomevariationsinuncertainty.▶ Makecomparisonsofheterogeneoussourcesof
information.▶ Integratenetworkandtreeinformation.▶ Measurediversity, inertiaandgeneralizethenotionof
variance.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Questionsformathematicians?▶ Howtomakeamethoddesignedforuniformlydistributed
pointsworkforpointsgeneratedbymixturesofheterogeneousdistributions?ExamplesfromworkbyEdelsbrunner, Carlsson,Zoromodianandco-authors.
Source:Zoromodian.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Questionsformathematicians▶ Howtobuilddistancesbetweenimagesthataccountfor
unequalmeasurementerrors, evenlocally?
x
x1
2x
x
x
x
x
2
.
.
.
.
.
.
p
i
1
3
.
xn ..
WorkbyAdler, TaylorandWorsley(2003,2005,2007)usingRandomFields.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Questionsformathematicians
▶ HowwellcantheEuclideanembeddingapproximationsdo?
▶ Aretherebetterwaysofapproximatingthecommutativediagrams?ThisisalsoanimportantpointofcontactwiththeuseofStein'smethodinprobabilitytheory.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Questionsformathematicians
▶ HowwellcantheEuclideanembeddingapproximationsdo?
▶ Aretherebetterwaysofapproximatingthecommutativediagrams?ThisisalsoanimportantpointofcontactwiththeuseofStein'smethodinprobabilitytheory.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Questionsformathematicians
▶ Howtodistinguishbetweentheeffectofthecurvatureofastatespaceandtheeffectoftheunequalsampling?
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AnswerscomefromDifferentialGeometry.
XavierPennec, YannOllivier, TomFletcher, RabiBhattacharya.Inparticularenableustoincorporatetherelevantdatadependenttransformationsintolocalizedmetrics.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Outputshowingposterioruncertaintymeasures
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
BenefittingfromthetoolsandschoolsofStatisticians.......
Thankstothe R community:▶ RStudiofortoolsforreproducibleresearchandHadley
Wickhamforggplot2.▶ Ecologistsandbiologists: Chessel, Jombart, Dray,
Thioulouse ade4 andEmmanuelParadis ape.
Collaborators: DavidRelman, AlfredSpormann, YvesEscoufier,LesDethfelsen, JustinSonnenburg, PersiDiaconis, SergioBaccallado, ElisabethPurdom.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
LabGroup
PostdoctoralFellowsPaul(Joey)McMurdie, BenCallahan, SimonRubinstein-Salzado, ChristofSeiler.Students: JohnChakerian, JuliaFukuyama, KrisSankaran.Fundingfrom NIH/NIGMS R01, NSF-VIGRE andNSF-DMS.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ReferencesL. Billera, S. Holmes, andK. Vogtmann.Thegeometryoftreespace.Adv.Appl.Maths, 771--801, 2001.
J. ChakerianandS. Holmes.distory:Distancesbetweentrees, 2010.
DanielChessel, AnneDufour, andJeanThioulouse.Theade4package-i: One-tablemethods.R News, 4(1):5--10, 2004.
P. Diaconis, S. Goel, andS. Holmes.Horseshoesinmultidimensionalscalingandkernelmethods.AnnalsofAppliedStatistics, 2007.
Y. Escoufier.Operatorsrelatedtoadatamatrix.InJ.R.et al.Barra, editor, RecentdevelopmentsinStatistics., pages125--131.NorthHolland,, 1977.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Steven N EvansandFrederick A Matsen.ThephylogeneticKantorovich-Rubinsteinmetricforenvironmentalsequencesamples.arXiv, q-bio.PE,Jan2010.
M Hamady, C Lozupone, andR Knight.Fastunifrac: facilitatinghigh-throughputphylogeneticanalysesofmicrobialcommunitiesincludinganalysisofpyrosequencingandphylochipdata.TheISME Journal, Jan2009.
SusanHolmes.Multivariateanalysis: TheFrenchway.InD. NolanandT. P.Speed, editors, ProbabilityandStatistics: EssaysinHonorofDavidA.Freedman,volume 56ofIMS LectureNotes--MonographSeries.IMS,Beachwood, OH,2006.
RossIhakaandRobertGentleman.R:A languagefordataanalysisandgraphics.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
JournalofComputationalandGraphicalStatistics,5(3):299--314, 1996.
K. Mardia, J. Kent, andJ. Bibby.MultiariateAnalysis.AcademicPress, NY., 1979.
P. J.McMurdieandS. Holmes.Phyloseq: Reproduibleresearchplatformforbacterialcensusdata.PlosONE,2013.April22,.
SerbanNacu, RebeccaCritchley-Thorne, PeterLee, andSusanHolmes.Geneexpressionnetworkanalysisandapplicationstoimmunology.Bioinformatics, 23(7):850--8, Apr2007.
SandrinePavoine, Anne-BéatriceDufour, andDanielChessel.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Fromdissimilaritiesamongspeciestodissimilaritiesamongcommunities: adoubleprincipalcoordinateanalysis.JournalofTheoreticalBiology, 228(4):523--537, 2004.
ElizabethPurdom.Analysisofadatamatrixandagraph: Metagenomicdataandthephylogenetictree.AnnalsofAppliedStatistics, Jul2010.
C. R.Rao.Theuseandinterpretationofprincipalcomponentanalysisinappliedresearch.SankhyaA,26:329--359., 1964.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .