102
Using distances to address the challenges of heterogeneous data Susan Holmes http://www-stat.stanford.edu/˜susan/ Bio-X and Statistics, Stanford University July 29, 2015 ABabcdfghiejkl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Usingdistancestoaddressthechallengesofheterogeneousdata

SusanHolmeshttp://www-stat.stanford.edu/˜susan/

Bio-X andStatistics, StanfordUniversity

July29, 2015

ABabcdfghiejkl. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 2: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Themesseswedealwith

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 3: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Homogeneous data are all alike;all heterogeneous data are

heterogeneous in their own way.

.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 4: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

GoalsinModernBiology: SystemsApproachLookatthedata/allthedata: dataintegration

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 5: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

GoalsinModernBiology: SystemsApproachLookatthedata/allthedata: dataintegration

Tumor Cells

0 5000 10000 15000 20000

05000

10000

15000

20000

05e-0

40.0

010 1 1 0 -1 1 0 0 0 -1

0 1 1 0 0 0 0 0 0 1

0 1 -1 0 -1 0 0 0 0 -1

0 1 1 0 0 -1 1 0 1 1

0 1 1 0 0 0 0 1 0 1( (

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 6: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Whatdostatisticiansdo?

▶ Designnewexperimentstotestscientifichypotheses.▶ Visualizeandsummarizedatainwaysthataccountfor

uncertainties.▶ Lookformeaningfuldifferencesorstructureinhigh

dimensionalnoisydata.▶ Predicttheclassofnewobservationsgivenpreviously

observedones.▶ Predictthevalueofaresponsevariablegivenawhole

setofotherexplanatoryvariables.▶ Combinedifferentsourcesofdatatounderstandcomplex

interactions.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 7: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Today'schallenge

▶ Dataarenotuniformlydistributedfromsomemanifold.

▶ Dataarenotanidenticallydistributedrandomsample.

▶ Dataarenotindependent.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 8: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Datacanoftenbeseenaspointsinastatespace

Rp

x

x1

2x

x

x

x

x

2

.

.

.

.

.

.

p

i

1

3

.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 9: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

DistancesinStatistics

▶ EuclideanDistances, spatialdistances.▶ WeightedEuclideandistances: Mahalanobisdistancefor

discriminantanalysis.▶ Chisquaredistancesforcontingencytablesanddiscrete

data.▶ Jaccarddistancesforpresenceabsenceisoneof50

distancesusedinEcology.▶ EarthMover'sdistanceontreesorgraphs.▶ Biologicallymeaningfuldistances(DNA,haplotype,

Proteins).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 10: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Whatdostatisticiansusedistancesfor?

▶ SummariesthroughFréchetMeansandMediansandpseudovariances.

▶ CenterofCloudofObjects Tk (equalweights): Find T0

thatminimizeseither∑K

k=1 d2(T0,Tk) thisisthe (L2)definitionoftheFréchetmeanobject,

▶ or∑K

k=1 d(T0,Tk) (L1 orGeometricMedian).▶ Pseudovariance= 1

K−1

∑Kk=1 d2(T0,Tk) = s2. Dimension

reductionandvisualization.NearestNeighborMethods.Clustering.Makenetworkedgesfromclosepoints. Predictionbyminimizingweightedresidualdistances.Cross-products: correlations, autocorrelations.Generalizationsofanalysisofvariance.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 11: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Whatdostatisticiansusedistancesfor?

▶ SummariesthroughFréchetMeansandMediansandpseudovariances.

▶ Dimensionreductionandvisualization.▶ NearestNeighborMethods.▶ Clustering.▶ Makenetworkedgesfromclosepoints.▶ Predictionbyminimizingweightedresidualdistances.▶ Cross-products: correlations, autocorrelations.▶ Generalizationsofanalysisofvariance.

Findingtherightdistanceusuallysolvesthestatisticalproblem.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 12: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Part I

The Geometries of Data

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 13: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Firstexample: cellsegmentationJointworkwithAdamKapelnerandPP Lee.Stainedbiopsyslides. Multispectralimaging(8levels/wavelengths).StainedLymphNode Aimtoidentifycell.

Pointssimilarinfeaturespaceareofthesametype.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 14: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Problem: Stainingisheterogeneous

Bothimagesarefromthesameimageset. ThestainedcellsarecancercellsstainedwithFastRedred.Someregionsofthetissuestainliketheimageontheleftandotherregionsstainastheleft.ThisshowsthelevelofheterogeneityThesearetwo``subclasses''ofthesamephenotype(theleftisnamedsubclass``A,''theright, subclass``B'').

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 15: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Problem: StainingisheterogeneousExtremevariabilityintheimagecolors/intensity/contrast.Pixelsfromasamecellnotindependentandidenticallydistributedacrossthedifferentslidesoracrossdifferentcelltypes.

Simplenearestneighborapproach:-Take8dimensionalpixelspoints.-Assigningthepointtotheclosestneighbor

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 16: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Problem: StainingisheterogeneousExtremevariabilityintheimagecolors/intensity/contrast.Pixelsfromasamecellnotindependentandidenticallydistributedacrossthedifferentslidesoracrossdifferentcelltypes. ?

Simplenearestneighborapproach:-Take8dimensionalpixelspoints.-Assigningthepointtotheclosestneighbor

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 17: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

0 5 10

−2

02

46

8

Orange

Red

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 18: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

0 5 10

−2

02

46

8

Orange

Red

(3.2,2)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 19: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

0 5 10

−2

02

46

8

Orange

Red

(3.2,2)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 20: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

0 5 10

−2

02

46

8

Orange

Red

(3.2,2)

D12(p,m1)=19.7 D2

2(p,m2)=16

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 21: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

MultivariateNormalData

MahalanobisTransformation.Severaldifferentclusterswithdifferentvariance-covariancematricesanddifferentmeans.(µ1,Σ1) (µ2,Σ2)

D21(x, µ1) = (x− µ1)

TΣ−11 (x− µ1)

D22(x, µ2) = (x− µ2)

TΣ−12 (x− µ2)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 22: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

CorrespondingDataTransformation

H = I− 1Dn1T, S = X′HDnHX

zi. = S− 12 (xi. − x)

Thisissometimescalled`datasphering'.

0 5 10

−20

24

68

Sphered O

Speh

ered

Red

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 23: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 24: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

OutputDataTumor

Tumor Cells

0 5000 10000 15000 20000

050

0010

000

1500

020

000

05e

−04

0.00

1

NumberofTumorcells: 27,822 . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 25: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

WecanaddinformationthroughchoiceofdistancesSampledatacanoftenbeseen Variablesare`vectors'aspointsinastatespace. indatapointspaceRp Rn

x

x1

2x

x

x

x

x

2

.

.

.

.

.

.

p

i

1

3

. x

x 1

2x

x

x

x

x

2

.

.

.

. .

.

n

j

1

3

.

x4 .

x 3.

xtQy =< x, y >Q xtDy =< x, y >DDuality: Transposabledata. . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 26: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

DataAnalysis: GeometricalApproachi. Thedataare p variablesmeasuredon n observations.ii. X with n rows(theobservations)and p columns(the

variables).iii. D isan n× n matrixofweightsonthe``observations'',

whichismostoftendiagonalbutnotalways.iv Symmetricdefinitepositivematrix Q, weightson

. variables, often Q =

1σ21

0 0 0 ...

0 1σ22

0 0 ...

0 0. . . 0 ...

... ... ... 0 1σ2p

.

x

x1

2x

x

x

x

x

2

.

.

.

.

.

.

p

i

1

3

.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 27: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

EuclideanSpaceanddimensionreduction

Thesethreematricesformtheessential``triplet" (X,Q,D)definingamultivariatedataanalysis.Q and D definegeometriesorinnerproductsin Rp and Rn,respectively, through

xtQy =< x, y >Q x, y ∈ Rp

xtDy =< x, y >D x, y ∈ Rn.

Thiscanbeextendedtomoreinnerproductsgivingwhatisknownas Kernel methods.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 28: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

PrincipalComponentAnalysis: DimensionReduction

PCA seekstoreplacetheoriginal(centered)matrix X byamatrixoflowerrank, thiscanbesolvedusingthesingularvaluedecompositionof X:

X = USV′, with U′DU = In and V′QV = Ip and S diagonal

XX′ = US2U′, with U′DU = In and S2 = Λ

PCA isalinearnonparametricmultivariatemethodfordimensionreduction. D and Q aretherelevantmetricsonthedualrowandcolumnspacesof n samplesand p variables.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 29: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

A CommutativeDiagramApproach

CaillezandPages, 1976. Escoufier, 1977.Statisticianssearchforapproximationswithcertainproperties, forthecaseofPCA forinstance, werephrasetheproblemasfollows:

▶ Q canbeseenasalinearfunctionfrom Rp toRp∗ = L(Rp), thespaceofscalarlinearfunctionson Rp.

▶ D canbeseenasalinearfunctionfrom Rn toRn∗ = L(Rn).

V = XtDX

Rp∗ −−−−→X

Rn

Qx yV D

y xW

Rp ←−−−−Xt

Rn∗

W = XQXt

Thisdualitygives`transposable'data.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 30: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

PropertiesoftheDiagram

Rankofthediagram:X,Xt,VQ and WD allhavethesamerank.For Q and D symmetricmatrices, VQ and WD arediagonalisableandhavethesameeigenvalues.

λ1 ≥ λ2 ≥ λ3 ≥ . . . ≥ λr ≥ 0 ≥ · · · ≥ 0.

Eigendecompositionofthediagram: VQ is Q symmetric, thuswecanfind Z suchthat

VQZ = ZΛ,ZtQZ = Ip, where Λ = diag(λ1, λ2, . . . , λp). (1)

ModernextensionstothisapproachincludeKernelmethodsinMachineLearning.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 31: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

PredictingandSummarizingthroughdistances

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 32: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

ComparingTwoDiagrams: theRV coefficientManyproblemscanberephrasedintermsofcomparisonoftwo``dualitydiagrams"orputmoresimply, twocharacterizingoperators, builtfromtwo``triplets", usuallywithoneofthetripletsbeingaresponseorhavingconstraintsimposedonit.Mostoftenwhatisdoneistocomparetwosuchdiagrams,andtrytogetonetomatchtheotherinsomeoptimalway.(O = WD)Tocomparetwosymmetricoperators, thereiseitheravectorcovarianceasinnerproductcovV(O1,O2) = Tr(Ot

1O2) =< O1,O2 > oravectorcorrelation(Escoufier, 1977)

RV(O1,O2) =Tr(Ot

1O2)√Tr(Ot

1O1)tr(Ot2O2)

.

Ifweweretocomparethetwotriplets(Xn×1, 1,

1nIn

)and(

Yn×1, 1,1nIn

)wewouldhave RV = ρ2.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 33: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Part II

Dimension Reduction: theEuclidean embedding workhorse:

MDS

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 34: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

MetricMultidimensionalScalingSchoenberg(1935)

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 35: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

FromCoordinatestoDistancesandBack

Ifwestartedwithoriginaldatain Rp thatarenotcentered:Y, applythecenteringmatrix

X = HY, with H = (I− 1

n11′), and 1′ = (1, 1, 1 . . . , 1)

Call B = XX′, if D(2) isthematrixofsquareddistancesbetweenrowsofX intheeuclideancoordinates, wecanshowthat

−1

2HD(2)H = B

Schoenberg'sresult: exactEuclideandistance If B ispositivesemi-definitethen D canbeseenasadistancebetweenpointsinaEuclideanspace.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 36: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

ReverseengineeringanEuclideanembedding

Wecangobackwardsfromamatrix D to X bytakingtheeigendecompositionof B = −1

2HD(2)H inmuchthesameway

thatPCA providesthebestrank r approximationfordatabytakingthesingularvaluedecompositionof X, ortheeigendecompositionof XX′.

X(r) = US(r)V′ with S(r) =

s1 0 0 0 ...0 s2 0 0 ...0 0 ... ... ...0 0 ... sr ...... ... ... 0 0

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 37: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

MultidimensionalScaling(MDS)

Simpleclassicalmultidimensionalscaling.▶ SquareD elementwise D(2) = D2.▶ Compute −1

2 HD2H = B.▶ Diagonalize B tofindtheprincipalcoordinates SV′.▶ Chooseanumberofdimensionsbyinspectingthe

eigenvalue'sscreeplot.Theadvantageisthattheoriginaldistancesdon'thavetobeEuclidean.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 38: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

TakingCategoricalDataandMakingitintoaContinuum

HorseshoeExample:JointwithPersiDiaconisandSharadGoel(AnnalsofAppliedStats, 2005). Datafrom2005U.S.HouseofRepresentativesrollcallvotes. Wefurtherrestrictedouranalysistothe401Representativesthatvotedonatleast90% oftherollcalls(220Republicans, 180Democratsand1Independent)leadingtoa 401× 669 matrixofvotingdata.

TheDataV1 V2 V3 V4 V5 V6 V7 V8 V9 V10

R1 -1 -1 1 -1 0 1 1 1 1 1 ...

R2 -1 -1 1 -1 0 1 1 1 1 1 ...

R3 1 1 -1 1 -1 1 1 -1 -1 -1 ...

R4 1 1 -1 1 -1 1 1 -1 -1 -1 ...

R5 1 1 -1 1 -1 1 1 -1 -1 -1 ...

R6 -1 -1 1 -1 0 1 1 1 1 1 ...

R7 -1 -1 1 -1 -1 1 1 1 1 1 ...

R8 -1 -1 1 -1 0 1 1 1 1 1 ...

R9 1 1 -1 1 -1 1 1 -1 -1 -1 ...

R10 -1 -1 1 -1 0 1 1 0 0 0 .... .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 39: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

L1 distance

Wedefineadistancebetweenlegislatorsas

d(li, lj) =1

669

669∑k=1

|vik − vjk|.

Roughly, d(li, lj) isthepercentageofrollcallsonwhichlegislators li and lj disagreed.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 40: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

!0.1!0.05

00.05

0.1!0.2

!0.1

0

0.1

0.2

!0.2

!0.15

!0.1

!0.05

0

0.05

0.1

0.15

3-DimensionalMDS mappingoflegislatorsbasedonthe2005U.S.HouseofRepresentativesrollcallvotes. Weused

dissimilarityindices1-exp(−λd(R1,R2))

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 41: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

!0.1!0.05

00.05

0.1!0.2

!0.1

0

0.1

0.2

!0.2

!0.15

!0.1

!0.05

0

0.05

0.1

0.15

3-DimensionalMDS mappingoflegislatorsbasedonthe2005U.S.HouseofRepresentativesrollcallvotes. Colorhasbeenaddedtoindicatethepartyaffiliationofeachrepresentative.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 42: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

0 50 100 150 200 250 300 350 4000

10

20

30

40

50

60

70

80

90

100

MDS Rank

Natio

nal J

ourn

al S

core

ComparisonoftheMDS derivedrankforRepresentativeswiththeNationalJournal'sliberalscore

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 43: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

AnApplication: VisualizingGeodesicDistancesbetweenTrees

▶ NearestNeighborInterchange(NNI). RotationMoves

4

0

2 31

0

4321

0

41 32

▶ Fill-inofNNI moves: Billera, Holmes, Vogtmann(2001)(BHV).Theboundariesbetweenregionsrepresentanareaofuncertaintyabouttheexactbranchingorder. Inbiologicalterminologythisiscalledan`unresolved'tree.Moredetailshere

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 44: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

AnApplication: VisualizingGeodesicDistancesbetweenTrees

▶ NearestNeighborInterchange(NNI). RotationMoves

4

0

2 31

0

4321

0

41 32

▶ Fill-inofNNI moves: Billera, Holmes, Vogtmann(2001)(BHV).Theboundariesbetweenregionsrepresentanareaofuncertaintyabouttheexactbranchingorder. Inbiologicalterminologythisiscalledan`unresolved'tree.Moredetailshere

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 45: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

AnApplication: VisualizingGeodesicDistancesbetweenTrees

▶ NearestNeighborInterchange(NNI). RotationMoves

4

0

2 31

0

4321

0

41 32

▶ Fill-inofNNI moves: Billera, Holmes, Vogtmann(2001)(BHV).Theboundariesbetweenregionsrepresentanareaofuncertaintyabouttheexactbranchingorder. Inbiologicalterminologythisiscalledan`unresolved'tree.Moredetailshere

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 46: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

AnApplication: VisualizingGeodesicDistancesbetweenTrees

▶ NearestNeighborInterchange(NNI). RotationMoves

4

0

2 31

0

4321

0

41 32

▶ Fill-inofNNI moves: Billera, Holmes, Vogtmann(2001)(BHV).Theboundariesbetweenregionsrepresentanareaofuncertaintyabouttheexactbranchingorder. Inbiologicalterminologythisiscalledan`unresolved'tree.Moredetailshere

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 47: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

A ConePath

A pathbetweentwotrees T and T′ alwaysexists. Sinceallorthantsconnectattheorigin, anytwotrees T and T′ canbeconnectedbyatwo-segmentpath, thisiscalledthecone-path.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 48: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

c

a

c

b ba

Theorem(Billera, Holmes, Vogtmann(BHV,2001)):TreespacewithBHV metricisaCAT(0)space, thatis, ithasnon-positivecurvature.Thisimpliestherearegeodesicbetweenanytwotrees(Gromov).Note: ThisspaceoftreesisnotanEuclideanspace.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 49: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

c

a

c

b baThesizeofthe``pseudo-variance''canbeestimatedfrom∑

pid(T0,Ti)2.PropertiesoftheFréchetmeanofasetoftreeshasbeen(Bhattacharyaetal.2010, Miller, Mattingley, Owen, Marron, al.2013).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 50: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

PhylogeneticTreesMalariaDataasseenusing ape

Pre1

Pme2

Plo6

Pga11

Pma3

Pbe5

Pfr7

Pkn8

Pcy9

Pvi10

Pfa4

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 51: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

SamplingDistributionforTrees

Data 1

23

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 52: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Data 1

23

Treespace Tn

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 53: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Data1

23

4

True Sampling Distribution

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 54: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Data1

23

4

Bootstrap Sampling Distribution (non parametric)

n

^

*

**

*

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 55: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

BootstrapofMalariaData

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 56: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

HierarchicalClusteringTrees

HEA2

5_EF

FE_3

MEL

39_E

FFE_

2HE

A31_

EFFE

_2M

EL67

_EFF

E_4

HEA5

5_EF

FE_4

HEA5

9_EF

FE_5

HEA2

6_EF

FE_1

MEL

51_E

FFE_

5M

EL36

_EFF

E_1

MEL

53_E

FFE_

3HE

A31_

NAI_

2HE

A55_

NAI_

4M

EL67

_NAI

_4M

EL53

_NAI

_3HE

A25_

NAI_

3M

EL51

_NAI

_5HE

A59_

NAI_

5HE

A26_

NAI_

1M

EL36

_NAI

_1M

EL39

_NAI

_2M

EL51

_MEM

_5HE

A26_

MEM

_1M

EL67

_MEM

_4HE

A31_

MEM

_2HE

A55_

MEM

_4HE

A25_

MEM

_3HE

A59_

MEM

_5M

EL53

_NAI

_3M

EL36

_MEM

_1M

EL39

_MEM

_2

Human AF5q31 protein (AF5q31) intracellular hyaluronan−bindiselectin L (lymphocyte adhesioHuman cDNA FLJ10470 fis, cloneHuman mRNA for KIAA0303 gene, KIAA0303 proteinSTAT induced STAT inhibitor 3KIAA0752 proteinGRB2−related adaptor proteindelta (Drosophila)−like 1stanninproteoglycan link proteinIncyte ESTamyloid beta (A4) precursor prHuman genomic DNA, chromosome follicular lymphoma variant trHuman, clone IMAGE:3875338, mRHuman, Similar to phosphodiestHuman sodium/myo−inositol cotrESTs, Weakly similar to MUC2_HHuman CpG island DNA genomic MPOU domain, class 2, transcripHuman zinc finger protein ZNF2Human 54 kDa progesterone receprotein tyrosine phosphatase, platelet/endothelial cell adheHuman cDNA FLJ20849 fis, cloneHuman mRNA for KIAA0972 proteiKIAA0290 proteinHuman clone 295, 5cM region sueukaryotic translation initiatferritin, heavy polypeptide 1Human cDNA: FLJ22008 fis, clonHuman insulin−like growth factgranzyme K (serine protease, gHuman Epstein−Barr virus inducPAS−serine/threonine kinaselymphotoxin beta (TNF superfamHuman mRNA for nel−related prochemokine (C−C motif) receptorPAS−serine/threonine kinaseHuman mRNA for alpha−actinin, Human mRNA encoding the c−myc Human RATS1 mRNA, complete cdshyaluronoglucosaminidase 2Human DNA for muscle nicotinicHuman epithelial V−like antigeinterferon gamma receptor 2 (iHomo sapiens clone 24775 mRNA syntaphilinHuman mRNA for endosialin protHuman zinc finger protein PLAGshort−chain dehydrogenase/reduHuman, short−chain dehydrogena

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 57: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

0 20000 40000 60000 80000 120000

Eigenvalues of MDS for bootstrapped trees

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 58: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

−40 −20 0 20 40

−60

−40

−20

020

40

Bootstrapped trees

o1

2

3

4

5

6

7

8

9

1011

12

1314

15

1617

18

19

20

21

22

23

24

25 26

2728

29

30

31

32

33

34

35

36

373839

40

41

42

43

4445

46

47 4849

50

51

52

53

54

55

5657

5859

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

7576

77

78

7980

81

8283

84

85

86

8788

89

90

91

92

93

94

95

9697

98

99

100o

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 59: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Part III

Combine and Compare Trees,Graphs and Contingent Count

Data

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 60: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

LayersofDatainthe MicrobiomeJoshuaLederberg:`theecologicalcommunityofcommensal,symbiotic, andpathogenicmicroorganismsthatliterallyshareourbodyspaceandhavebeenallbutignoredasdeterminantsofhealthanddisease'Microbiome Completecollectionofgenescontainedinthe

genomesofmicrobeslivinginagivenenvironment.

Numbers Humansshelter100trillionmicrobes(1014), (wearemadeof10 ×1012 cells).

Metagenome Compositionofallgenespresentinanenvironment(soil, gut, seawater), regardlessofspecies.

Transciptome ThesearethemRNA transcriptsinthecell, itreflectsthegenesthatarebeingactivelyexpressedatanygiventime.

Metabolome Themetabolites(smallmolecules)nucleicorfattyacids, sugars,... presentinthesampleeitherendogenousorexogenous(medication, pollution).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 61: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

.

Source: YK LeeandSK MazmanianScience, 2010.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 62: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Bacteriaetc... andUs

Thehumanmicrobiomeorhumanmicrobiotaistheassemblageofmicroorganismsthatresideonthesurfaceandindeeplayersofskin, inthesalivaandoralmucosa, intheconjunctiva, andinthegastrointestinaltracts.

▶ Theyincludebacteria, fungi, andarchaea.▶ Someoftheseorganismsperformtasksthatareuseful

forthehumanhost. (liveinsymbiosis)▶ Majorityhavenoknownbeneficialorharmfuleffect.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 63: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

HumanMicrobiome: Whatarethedata?

DNA TheGenomicmaterialpresent(16sRNA-geneespecially, butalsoshotgun).

RNA Whatgenesarebeingturnedon(geneexpression), transcriptomics.

MassSpec Specificsignaturesofchemicalcompoundspresent(LC/MS,GC/MS).

Clinical Multivariateinformationaboutpatients'clinicalstatus, medication, weight.

Environmental Location, nutrition, drugs, chemicals,temperature, time.

DomainKnowledge Metabolicnetworks, phylogenetictrees,geneontologies.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 64: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

HeterogeneousDataObjects

Objectorientedinputanddatamanipulationwith phyloseq

(McMurdieandHolmes, 2013, PlosONE)ObjectorienteddatainR:

Taxonomy Table taxonomyTableslots: .Data

OTU Abundanceclass: otuTableslots: .Data, speciesAreRows

Sample VariablessampleDataslots: .Data,names,row.names,.S3Class

Phylogenetic Treeclass: phyloslots: see ape package

matrix matrixdata.frame

phyloseqslots:otuTablesampleDatataxTabtre

Experiment-level data object:

Component data objects:

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 65: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Part IV

Heteroscedasticity: Mixturesand to Normalize them

Source: xkcd.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 66: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Pointsaremeasuredwithunequalvariance

x

x1

2x

x

x

x

x

2

.

.

.

.

.

.

p

i

1

3

.

xn ..

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 67: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Some real data (Caporoso et al, 2011)

> GlobalPatterns

phyloseq-class experiment-level object

otu_table() OTU Table: [ 19216 taxa and 26 samples ]

sample_data()Sample Data: [ 26 samples by 7 sample variables ]

tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]

phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ]

> sample_sums(GlobalPatterns)

CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr

864077 1135457 697509 1543451 2076476 718943 433894 186297

.....

NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1

1478965 1652754 58688 493126 279704 937466 1211071 1216137

> summary(sample_sums(GlobalPatterns))

Min. 1st Qu. Median Mean 3rd Qu. Max.

58690 567100 1107000 1085000 1527000 2357000

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 68: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Pointsaremeasuredwithunequalvariance

x

x1

2x

x

x

x

x

2

.

.

.

.

.

.

p

i

1

3

.

xn ..

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 69: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Equalizationofvariances

Inthisbinomialexamplethevarianceoftheproportionestimateis Var(Xn ) =

pqn =

qnE(

Xn ), afunctionofthemean.

Thisisacommonoccurrenceandonethatistraditionallydealtwithinstatisticsbyapplyingvariance-stabilizingtransformations.However, inordertofindtherighttransformation, weneedagoodmodelfortheerror.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 70: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

VarianceStabilization

Prefertodealwitherrorsacrosssampleswhichareindependentandidenticallydistributed.Inparticularhomoscedasticity(equalvariances)acrossallthenoiselevels.Thisisnotthecasewhenwehaveunequalsamplesizesandvariationsintheaccuracyacrossinstruments.A standardwayofdealingwithheteroscedasticnoiseistotrytodecomposethesourcesofheterogeneityandapplytransformationsthatmakethenoisevariancealmostconstant.Thesearecalled variancestabilizingtransformations.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 71: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

MixtureModelingworksMiracles

▶ Beta-Binomial(deepSNV).▶ ZeroinflatedPoissonorGaussian.▶ Gamma-Poisson.

MixturesareubiquitousbecauseofamathematicaltheoremDeFinnetti'sTheorem

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 72: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

WolfgangHuber. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 73: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Correcttransformationsareavailable

McMurdieandHolmes(2014)``WasteNot, WantNot: Whyrarefyingmicrobiomedataisinadmissible'', PLOSComputationalBiology, Methods.WeproposetomodelthereadcountsIftechnicalreplicateshavesamenumberofreads: sj,Poissonvariationwithmean µ = sjui.Taxa i incidenceproportion ui.Numberofreadsforthesample j andtaxa i wouldbe

Kij ∼ Poisson (sjui)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 74: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

A distanceontheknowntreeMonge-Kantorovichearthmover'sdistanceonthetree.Usedtocomparetwosamplesorbodysitesforinstance.Incorporatetaxaabundancesandphylogenetictree

Epulopiscium

Clostridium

Adlercreutzia

Lachnospira

Alistipes

Roseburia

Coprococcus

Clostridium

Blautia

Coprococcus

Dehalobacterium

Clostridium

Clostridium

Clostridium

Coprobacillus

Coprococcus

Clostridium

Clostridium

Moryella

Abundance

1

25

625

Class

Actinobacteria (class)

Bacilli

Bacteroidia

Clostridia

Erysipelotrichi

Gammaproteobacteria

Mollicutes

Verrucomicrobiae

YS2

Dualitydiagrammethodsthatcanuseanydependencystructure.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 75: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

UnifracDistance(LozuponeandKnight, 2005)

isadistancebetweengroupsoforganismsthatarerelatedtoeachotherbyatree.SupposewehavetheOTUspresentinsample1(blue)andinsample2(red).Question: Dothetwosamplesdifferphylogenetically?ItisdefinedastheratioofthesumofthelengthsofthebranchesleadingtomembersofgroupA ormembersofgroupB butnotbothtothetotalbranchlengthofthetree.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 76: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

WeightedUnifracdistance A modificationofUniFrac,weightedUniFracisdefinedin(Lozuponeetal., 2007)as

n∑i=1

bi × |AiAT− Bi

BT|

▶ n = numberofbranchesinthetree

▶ bi =lengthoftheithbranch

▶ Ai =numberofdescendantsofithbranchingroupA

▶ AT =totalnumberofsequencesingroupA

[7].[6]. . .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 77: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Costelloetal. 2010

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 78: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Rao'sDistance

Westartwithadistancebetweenindividuals.Theheterogeneityofapopulation(Hi )istheaveragedistancebetweenmembersofthatpopulation.Theheterogeneitybetweentwopopulations(Hij)istheaveragedistancebetweenamemberofpopulation i andamemberofpopulation j.Thedistancebetweentwopopulationsis

Dij = Hij −1

2(Hi + Hj)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 79: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

DecompositionofDiversity

Ifwehavepopulations 1, . . . , k withfrequencies π1, . . . , πk,thenthediversityofallthepopulationstogetheris

H0 =

k∑i=1

πiHi +∑i

∑j

πiπjDij = H(w) + D(b)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 80: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

DoublePrincipalCoordinateAnalysisPavoine, DufourandChessel(2004), Purdom(2010)andFukuyamaetal. (2011). .Supposewehavenspeciesinplocationsanda(euclidean)matrix ∆ givingthesquaresofthepairwisedistancesbetweenthespecies. Thenwecan

▶ Usethedistancesbetweenspeciestofindanembeddingin n− 1 -dimensionalspacesuchthattheeuclideandistancesbetweenthespeciesisthesameasthedistancesbetweenthespeciesdefinedin ∆.

▶ Placeeachoftheplocationsatthebarycenterofitsspeciesprofile. TheeuclideandistancesbetweenthelocationswillbethesameasthesquarerootoftheRaodissimilaritybetweenthem.

▶ UsePCA tofindalower-dimensionalrepresentationofthelocations.

Givethespeciesandcommunitiescoordinatessuchthattheinertiadecomposesthesamewaythediversitydoes.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 81: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

FukuyamaandHolmes, 2012.Method Originaldescription Newformula PropertiesDPCoA square root of Rao's distance

basedonthesquarerootofthepatristicdistances

[∑

i bi(Ai/AT − Bi/BT)2]1/2 Mostsensitivetooutliers, leastsensitive to noise, upweightsdeep differences, gives OTUlocations

wUniFrac∑

i bi |Ai/AT − Bi/BT|∑

i bi |Ai/AT − Bi/BT| Less sensitive to outliers/moresensitivetonoisethanDPCoA

UniFrac fractionofbranchesleadingtoexactlyonegroup

∑i bi1{

Ai/AT−Bi/BTAi/AT+Bi/BT

≥ 1} Sensitive to noise, upweightsshallowdifferencesonthetree

Summaryofthemethodsunderconsideration. ``Outliers"referstohighlyabundantOTUs, andnoisereferstonoiseindetectinglow-abundanceOTUs(seethetextformoredetail).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 82: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

AntibioticTimeCourseData

Measurementsofabout2500differentbacterialOTUsfromstoolsamplesofthreepatients(D,E,F)Eachpatientsampled ∼ 50timesduringthecourseoftreatmentwithciprofloxacin(anantibiotic).TimescategorizedasPreCp, 1stCp, 1stWPC (weekpostcipro), Interim, 2ndCp, 2ndWPC,andPostCp.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 83: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

UniFrac

Axis 1: 14.7%

Axi

s 2:

10.

3%

−0.2

−0.1

0.0

0.1

0.2

●●●

●●

●●

●●

●●●

● ●

●●

●●●●●

●●●●

●●

●●

●●

●●

●●

●● ●●

●●●●

−0.2 −0.1 0.0 0.1 0.2 0.3 0.4

weighted UniFrac

Axis 1: 47.6%A

xis

2: 1

2.3%

−0.2

−0.1

0.0

0.1

0.2

●●● ● ●

●●

●●

●●

●●●

● ●

● ●●

●●

●● ●●●

●●

●●

●●

● ●●

−0.4−0.3−0.2−0.1 0.0 0.1 0.2 0.3

weighted UF on presence/absence

Axis 1: 32.7%

Axi

s 2:

15.

1%

−0.05

0.00

0.05

0.10

●●

●●

● ●●

●●

●●

●●●

●●

●●

● ●

−0.10−0.050.000.050.100.150.20

subject

● D

E

F

ComparingtheUniFracvariants. Fromlefttoright:PCoA/MDS withunweightedUniFrac, withweightedUniFrac,andwithweightedUniFracperformedonpresence/absencedataextractedfromtheabundancedatausedintheothertwoplots

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 84: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

(a) MDS of OTUs

Axis 1: 6.2%

Axi

s 2:

3.7

%

−1.5

−1.0

−0.5

0.0

0.5

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●

−1.0 −0.5 0.0 0.5

(c) DPCoA OTU plot

CS1

CS

2

−1.0

−0.5

0.0

0.5

1.0

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●

●●●●●

●●●●●●●●●●●●●●●

●●●●●

●●●●●●●●●●●●●●

●●●●●

●●

●●

●●●●●●●●

●●●●●●●●●

●●

●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

−1.5 −1.0 −0.5 0.0 0.5 1.0

phylum

● 4C0d−2

● Actinobacteria

● Bacteroidetes

● Candidate division TM7

● Cyanobacteria

● Firmicutes

● Fusobacteria

● Lentisphaerae

● Proteobacteria

● Synergistetes

● Verrucomicrobia

(b) DPCoA community plot

Axis 1: 40.9%

Axi

s 2:

13.

3%−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

●●●

● ●

● ●

● ●

●●

● ●

● ●

●●

●●

● ●

−1.0 −0.5 0.0 0.5

subject

● D

E

F

(a)PCoA/MDS oftheOTUsbasedonthepatristicdistance, (b)communityand(c)speciespointsforDPCoA afterremovingtwooutlyingspecies.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 85: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

AntibioticStress

Wenextwanttovisualizetheeffectoftheantibiotic.OrdinationsofthecommunitiesduetoDPCoA andUniFracwithinformationaboutthewhetherthecommunitywasstressedornotstressed(precipro, interim, andpostciprowereconsidered``notstressed'', whilefirstcipro, firstweekpostcipro, secondcipro, andsecondweekpostciprowereconsidered``stressed'').WeseethatforUniFrac, thefirstaxisseemstoseparatethestressedcommunitiesfromthenotstressedcommunities.DPCoA alsoseemstoseparatetheoutthestressedcommunitiesalongthefirstaxis(inthedirectionassociatedwithBacteroidetes), althoughonlyforsubjectsD andE.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 86: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Axis1

Axi

s2

−0.2

−0.1

0.0

0.1

0.2

● ●

● ●

●●

●●●●

●●●

● ●

● ●●

●●●

●●●

●●

●●●

●●

●● ●●

● ●●

●●●

●●

●●●

●●

●●

●●●

D 1

D 2

E 1 E 2

F 1F 2

−0.2 −0.1 0.0 0.1 0.2 0.3 0.4

Antibiotic stress

● 1: not stressed

2: stressed

Subject

● D

● E

● F

PCoA/MDS withunweightedUniFrac. Thelabelsrepresentsubjectplusantibioticcondition.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 87: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Axis1

Axi

s2

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

●●

●●

● ●

● ●

●●●

●●

●●

●● ●

●●●

●●

● ●

●●

● ●●

●●

●●

●●

●●

● ●

●●

●●●

●●

●●

●●D 1D 2

E 1

E 2

F 1F 2

−1.0 −0.5 0.0 0.5

CommunitypointsasrepresentedbyDPCoA.Thelabelsrepresentsubjectplusantibioticcondition.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 88: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

ConclusionsforAntibioticStress

SinceUniFracemphasizesshallowdifferencesonthetreeandsincePCoA/MDS withUniFracseemstoseparatethesubjectsfromeachotherbetterthantheothertwomethods, wecanconcludethatthedifferencesbetweensubjectsaremainlyshallowones. However, DPCoA alsoseparatesthesubjectsandthestressedversusnon-stressedcommunities, andexaminingthecommunityandOTU ordinationscantellusaboutthedifferencesinthecompositionsofthesecommunities.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 89: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Distancesenablestatisticiansto....

▶ Summarizedatawithmedians, meansandprincipaldirections.

▶ Encodesomevariationsinuncertainty.▶ Makecomparisonsofheterogeneoussourcesof

information.▶ Integratenetworkandtreeinformation.▶ Measurediversity, inertiaandgeneralizethenotionof

variance.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 90: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Questionsformathematicians?▶ Howtomakeamethoddesignedforuniformlydistributed

pointsworkforpointsgeneratedbymixturesofheterogeneousdistributions?ExamplesfromworkbyEdelsbrunner, Carlsson,Zoromodianandco-authors.

Source:Zoromodian.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 91: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Questionsformathematicians▶ Howtobuilddistancesbetweenimagesthataccountfor

unequalmeasurementerrors, evenlocally?

x

x1

2x

x

x

x

x

2

.

.

.

.

.

.

p

i

1

3

.

xn ..

WorkbyAdler, TaylorandWorsley(2003,2005,2007)usingRandomFields.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 92: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Questionsformathematicians

▶ HowwellcantheEuclideanembeddingapproximationsdo?

▶ Aretherebetterwaysofapproximatingthecommutativediagrams?ThisisalsoanimportantpointofcontactwiththeuseofStein'smethodinprobabilitytheory.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 93: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Questionsformathematicians

▶ HowwellcantheEuclideanembeddingapproximationsdo?

▶ Aretherebetterwaysofapproximatingthecommutativediagrams?ThisisalsoanimportantpointofcontactwiththeuseofStein'smethodinprobabilitytheory.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 94: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Questionsformathematicians

▶ Howtodistinguishbetweentheeffectofthecurvatureofastatespaceandtheeffectoftheunequalsampling?

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 95: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

AnswerscomefromDifferentialGeometry.

XavierPennec, YannOllivier, TomFletcher, RabiBhattacharya.Inparticularenableustoincorporatetherelevantdatadependenttransformationsintolocalizedmetrics.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 96: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Outputshowingposterioruncertaintymeasures

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 97: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

BenefittingfromthetoolsandschoolsofStatisticians.......

Thankstothe R community:▶ RStudiofortoolsforreproducibleresearchandHadley

Wickhamforggplot2.▶ Ecologistsandbiologists: Chessel, Jombart, Dray,

Thioulouse ade4 andEmmanuelParadis ape.

Collaborators: DavidRelman, AlfredSpormann, YvesEscoufier,LesDethfelsen, JustinSonnenburg, PersiDiaconis, SergioBaccallado, ElisabethPurdom.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 98: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

LabGroup

PostdoctoralFellowsPaul(Joey)McMurdie, BenCallahan, SimonRubinstein-Salzado, ChristofSeiler.Students: JohnChakerian, JuliaFukuyama, KrisSankaran.Fundingfrom NIH/NIGMS R01, NSF-VIGRE andNSF-DMS.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 99: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

ReferencesL. Billera, S. Holmes, andK. Vogtmann.Thegeometryoftreespace.Adv.Appl.Maths, 771--801, 2001.

J. ChakerianandS. Holmes.distory:Distancesbetweentrees, 2010.

DanielChessel, AnneDufour, andJeanThioulouse.Theade4package-i: One-tablemethods.R News, 4(1):5--10, 2004.

P. Diaconis, S. Goel, andS. Holmes.Horseshoesinmultidimensionalscalingandkernelmethods.AnnalsofAppliedStatistics, 2007.

Y. Escoufier.Operatorsrelatedtoadatamatrix.InJ.R.et al.Barra, editor, RecentdevelopmentsinStatistics., pages125--131.NorthHolland,, 1977.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 100: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Steven N EvansandFrederick A Matsen.ThephylogeneticKantorovich-Rubinsteinmetricforenvironmentalsequencesamples.arXiv, q-bio.PE,Jan2010.

M Hamady, C Lozupone, andR Knight.Fastunifrac: facilitatinghigh-throughputphylogeneticanalysesofmicrobialcommunitiesincludinganalysisofpyrosequencingandphylochipdata.TheISME Journal, Jan2009.

SusanHolmes.Multivariateanalysis: TheFrenchway.InD. NolanandT. P.Speed, editors, ProbabilityandStatistics: EssaysinHonorofDavidA.Freedman,volume 56ofIMS LectureNotes--MonographSeries.IMS,Beachwood, OH,2006.

RossIhakaandRobertGentleman.R:A languagefordataanalysisandgraphics.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 101: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

JournalofComputationalandGraphicalStatistics,5(3):299--314, 1996.

K. Mardia, J. Kent, andJ. Bibby.MultiariateAnalysis.AcademicPress, NY., 1979.

P. J.McMurdieandS. Holmes.Phyloseq: Reproduibleresearchplatformforbacterialcensusdata.PlosONE,2013.April22,.

SerbanNacu, RebeccaCritchley-Thorne, PeterLee, andSusanHolmes.Geneexpressionnetworkanalysisandapplicationstoimmunology.Bioinformatics, 23(7):850--8, Apr2007.

SandrinePavoine, Anne-BéatriceDufour, andDanielChessel.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 102: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way

Fromdissimilaritiesamongspeciestodissimilaritiesamongcommunities: adoubleprincipalcoordinateanalysis.JournalofTheoreticalBiology, 228(4):523--537, 2004.

ElizabethPurdom.Analysisofadatamatrixandagraph: Metagenomicdataandthephylogenetictree.AnnalsofAppliedStatistics, Jul2010.

C. R.Rao.Theuseandinterpretationofprincipalcomponentanalysisinappliedresearch.SankhyaA,26:329--359., 1964.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .