36
MICROBIOME DATA ARE COMPOSITIONAL Adam R. Rivers, Ph.D. September 2019 https://tinyecology.com What this means and how to deal with it Agricultural Research Service

MICROBIOME DATA ARE COMPOSITIONAL - USDA ARS...triplet femur, tibia, humerus, and seeks the correlation between the indices femur / humerus and tibia / humerus. He might reasonably

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

  • MICROBIOME DATA ARE COMPOSITIONAL

    Adam R. Rivers, Ph.D.September 2019 https://tinyecology.com

    What this means and how to deal with it

    Agricultural Research Service

  • SEQUENCING IS A SAMPLING PROCESS• Sequencing samples n DNA molecules

    from a pool of n molecules. n

  • EVERY TAXA(GENE) AFFECTS ALL OTHERS

    • In quantitative metagenomics reads are assigned to taxa (or genes).

    • The total number of taxa and the reads assigned to each taxa affect all other taxa.

    Gloor et al. Microbiome Datasets Are Compositional

    which must be filled. Returning to our tiger and ladybuganalogy, the migration of ladybugs into an area containinga fixed number of slots that are already filled must displaceeither tigers or ladybugs from the occupied slots. This analogyextents, without restriction, to any number of taxa, and to anyfixed capacity instrument (Aitchison, 1986; Lovell et al., 2011;Friedman and Alm, 2012; Fernandes et al., 2013, 2014; Lovellet al., 2015; Mandal et al., 2015; Gloor et al., 2016a,b; Gloor andReid, 2016; Tsilimigras and Fodor, 2016). Thus, the total readcount observed in a HTS run is a fixed-size, random sampleof the relative abundance of the molecules in the underlyingecosystem. Moreover, the count can not be related to the absolutenumber of molecules in the input sample as shown in Figure 1.This is implicitly acknowledged when microbiome datasets areconverted to relative abundance values, or normalized counts,or are rarefied (McMurdie and Holmes, 2014; Weiss et al., 2017)prior to analysis. Thus the number of reads obtained is irrelevant,and contains only information on the precision of the estimate(Fernandes et al., 2013). Data that are naturally described asproportions or probabilities, or with a constant or irrelevantsum, are referred to as compositional data. Compositional datacontains information about the relationships between the parts(Aitchison, 1986; Pawlowsky-Glahn et al., 2015).

    Data about a microbiome collected by high throughputsequencing are often examined under the assumption thatsequencing is, in some way, counting the number of moleculesassociated with the bacteria in the population, as illustrated bythe top barplot in Figure 1B. We can see the difference betweencounts and compositions by comparing the data for the actualcounts for three samples in the top barplot with their proportionsin the bottom barplot. Note, that samples 2 and 3 in Figure 1Bhave the same proportional abundances even though they havedifferent absolute counts prior to sequencing. The difference inapparent direction of change is shown in Figure 1C and we canobserve that the relationship between absolute abundance in theenvironment and the relative abundance after sequencing is notpredictable.

    2. PROBLEMS WITH CURRENT METHODSOF ANALYSIS

    We will briefly outline the problems that arise whencompositional data are examined using a non-compositionalparadigm, stepping through the usual stages of analysis shownin Figure 2. All these issues have been extensively reviewedand debated in both the older and the more recent literature infields as diverse as economics, geology and ecology. Thus, ratherthan present an exhaustive explanation of the problems, we willoutline the major issue and cite a few useful resources.

    It is very difficult to collect exactly the same numberof sequence reads for each sample. This can be because ofdifferences in platform (e.g., MiSeq vs. HiSeq) or because oftechnical difficulties in loading the same molar amounts of thesequencing libraries on the instrument, or because of randomvariation. The total number of counts observed (often referred toas read depth) is a major confounder for distance or dissimilarity

    FIGURE 1 | High-throughput sequencing data are compositional. (A)

    illustrates that the data observed after sequencing a set of nucleic acids from a

    bacterial population cannot inform on the absolute abundance of molecules.

    The number of counts in a high throughput sequencing (HTS) dataset reflect

    the proportion of counts per feature (OTU, gene, etc.) per sample, multiplied by

    the sequencing depth. Therefore, only the relative abundances are available.

    The bar plots in (B) show the difference between the count of molecules and

    the proportion of molecules for two features, A (red) and B (gray) in three

    samples. The top bar graphs show the total counts for three samples, and the

    height of the color illustrates the total count of the feature. When the three

    samples are sequenced we lose the absolute count information and only have

    relative abundances, proportions, or “normalized counts” as shown in the

    bottom bar graph. Note that features A and B in samples 2 and 3 appear with

    the same relative abundances, even though the counts in the environment are

    different. The table below in (C) shows real and perceived changes for each

    sample if we transition from one sample to another.

    calculations for multivariate ordinations derived from thesedistances (McMurdie and Holmes, 2014). Initial attempts in themicrobiome field used “rarefaction” or subsampling of the readcounts of each sample to a common read depth to attempt tocorrect this problem (Lozupone et al., 2011; Wong et al., 2016).The use of subsampling has been questioned since it results ina loss of information and precision (McMurdie and Holmes,2014), and the practice of count normalization from the RNA-seq field has been advocated instead. There are a number ofcount normalization methods used and two, the trimmed meanof M values (TMM) (Robinson and Oshlack, 2010), and themedian method (Anders and Huber, 2010) are similar to a log-ratio transformations, but are less suitable in highly asymmetricalor sparse datasets (Fernandes et al., 2013; Gloor et al., 2016a).These transformations are further undesirable since the numberof counts observed by the instrument, by design, can not containany information on the actual number of molecules in theenvironment, and because the investigator naturally interpretsthe results as counts instead of log-ratios.

    One of the first analysis steps in a traditional analysis,following rarefaction or count normalization, is the calculation ofa distance or dissimilarity (DD) matrix from the data that is used

    Frontiers in Microbiology | www.frontiersin.org 2 November 2017 | Volume 8 | Article 2224

    Gloor et al. 2017

    Example 1

  • EVERY TAXA(GENE) AFFECTS ALL OTHERS

    • In quantitative metagenomics reads are assigned to taxa (or genes).

    • The total number of taxa and the reads assigned to each taxa affect all other taxa.

    Example 2

    Morton et al. 2016

    https://doi.org/10.3389/fmicb.2017.02224https://doi.org/10.1128/mSystems.00162-16.

  • EARLY WAYS OF DEALING WITH COMPOSITIONS

    Method Issues

    Rarefaction loss of information and precision

    RNASeq TMM and median ratio

    Similar to CLR but bad for asymmetrical data and sparse data

    RPKM largely abandoned since it corrects for the wrong thing

    All the above Give outputs in “counts”, the wrong mindset

  • NEGATIVE CORRELATION BIAS• We know that ratio data (including data

    normalized by a sum) suffer from negative correlation bias.

    • Because a composition must sum to one, some creations must be negative and some must be positive even if no such correlation exists. 490 Prof. Karl Pearson.

    exist correlation between u and v. Thus a real danger arises when a statistical biologist attribntes the correlation between two functions like uand vto organic relationship. The particular case that iflikely to occur is when uand v are indices with the same denominator for the correlation of indices seems at first sight a very plausible measure of organic correlation.

    The difficulty and dang’er which arise from the use of indices was brought home to me recently in an endeavour to deal with a consider able series of personal equation data. In this case it was convenient to divide the errors made by three observers in estimating a variable quantity by the actual value of the quantity. As a result there appeared a high degree of correlation between three series of abso lutely independent judgments. It was some time before I realised that this correlation had nothing to do with the manner of judging, but was a special case of the above principle due to the use of indices.

    A further illustration is of the following kind. Select three num bers within certain ranges at random, say x, y, z, these will be pair and pair uncorrelated. Form the proper fractions xjy and zjy for each triplet, and correlation will be found between these indices.

    The application of this idea to biology seems of considerable importance. For example, a quantity of bones are taken from an ossuarium, and are put together in groups, which are asserted to be those of individual skeletons. To test this a biologist takes the triplet femur, tibia, humerus, and seeks the correlation between the indices femur / humerus and tibia / humerus. He might reasonably conclude that this correlation marked organic relationship, and believe that the bones had really been put together substantially in their individual grouping. As a matter of fact, since the coefficients of variation for femur, tibia, and humerus are approximately equal, there would be, as we shall see later, a correlation of about 0'4 to 0'*> between these indices had the bones been sorted absolutely at random. I term this a spurious organic correlation, or simply a spurious correlation. I understand by this phrase the amount of correlation which would still exist between the indices, were the absolute lengths on which they depend disti’ibuted at random.

    It has hitherto been usual to measure the organic correlation of the organs of shrimps, prawns, crabs, Ac., by the correlation of indices in which the denominator represents the total body length or total cara pace length. Now suppose a table formed of the absolute lengths and the indices of, say, some thousand individuals. Let an “ imp ” (allied to the Maxwellian demon) redistribute the indices at random, they would then exhibit no correlation; if the corresponding absolute lengths followed along with the indices in the redistribution, they also would exhibit no correlation. Now let us suppose the indices not to have been calculated, but the imp to redistribute the abso-

    Pearson, 1896

  • CORRELATION EXAMPLE

    Bacteria195011

    Bacteria1B

    879362

    Bacteria2

    62763

    Bacteria3

    236274

    Bacteria4

    1219595

    Unidentified

    1205361373771166881655192528910

    19747965891046217575115306759020020110291602928417

    7517998807477188167542191681212124802802416994

    5420456493956034465749962164249099898595114189388

    302106681620188674392966865704480262980378119099

    20845431983084106192945143280248776245182224279220214416412829

    Correlated 0.998Randomly sampled from log normal distribution logmean = 10 (13 for unidentified) log SD=1

  • CORRELATION EXAMPLECorrelated 0.998

    Randomly sampled from log normal distribution logmean = 10 (13 for unidentified) log SD=1

    Divide each sample by the sumBacteria10.00418

    Bacteria1B

    0.17

    Bacteria2

    0.00857

    Bacteria3

    0.00756

    Bacteria4

    0.182

    Unidentified

    0.01360.0420.02870.0570.0494

    0.008690.1870.01430.005630.1720.008540.06120.02720.05520.0555

    0.03310.01910.01020.006020.01120.02160.03710.006110.09660.0332

    0.02380.1090.1310.01430.07440.01850.2780.2430.03930.0183

    0.01330.1290.2760.02380.1440.06420.02450.007340.0130.0373

    0.9170.3840.5610.9430.4170.8740.5570.6880.7390.806

  • CORRELATION EXAMPLE

    −1

    −0.8

    −0.6

    −0.4

    −0.2

    0

    0.2

    0.4

    0.6

    0.8

    1

    Bacteria1

    Bacteria1B

    Bacteria2

    Bacteria3

    Bacteria4

    Unidentified

    0.99

    −0.04

    0.04

    0.28

    −0.78

    −0.03

    0.1

    0.28

    −0.82

    −0.18

    −0.37

    0.14

    0.04

    −0.52 −0.58

    All samples Drop Unidentified

    −1

    −0.8

    −0.6

    −0.4

    −0.2

    0

    0.2

    0.4

    0.6

    0.8

    1

    Bacteria1

    Bacteria1B

    Bacteria2

    Bacteria3

    Bacteria4

    0.95

    −0.12

    −0.53

    −0.3

    −0.05

    −0.46

    −0.42

    −0.3

    −0.27 −0.41

  • CORRELATION EXAMPLECorrelation of Aitchison composition Drop Unidentified

    −1

    −0.8

    −0.6

    −0.4

    −0.2

    0

    0.2

    0.4

    0.6

    0.8

    1

    Bacteria1

    Bacteria1B

    Bacteria2

    Bacteria3

    Bacteria4

    Unidentified

    0.99

    −0.04

    0.04

    0.28

    −0.78

    −0.03

    0.1

    0.28

    −0.82

    −0.18

    −0.37

    0.14

    0.04

    −0.52 −0.58−1

    −0.8

    −0.6

    −0.4

    −0.2

    0

    0.2

    0.4

    0.6

    0.8

    1

    Bacteria1

    Bacteria1B

    Bacteria2

    Bacteria3

    Bacteria4

    0.99

    −0.04

    0.04

    0.28

    −0.03

    0.1

    0.28

    −0.18

    −0.37 0.04

  • DISTANCE AND DISSIMILARITY METRICSTypical metrics• Unifrac• Bray-Curtis• Jensen-Shannon

    Bray-Curtis Aitchison (Correct metric)

    Better metrics• Unifrac -> PhILR• Others -> Aitchison

    Bacteria3

    Bacteria1B

    Bacteria1

    Bacteria4

    Unidentified

    Bacteria2

    Bacteria3

    Bacteria1B

    Bacteria1

    Bacteria4

    Unidentified

    Bacteria2

    Unidentified

    Bacteria2

    Bacteria3

    Bacteria4

    Bacteria1

    Bacteria1B

    Unidentified

    Bacteria2

    Bacteria3

    Bacteria4

    Bacteria1

    Bacteria1B

    https://doi.org/10.7554/eLife.21887

  • PCA ORDINATION ISSUES

    −0.6 −0.4 −0.2 0.0 0.2 0.4

    −0.6

    −0.4

    −0.2

    0.0

    0.2

    0.4

    Comp.1

    Com

    p.2

    1

    2

    3

    4

    5

    6

    78

    9

    10

    −0.6 −0.4 −0.2 0.0 0.2 0.4

    −0.6

    −0.4

    −0.2

    0.0

    0.2

    0.4

    Bacteria1Bacteria1B

    Bacteria2

    Bacteria3

    Bacteria4

    Unidentified

    −0.6 −0.4 −0.2 0.0 0.2 0.4

    −0.6

    −0.4

    −0.2

    0.0

    0.2

    0.4

    Comp.1

    Com

    p.2

    1

    2

    3

    4

    5

    6

    78

    9

    10

    −3 −2 −1 0 1 2

    −3−2

    −10

    12

    Bacteria1Bacteria1B

    Bacteria2

    Bacteria3

    Bacteria4

    Unidentified

    Untransformed Transformed

  • UNDERSTANDING COMPOSITIONS

  • SIMPLEX AS A NUMBER SPACE• The sum does not matter

    • The components are a proportion of some whole

    • The number space is Simplex

    • Examples:• Composition of soil [silt, sand, clay]• NGS Sequencing data• Time use surveys

    Ternary soil plot

  • THE SIMPLEX NUMBER SPACE

    Key arithmetic operations can be defined:• Closure

    • Perturbation

    • Powering

    Key requirements for simplex operations• Sub-compositionally coherent• Order invariant• Scale invariant

  • TRANSFORMATIONS FROM SIMPLEXThe simplex can be transformed into real space using Log ratio transformations

    Additive Log Ratio (ALR)

    Centered Log ratio (CLR)

    Isometric Log Ratio (ILR)

    Where geometric mean is

    Working in transformed space allows us to use many

    existing statistical tools

  • ISOMETRIC LOG RATIOUseful because its isometric

    or

    DC to Tel Aviv is always 6400 km

    log-ratio analysis algebraic-geometric structure C-coordinates balances

    circles and ellipses

    -2

    -1

    0

    1

    2

    3

    4

    -2 -1 0 1 2 3

    in S3 coordinate representation

    log-ratio analysis algebraic-geometric structure C-coordinates balances

    parallel lines

    x2

    x1

    x3

    n

    -4

    -2

    0

    2

    4

    -4 -2 0 2 4

    in S3 coordinate representation

    Useful because points are now independent

  • ISOMETRIC LOG RATIO

    So what is Phi?Binary partitionsCLR X

  • ISOMETRIC LOG RATIO

    So what is Phi?Binary partitions1 -1CLR X

  • ISOMETRIC LOG RATIO

    So what is Phi?Binary partitions1 -1

    1 -1

    CLR X

  • ISOMETRIC LOG RATIO

    So what is Phi?Binary partitions1 -1

    1 -1

    CLR X

  • ISOMETRIC LOG RATIO

    So what is Phi?Binary partitions1 -1

    1 -1

    CLR X

    Dot Product

    ILR

  • ISOMETRIC LOG RATIO

    20

    40

    60

    80

    100

    20

    40

    60

    80

    100

    20 40 60 80 100

    Bacteria1BBacteria1

    Bacteria2

    Bacteria1B

    Bacteria1 Bacteria2

  • GENERAL CODA SOFTWARE

    Microbiome specific software covered later

    CoDaPack• From the “Official”

    CoDa group in Girona

    • Compositions• Compositional• zCompositions• coda.base

    Scikit bio • skbio.stats.composition

  • A COMPOSITIONAL WORKFLOW FOR METAGENOMICS

  • Normalization Distance Ordination Multivariate comparison CorrelationDifferential abundance

    Old Rarefaction DESeqBray-Curtis

    UnifracPcoA

    (abundance)perMANOVA

    ANOSIMPearson

    SpearmanEDGER DESeq

    CompositionalCLR ALR ILR

    Aitchisonphilr PCA (variance)

    perMANOVA ANOSIM

    SparCCSpiecEasi

    propr ANCOM ALDEx2

    THERE IS ANOTHER WAY

    From Gloor 2017

  • WORKED EXAMPLEBacteria195011

    Bacteria1B

    879362

    Bacteria2

    62763

    Bacteria3

    236274

    Bacteria4

    1219595

    Unidentified

    1205361373771166881655192528910

    19747965891046217575115306759020020110291602928417

    7517998807477188167542191681212124802802416994

    5420456493956034465749962164249099898595114189388

    302106681620188674392966865704480262980378119099

    20845431983084106192945143280248776245182224279220214416412829

    Preprocessing1. Close the dataset (all reads)2. Create a sub-composition

  • INPUTE ZEROS

    So many reasons for a zero…1. Structural zeros, a real zero2. Missing values3. Below the limit of detection

    So many ways to handle zeros…1. pseudo-counts ( add one to everything)2. Multiplicative replacement3. Mean value replacement4. Expectation Maximization

    It is often helpful to drop components (taxa) that appear in

  • PCA ORDINATION AND DISTANCEQiime 2 package DeicodeDeicode takes a count matrix with zeros• Calculates CLR on nonzero values• Calculates Aitchson distance• Performs robust PCA using Singular

    value decomposition

    In RWith compositions package computecar using “clr(acomp())”Distance using “dist, method=“aitchison” function Compute pca using “pca” function

  • DIVERSITY INDICES AND CURVES

    −0.6 −0.4 −0.2 0.0 0.2 0.4

    −0.6

    −0.4

    −0.2

    0.0

    0.2

    0.4

    Comp.1

    Com

    p.2

    1

    2

    3

    4

    5

    6

    78

    9

    10

    −0.6 −0.4 −0.2 0.0 0.2 0.4

    −0.6

    −0.4

    −0.2

    0.0

    0.2

    0.4

    Bacteria1Bacteria1B

    Bacteria2

    Bacteria3

    Bacteria4

    Unidentified

    −0.6 −0.4 −0.2 0.0 0.2 0.4

    −0.6

    −0.4

    −0.2

    0.0

    0.2

    0.4

    Comp.1

    Com

    p.2

    1

    2

    3

    4

    5

    6

    78

    9

    10

    −3 −2 −1 0 1 2

    −3−2

    −10

    12

    Bacteria1Bacteria1B

    Bacteria2

    Bacteria3

    Bacteria4

    Unidentified

  • DIVERSITY INDICES AND DISTANCEBray-Curtis Aitchison

    Bacteria3

    Bacteria1B

    Bacteria1

    Bacteria4

    Unidentified

    Bacteria2

    Bacteria3

    Bacteria1B

    Bacteria1

    Bacteria4

    Unidentified

    Bacteria2Unidentified

    Bacteria2

    Bacteria3

    Bacteria4

    Bacteria1

    Bacteria1B

    Unidentified

    Bacteria2

    Bacteria3

    Bacteria4

    Bacteria1

    Bacteria1B

  • ANCOM (Python)• One of the first• Does pairwise tests in

    CLR space• In Qiime 2

    STATISTICAL INFERENCE METHODSALDEx2 (R)• Best general use• Dirichlet used to

    model variance • Walsh and KS test• effect sizes & p-values

    MirKat-Coda (R)• Kernel machine

    regression• Identifies microbial

    subsets that explain multivariate data

    MixOmics• A suite of tools • Partial least squares

    discriminate analysis

    Gneiss• Balance trees

  • ANCOM (Python)• One of the first• Does pairwise tests in

    CLR space• In Qiime 2

    STATISTICAL INFERENCE METHODSALDEx2 (R)• Best general use• Dirichlet used to

    model variance • Walsh and KS test• effect sizes & p-values

    MirKat-Coda (R)• Kernel machine

    regression• Identifies microbial

    subsets that explain multivariate data

    MixOmics• A suite of tools • Partial least squares

    discriminate analysis

    Gneiss• Balance trees

    DIY!Do your own linear modeling on CLR

    data

  • ALDEX2

    ALDEx2

    ylab="Difference")aldex.plot(x.all, type="MW", test="welch", xlab="Dispersion",

    ylab="Difference")

    −2 0 2 4 6 8

    05

    1015

    Log−ratio abundance

    Diff

    eren

    ce

    1 2 3 4

    05

    1015

    DispersionD

    iffer

    ence

    Figure 1: MA and Effect plots of ALDEx2 output

    The left panel is an Bland-Altman or MA plot that shows the relationship between abundance and Di�er-

    ence. The right panel is an e�ect plot that shows the relationship between Di�erence and Dispersion. In

    both plots features that are not significant are in grey or black. Features that are statistically significant are

    in red. The Log-ratio abundance axis is the clr value for the feature.

    5 Using ALDEx2 modules

    The modular approach exposes the underlying intermediate data so that users can generatetheir own tests. The simple approach outlined above just calls aldex.clr, aldex.ttest,aldex.effect in turn and then merges the data into one object. We will show these modulesin turn, and then examine additional modules.

    5.1 The aldex.clr module

    The workflow for the modular approach first generates random instances of the centredlog-ratio transformed values. There are three inputs: counts table, a vector of conditions, thenumber of Monte-Carlo instances, a string indicating if iqlr, zero or all feature are used as thedenominator is required, and level of verbosity (TRUE or FALSE). We recommend 128 ormore mc.samples for the t-test, 1000 for a rigorous e�ect size calculation, and at least 16 forANOVA.

    This operation is fast.

    x

  • RESOURCESAitchison, J. (John), 2003. A concise guide to compositional data analysis. Lecture notes. url: http://www.compositionaldata.com/material/others/Ait2003_A_concise_guide_to_compositional_data_analysis.pdfAitchison, J. (John), 2003. The statistical analysis of compositional data. Blackburn Press.Calle, M.L., 2019. Statistical Analysis of Metagenomics Data. Genomics Inform. 17, e6. doi:10.5808/

    GI.2019.17.1.e6Fernandes, A.D., Vu, M.T.H.Q., Edward, L.-M., Macklaim, J.M., Gloor, G.B., 2018. A reproducible effect

    size is more useful than an irreproducible hypothesis test to analyze high throughput sequencing datasets.

    Gloor, G.B., Macklaim, J.M., Pawlowsky-Glahn, V., Egozcue, J.J., 2017. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 8, 2224. doi:10.3389/fmicb.2017.02224

    Gloor, G.B., Reid, G., 2016. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Can. J. Microbiol. 62, 692–703. doi:10.1139/cjm-2015-0821

    Macklaim, J.M., Gloor, G.B., 2018. From RNA-seq to Biological Inference: Using Compositional Data Analysis in Meta-Transcriptomics. Humana Press, New York, NY, pp. 193–213. doi:10.1007/978-1-4939-8728-3_13

    Mandal, S., Van Treuren, W., White, R.A., Eggesbø, M., Knight, R., Peddada, S.D., 2015. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 26, 27663. doi:10.3402/mehd.v26.27663

    Pawlowsky-Glahn, V., Buccianti, A., 2011. Compositional data analysis : theory and applications. Wiley.Rivera-Pinto, J., Egozcue, J.J., Pawlowsky–Glahn, V., Paredes, R., Noguera-Julian, M., Calle, M.L., 2017.

    Balances: a new perspective for microbiome analysis. bioRxiv 219386. doi:10.1101/219386

    MINI REVIEWpublished: 15 November 2017

    doi: 10.3389/fmicb.2017.02224

    Frontiers in Microbiology | www.frontiersin.org 1 November 2017 | Volume 8 | Article 2224

    Edited by:

    Jessica Galloway-Pena,

    University of Texas MD Anderson

    Cancer Center, United States

    Reviewed by:

    Ionas Erb,

    Centre for Genomic Regulation, Spain

    Jennifer Stearns,

    McMaster University, Canada

    *Correspondence:

    Gregory B. Gloor

    [email protected]

    Specialty section:

    This article was submitted to

    Systems Microbiology,

    a section of the journal

    Frontiers in Microbiology

    Received: 10 July 2017

    Accepted: 30 October 2017

    Published: 15 November 2017

    Citation:

    Gloor GB, Macklaim JM,

    Pawlowsky-Glahn V and Egozcue JJ

    (2017) Microbiome Datasets Are

    Compositional: And This Is Not

    Optional. Front. Microbiol. 8:2224.

    doi: 10.3389/fmicb.2017.02224

    Microbiome Datasets AreCompositional: And This Is NotOptionalGregory B. Gloor 1*, Jean M. Macklaim1, Vera Pawlowsky-Glahn2 and Juan J. Egozcue3

    1 Department of Biochemistry, University of Western Ontario, London, ON, Canada, 2 Departments of Computer Science,

    Applied Mathematics, and Statistics, Universitat de Girona, Girona, Spain, 3 Department of Applied Mathematics, Universitat

    Politècnica de Catalunya, Barcelona, Spain

    Datasets collected by high-throughput sequencing (HTS) of 16S rRNA gene amplimers,

    metagenomes or metatranscriptomes are commonplace and being used to study human

    disease states, ecological differences between sites, and the built environment. There is

    increasing awareness that microbiome datasets generated by HTS are compositional

    because they have an arbitrary total imposed by the instrument. However, many

    investigators are either unaware of this or assume specific properties of the compositional

    data. The purpose of this review is to alert investigators to the dangers inherent in

    ignoring the compositional nature of the data, and point out that HTS datasets derived

    from microbiome studies can and should be treated as compositions at all stages of

    analysis. We briefly introduce compositional data, illustrate the pathologies that occur

    when compositional data are analyzed inappropriately, and finally give guidance and point

    to resources and examples for the analysis of microbiome datasets using compositional

    data analysis.

    Keywords: microbiota, compositional data, high-throughput sequencing, correlation, Bayesian estimation, count

    normalization, relative abundance

    1. INTRODUCTION

    The collection and analysis of microbiome datasets presents many challenges in the study design,sample collection, storage, and sequencing phases, and these have been well reviewed (Robinsonet al., 2016). Many methods for the analysis of microbiome datasets assume that sequencingdata are equivalent to ecological data where the counts of reads assigned to organisms are oftennormalized to a constant area or volume. Methods applied include count-based strategies such asBray-Curtis dissimilarity, zero-inflated Gaussianmodels and negative binomial models (McMurdieand Holmes, 2014; Weiss et al., 2017).

    In an ecological study it is possible for many different species to co-exist, and their absoluteabundance may be important. For example, in an area containing only tigers, it is importantto know if the population size is sufficient to maintain needed genetic diversity for long-termsurvival (Shaffer, 1981). However, the abundance of one species may not influence the abundanceof another; the area may contain both tigers and ladybugs, and the migration of several ladybugsinto the area would not be expected to affect the number of tigers.

    The assumption of true independence can not hold in high-throughput sequencing (HTS)experiments because the sequencing instruments can deliver reads only up to the capacity of theinstrument. Thus, it is proper to think of these instruments as containing a fixed number of slots

  • THANKS

    CodaCourse Staff University of Girona (UdG) 

    Agricultural Research Service