Compositionality and Sparseness in 16S rRNA data Anthony Fodor Associate Professor Bioinformatics and Genomics UNC Charlotte

Can we fairly compare high and low biomass samples?

VS.

Low abundance samples are inherently challenging to survey

Less abundant

Source: Hanna et al. - Comparison of culture and molecular techniques for microbial community characterization in infected necrotizing pancreatitis - J. Surgical Research 2014

Clearly, the sequencing of negative controls should be part of all of our pipelines.

Can we fairly compare samples with different numbers of sequences?

VS.

16S rRNA experiments are always compositional and often sparse

Compositional – because different samples have different numbers of sequences

Sparse – because there are many zeros in the spreadsheet

[Figure: OTU count spreadsheet, with axes labeled SAMPLES and OTUs]
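To make the two terms concrete, here is a minimal sketch on a made-up OTU table (the counts and sample names are hypothetical, not taken from the talk). It reports each sample's total number of sequences, converts counts to relative abundances, and measures the fraction of zero cells in the spreadsheet.

```python
# Hypothetical OTU count table: each key is a sample, each list position is an OTU.
otu_table = {
    "sample_1": [120, 0, 30, 0, 0],
    "sample_2": [4000, 15, 0, 0, 985],
    "sample_3": [0, 0, 9, 1, 0],
}


def relative_abundance(counts):
    """Divide each count by the sample total so the values sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def zero_fraction(table):
    """Fraction of all cells in the table that are zero."""
    cells = [c for counts in table.values() for c in counts]
    return sum(1 for c in cells if c == 0) / len(cells)


# Sequencing depth differs widely between samples (the compositional issue).
depths = {name: sum(counts) for name, counts in otu_table.items()}
proportions = {name: relative_abundance(c) for name, c in otu_table.items()}

# Many cells are zero (the sparseness issue).
sparsity = zero_fraction(otu_table)

print(depths)
print(sparsity)
```

Once the counts are turned into proportions, every sample sums to 1 and the original depths are no longer visible in the table, which is why they are worth recording separately.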

Compositionality is a well-studied problem in statistics, but remains challenging

Compositionality can introduce subtle artifacts into our dataset

Relative abundance

Problems include:

Inference may report a change in A and B even though, biologically, A and B have not changed.

The estimates of A and B depend on C. If C is a contaminant (or rRNA in an RNA-seq experiment), the values of A and B might not be appropriate.

A and B will appear correlated, but this is a statistical artifact.
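The following simulation is a rough sketch of these artifacts, not an analysis from the talk. All numbers are invented: A and B are drawn independently of each other, while a third taxon C varies over several orders of magnitude. The relative abundance of A shifts when only C changes, and the relative abundances of A and B end up strongly correlated even though their absolute counts are not.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# A is unchanged (5 counts), yet its relative abundance drops when C grows.
rel_a_low_c = 5 / (5 + 10 + 100)
rel_a_high_c = 5 / (5 + 10 + 1000)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples = 200

# Absolute counts: A and B are independent of each other.
a = [rng.randint(40, 60) for _ in range(n_samples)]
b = [rng.randint(40, 60) for _ in range(n_samples)]
# C spans roughly three orders of magnitude across samples.
c = [round(10 ** rng.uniform(1, 4)) for _ in range(n_samples)]

totals = [ai + bi + ci for ai, bi, ci in zip(a, b, c)]
rel_a = [ai / t for ai, t in zip(a, totals)]
rel_b = [bi / t for bi, t in zip(b, totals)]

abs_corr = pearson(a, b)          # close to zero: no real relationship
rel_corr = pearson(rel_a, rel_b)  # large: induced by the shared denominator

print(rel_a_low_c, rel_a_high_c)
print(abs_corr, rel_corr)
```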

The correlation issue has been considered by multiple groups…

The compositional nature of 16S rRNA data has led to controversies over analysis pipelines…

Notice that in all the above examples, the ratio of B/A is always 2, irrespective of what happens with taxon C.

\[
\frac{10}{5} = \frac{10/115}{5/115} = \frac{10/1015}{5/1015} = 2
\]

Normalization schemes can take advantage of working in ratio space
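As an illustration of working in ratio space, the sketch below reuses the slide's counts (A = 5, B = 10, C = 100 or 1000). The centered log-ratio (CLR) transform is used here only as one familiar example of a ratio-based scheme; the talk does not name a specific transform, so that choice is an assumption.

```python
import math


def relative_abundance(counts):
    """Divide each count by the sample total."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(counts):
    """Centered log-ratio: log of each count minus the mean log count.

    Requires strictly positive counts; sparse tables need some form of
    zero handling first (not shown here).
    """
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Counts for taxa A, B, C in two samples that differ only in C.
low_c = [5, 10, 100]
high_c = [5, 10, 1000]

results = []
for counts in (low_c, high_c):
    rel = relative_abundance(counts)
    ratio_b_over_a = rel[1] / rel[0]         # stays at 2
    transformed = clr(counts)
    log_ratio = transformed[1] - transformed[0]  # stays at log(2)
    results.append((rel[0], ratio_b_over_a, log_ratio))

for rel_a, ratio, log_ratio in results:
    print(rel_a, ratio, log_ratio)
```

The relative abundance of A differs between the two samples, while both the B/A ratio and the difference of CLR values are unaffected by C.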

Relative abundance

Cells in the spreadsheet with few counts are largely structured by sequencing depth

Source: Gevers et al. - The Treatment-Naive Microbiome in New-Onset Crohn’s Disease - Cell Host Microbe 2014
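A small simulation can show why low-count cells track sequencing depth. This is a sketch under invented assumptions (a single hypothetical community of 30 taxa with geometrically decreasing abundances), not a reanalysis of the Gevers et al. data: the same community is sequenced at two depths, and the rare taxa are only detected in the deeper run.

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical community: each taxon is half as abundant as the previous one.
n_taxa = 30
weights = [0.5 ** i for i in range(n_taxa)]


def sequence(depth):
    """Draw `depth` reads from the community and count reads per taxon."""
    reads = rng.choices(range(n_taxa), weights=weights, k=depth)
    return Counter(reads)


shallow = sequence(100)
deep = sequence(10_000)

# Number of taxa with at least one read in each run.
richness_shallow = len(shallow)
richness_deep = len(deep)

print(richness_shallow, richness_deep)
```

The underlying community is identical in both runs, so any difference in which cells are zero or near zero comes from depth alone.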

Ordination without normalization leads to a dependency on sequencing depth…

[Figure axes: Log10 (number of sequences); Bray-Curtis distance]
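The dependency is easy to reproduce with the standard Bray-Curtis formula. In this sketch, two hypothetical samples have exactly the same composition but a tenfold difference in depth: the distance on raw counts is large, while the distance on relative abundances is zero.

```python
def relative_abundance(counts):
    """Divide each count by the sample total."""
    total = sum(counts)
    return [c / total for c in counts]


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum |x_i - y_i| / sum (x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


# Same community, sequenced at 100 reads and at 1000 reads (made-up counts).
shallow = [10, 20, 70]
deep = [c * 10 for c in shallow]

raw_distance = bray_curtis(shallow, deep)
normalized_distance = bray_curtis(
    relative_abundance(shallow), relative_abundance(deep)
)

print(raw_distance, normalized_distance)
```

On raw counts, the distance here reflects nothing but the difference in the number of sequences.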

No normalization scheme eliminates the dependency on sequencing depth

No normalization scheme eliminates compositional dependencies

Bioinformatics pipelines for 16S rRNA might consider explicitly tracking the number of sequences per sample as a potential confounder…
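One way a pipeline could do this is sketched below. The table, the sample names, and the choice of richness as the downstream metric are all hypothetical: the point is only that log10 depth is stored next to each sample and checked against whatever quantity is being analyzed.

```python
import math

# Hypothetical OTU counts: each key is a sample, each list position is an OTU.
otu_table = {
    "s1": [5, 0, 0, 1, 0],
    "s2": [40, 3, 0, 7, 0],
    "s3": [300, 20, 1, 60, 0],
    "s4": [2500, 180, 9, 480, 2],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Keep depth alongside each sample so it can be used as a covariate later.
sample_info = {
    name: {
        "log10_depth": math.log10(sum(counts)),
        "richness": sum(1 for c in counts if c > 0),
    }
    for name, counts in otu_table.items()
}

log_depth = [info["log10_depth"] for info in sample_info.values()]
richness = [info["richness"] for info in sample_info.values()]

# A strong correlation flags depth as a possible confounder for this metric.
depth_vs_richness = pearson(log_depth, richness)
print(depth_vs_richness)
```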

Sequencing depth can be correlated with input variables of interest…

[Figure axes: Log10 (number of sequences) and NMDS1; Theta YC distance and difference in number of sequences]

Source: Baxter et al. - Structure of the gut microbiome following colonization with human feces determines colonic tumor burden - Microbiome 2014
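A pairwise comparison of this kind could be tabulated roughly as follows. The formula used for the Theta YC distance is the Yue and Clayton dissimilarity as it is commonly defined on relative abundances; the slide does not spell it out, so treat that as an assumption, and the sample counts are invented.

```python
from itertools import combinations


def relative_abundance(counts):
    """Divide each count by the sample total."""
    total = sum(counts)
    return [c / total for c in counts]


def theta_yc_distance(x, y):
    """1 - sum(a*b) / (sum(a^2) + sum(b^2) - sum(a*b)) on relative abundances."""
    a, b = relative_abundance(x), relative_abundance(y)
    ab = sum(p * q for p, q in zip(a, b))
    aa = sum(p * p for p in a)
    bb = sum(q * q for q in b)
    return 1 - ab / (aa + bb - ab)


# Hypothetical samples with different depths and compositions.
samples = {
    "s1": [50, 30, 20, 0],
    "s2": [500, 300, 200, 0],
    "s3": [10, 0, 40, 950],
}

# For every pair: difference in number of sequences and Theta YC distance.
pairs = []
for first, second in combinations(samples, 2):
    depth_difference = abs(sum(samples[first]) - sum(samples[second]))
    distance = theta_yc_distance(samples[first], samples[second])
    pairs.append((first, second, depth_difference, distance))

for row in pairs:
    print(row)
```

Plotting the third column against the fourth for real data gives the kind of check shown in the figure: if the two are related, depth is entangled with the community distances.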

[Figure: seven panel pairs sharing the same axes: Log10 (number of sequences) and NMDS1; Theta YC distance and difference in number of sequences]

Different normalization schemes can have very different consequences for inference.

Conclusions

No normalization scheme eliminates compositional dependencies (although some do better than others!).

Bioinformatics pipelines for 16S rRNA should explicitly track the number of sequences per sample as a potential confounding variable.

Just as no one statistical test is appropriate for all inference, there is likely no one normalization scheme that will be appropriate for all datasets.

Raad Z. Gharaibeh

(We thank Dirk Gevers for providing a parsable OTU table for the Risk data)

In any experiment, confounding variables can complicate inference.