Marker Gene Analysis: Best Practices

Susan Huse, Marine Biological Laboratory / Brown University

October 17, 2012

Cleaning Data

Filtering: remove reads that are likely to be low-quality overall and have errors throughout the read.

Quality Trimming: trim off nucleotides from the end(s) of the read based on local quality values.

Denoising: Adjust nucleotides that are more likely to be an error in base-calling (noise) than a true low-frequency variation (signal)

Anchor Trimming: trim the end of long amplicons to a conserved location in the SSU alignment

Chimera Removal: remove hybrid sequences created during amplification

Recommended 454 Filtering

•  Exact match to barcode and proximal primer

•  Optional denoising (currently only 454)

•  Remove sequences

–  with Ns

–  that are too short

–  Below average or window quality threshold

•  Trim to distal primer or anchor

–  Remove sequences without anchor / primer
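The filters listed above can be sketched as a single pass/fail check per read. This is a minimal illustration, not the pipeline used in the talk: the function name `passes_filter` and every threshold (`min_len`, `min_avg_q`, `window`, `min_window_q`) are assumptions chosen for the example, and trimming to the distal primer or anchor is left out.

```python
from statistics import mean

def passes_filter(seq, quals, barcode, primer,
                  min_len=50, min_avg_q=30, window=50, min_window_q=25):
    """Return True if a read survives the filters listed above.

    All thresholds are illustrative defaults, not values from the talk.
    """
    # Exact match to barcode and proximal primer at the start of the read.
    if not seq.startswith(barcode + primer):
        return False
    # Remove sequences with Ns.
    if "N" in seq.upper():
        return False
    # Remove sequences that are too short.
    if len(seq) < min_len:
        return False
    # Remove sequences below the average quality threshold.
    if mean(quals) < min_avg_q:
        return False
    # Remove sequences in which any sliding window falls below the window threshold.
    for start in range(0, max(1, len(quals) - window + 1)):
        if mean(quals[start:start + window]) < min_window_q:
            return False
    return True
```

The window check matters because a read whose first 200 bases score 40 and whose last 50 score 10 still averages 34, so it passes the average filter and is only caught by the window.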

SSU rRNA Anchor Trimming

Next-gen sequences often do not reach the distal primer, and reads may have a range of lengths.

De novo OTU clustering and other sequence comparisons are more consistent if all tags are trimmed to the same start and stop positions in the rRNA alignment.

Anchor trimming uses a highly conserved location situated within the read length and truncates all reads to that position. Be careful that the anchor is unique and present across all taxa.
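A minimal sketch of anchor trimming, assuming an exact-match anchor string and assuming the anchor itself is kept at the end of the trimmed read (the talk does not specify either). The function name `anchor_trim` is hypothetical; a production tool would more likely tolerate mismatches or trim by alignment position.

```python
def anchor_trim(seq, anchor):
    """Truncate a read at the end of the first occurrence of `anchor`.

    Returns None when the anchor is absent, so the caller can discard the read.
    """
    pos = seq.find(anchor)
    if pos == -1:
        return None
    return seq[:pos + len(anchor)]


# Reads of different lengths end at the same position after trimming.
reads = ["AAACCCGGGTTT", "AAACCCGGGT", "AAACCC"]
trimmed = [anchor_trim(r, "GGG") for r in reads]
```

Because `find` returns the first occurrence, an anchor that also appears earlier in some taxa would silently trim those reads short, which is why the anchor needs to be unique.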

[Figure: Quality Scores for Error Positions, Untrimmed Data. x-axis: Quality Score (0 to 40); y-axis: Cumulative Percent of Errors (0% to 100%).]

80% of error bases have a quality score <=16

Before trimming, most errors have low Q scores

An Illumina HiSeq Error Distribution

HiSeq Reads with Ns

NTAGCACCAAACATAAATCACCTCACTTAAGTGGCTGGAGACAAATAATCTCTTTAATAACCTGATTCAGCGAAACCAATCCGCGGCATTTAGTAGCGGTA
NTAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATG
NGCGCCAATATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTGATGAACTAAGTCAACCTCAGCACTAACCTTGCGAGTCATTTCTTTGATTTGGTCAT
NGTAAAAATGTCTACAGTAGAGTCAATAGCAAGGCCACGACGCAATGGAGAAAGACGGAGAGCGCCAACGGCGTCCATCTCGAAGGAGTCGCCAGCGATAA
NTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTC
CAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTNTNNNNNAATNNNNNNNNNNNNNNNNNNNNNNNCANNNNNTNGNNNNANNNNNTTGAGTGTGAGGT
CGGATTGTTCAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTNAACANNNNNNNNNNNNNATAGTAATCCACGCTCTTNTAANATGTCAACAAGAG
TATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGAATTTTACCAATGACCANNNCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAG
TAGAAGTCGTCATTTGGCGAGAAAGCTCAGTCTCAGGAGGAAGCGGAGCAGTCCAAANNNTTTTGAGATGGCAGCAACGGAAACCATAACGAGCATCATCT
TGCTGTTGAGTGGTCTCATGACAATAAAGTATGTCNCTGNNTTGAAGNNTNNNNNNNNNNNNNNNCTNATACAATCACGCNCANNNNNAAAAGTGTCGTGT
CTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGNCTTANNNNNNNNNNNNTGGCGACCCTGTTTTGTATGGCANCTTGCCGCCGCGT
CGGCAGAAGCCTGAATGAGCTTAATAGAGGCCAAAGCGGTCTGGAAACGTACGGATTNNNNAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTGAA
GTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGNTGGTNNCNNNNNNNNNAAATTGTTTGGAGGCGGTCAAAANGCCGCCTCCGGTG
ATATCAACCACACCAGAAGCAGCATCAGTGACGACATTAGAAATATCCTTTGNAGTNNNNNNNNTATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTG

Illumina

In this dataset:

•  68 reads contained at least 1 N; of these:

–  14 (21%) could not be mapped to PhiX

–  7 of those 14 (50%) had only 1 N

–  24 (35%) contained more than 1 N

Minoche Filtering for Illumina

Minoche A, et al. 2011. Genome Biology 12: R112, using Beta vulgaris, Arabidopsis thaliana, and PhiX

•  Illumina Chastity (ChF)

•  Low-Quality (B) tails

•  Ns

•  <1/3 of nt Q<30 in 1st half

•  avgQ < 30 in 1st 30% of nt

Table 2: Expected error rates based on Q-scores (% of bases lost), comparing No filter, All filters, and Remaining Errors

[Figure: Quality Scores for Error Positions, Trimmed Data vs. Untrimmed Data. x-axis: Quality Score (0 to 40); y-axis: Pct of Errors (0% to 100%).]

Illumina

PCR errors?

QIIME Illumina Pipeline

•  Single mismatch to barcode

•  Trim read to last position above quality threshold q

•  Remove sequences less than length threshold p

•  Remove sequences with more than n Ns
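A literal reading of the trimming and removal bullets above can be sketched as follows. The parameter names q, p, and n follow the slide; the function name `trim_and_filter` and the default values are assumptions, and QIIME's actual implementation may define the trim point differently (barcode matching is not shown).

```python
def trim_and_filter(seq, quals, q=20, p=75, n=0):
    """Trim a read to its last position with quality above q, then apply
    the length (p) and N-count (n) filters. Returns None if rejected."""
    # Index just past the last base whose quality is above the threshold q.
    keep = 0
    for i, qual in enumerate(quals):
        if qual > q:
            keep = i + 1
    seq, quals = seq[:keep], quals[:keep]
    # Remove sequences shorter than the length threshold p.
    if len(seq) < p:
        return None
    # Remove sequences with more than n Ns.
    if seq.upper().count("N") > n:
        return None
    return seq, quals
```

For example, a 10 nt read whose last four bases have quality 5 is trimmed to 6 nt, and is then kept or rejected depending on p.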

Paired-End Filtering

Read 1 (forward)

Read 2 (reverse)

A small insert size allows for sequence overlap

Area of sequence overlap

Keep only reads that match exactly throughout the region of overlap. Amplicons designed to completely overlap (e.g., V6) ensure the highest quality sequences.
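For amplicons designed to overlap completely, the exact-match rule reduces to comparing read 1 with the reverse complement of read 2. The sketch below assumes primers and barcodes have already been removed and that both reads span the same amplicon; the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse-complement a DNA string (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_complete_overlap(read1, read2):
    """Return the shared sequence if read 1 matches the reverse complement
    of read 2 exactly over its full length, else None."""
    rc2 = reverse_complement(read2)
    if len(read1) != len(rc2) or read1 != rc2:
        return None
    return read1
```

Partial overlaps would need an extra step to locate the overlap region before comparing, which is not shown here.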

But Variation Still Exists

Is this:

1.  systematic bidirectional sequencing error (unlikely),

2.  PCR error, or

3.  natural variation?

[Figure: sequence logo (weblogo.berkeley.edu), positions 1 to 59, 5' to 3': E. coli K-12 V6 paired end with complete perfect overlap]

What are Chimeras and How do we find them?

PCR primer anneals to complementary target

Extension creates double-stranded amplicon

Premature dissociation terminates elongation

But…

The incomplete strand binds to a different template at a conserved region…

…then extends to create a chimera

The chimera can act as a template during the next PCR round.

Chimera Detection

1.  Look for the best match to the left (left parent)

2.  Look for the best match to the right (right parent)

3.  Compare the distance between the two parents: are they really different, or multiple entries for the same organism?

[Figure: a chimeric read aligned against Parent A and Parent B]
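The three steps can be illustrated with a toy breakpoint scan. This is only a sketch under strong assumptions: candidate parents are already aligned to the read and have the same length, similarity is plain per-position identity, and the 3% parent-divergence cutoff is an arbitrary illustrative value. It is not the algorithm used by UChime or Perseus.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def chimera_check(read, parents, min_parent_divergence=0.03):
    """Return (left_parent, right_parent, breakpoint) for the best two-parent
    model of `read`, or None if a single parent explains it as well or the
    two parents are too similar to tell apart."""
    best = None
    for bp in range(1, len(read)):
        # Steps 1 and 2: best match to the left and to the right of the breakpoint.
        left = max(parents, key=lambda p: identity(read[:bp], parents[p][:bp]))
        right = max(parents, key=lambda p: identity(read[bp:], parents[p][bp:]))
        score = (identity(read[:bp], parents[left][:bp]) * bp
                 + identity(read[bp:], parents[right][bp:]) * (len(read) - bp))
        if left != right and (best is None or score > best[0]):
            best = (score, left, right, bp)
    if best is None:
        return None
    score, left, right, bp = best
    # A single parent that matches as many positions as the two-parent model wins.
    if max(identity(read, s) for s in parents.values()) * len(read) >= score:
        return None
    # Step 3: are the two parents really different?
    if 1 - identity(parents[left], parents[right]) < min_parent_divergence:
        return None
    return left, right, bp
```

The source of `parents` is what distinguishes the two approaches discussed next: a reference set, or the other sequences from the same amplification.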

Detection methods differ by source of parents

1.  Reference Comparison: check against known reference sequences

2.  De novo detection: check all triplets in your amplification

Reference Comparison only as good as the Ref Set

•  Can only find parents if they are in the RefSet

•  Any chimeras in the Ref Set are deleterious!

•  Sparse RefSet may not detect chimeras from closely related organisms (intra-genera, intra-species)

•  Differential density of the Ref Set can create biases

•  Poor matches to the Ref Set can be mistaken for chimeras

•  Hard to detect if parents are similar, but may not matter

De Novo Pros and Cons

•  Can detect parents not in the RefSet: novel, close neighbors, PCR errors, unexpected amplifications

•  Must be run by amplification, i.e., by tube: all your parents, but only your parents

•  Abundance profile can be tricky with a long tail

•  Early False Positives (parent is lost to RefSet) and False Negatives (chimera added to RefSet) will affect downstream calls

We use both de novo and ref

[Figure: Percent Chimeric for Various Datasets (V6V4, V3V5). x-axis: Percent of Reads that are Chimeric (0% to 50%); y-axis: Percent of Datasets (0% to 70%).]

Rates of Chimera Formation in BPC Datasets, as a function of total reads, not unique sequences

Chimera detection programs optimized for short reads

•  UChime (in USearch, QIIME and VAMPS)

•  Perseus (in AmpliconNoise and mothur)

Aggregating

Taxonomic assignments will generally remain the same despite a few mismatches, more so at coarser taxonomic levels (class vs. genus).

OTU clustering can round out small percentages of errors, depending on the algorithm used. Clustering at 3% can (but does not always!) aggregate sequences with 1 – 2% errors.

“Aggregating” is not accepted terminology in the field

Downstream analytical techniques that compensate for inaccuracies in the remaining sequence data.

Taxonomic Filtering

In addition to the knowledge base associated with taxonomic names:

•  Can filter many unintended PCR amplification products.

•  Reads too far from the tree can be classified as “Unknown” and examined further.

•  Important to map reads to all domains, not just Bacteria: primers can amplify across domains and organelles

Amplification of other Domains

SSU region   Total Reads   Archaea   Bacteria   Organelle   Unknown
V6               529,359   0.02%     96%        4%          0.1%
V6-V4          3,437,855   0.3%      87%        8%          4%

Samples from Little Sippewissett Marsh. Organelles include mitochondria and chloroplasts.

Non SSU rRNA Amplification

Thank you, Hilary

DNA binding transcriptional dual regulator, tyrosine-binding

Predicted antibiotic transporter

Putative transport system permease protein

Predicted major pilin subunit

16S rRNA

16S rRNA

Conserved inner membrane protein cardiolipin synthase

Taxonomy

GAST: Global Alignment for Sequence Taxonomy

Uses sequence alignment to compare against a RefSet

Distance = alignment distance to nearest RefSet sequence

(SILVA, Greengenes, Stajich Refs, UNITE, HOMD, etc.) (VAMPS)

RDP: Ribosomal Database Project

Uses k-mer matching to find nearest genus

Bootstrap values reflect confidence in the assignment

(RDP Training set, Greengenes, etc.) (QIIME, VAMPS)

Sources of Error in Taxonomic Analyses

•  Primer bias

•  Chimeras

•  Discovery of novel 16S

•  Unrepresented in reference database

•  Low-quality references

•  Taxonomy not available

•  Incorrect taxonomy in RefSet

•  Ambiguous hypervariable sequence (>1 hit)

•  RefSets often biased toward most studied

Creating OTUs:

Operational Taxonomic Units for taxonomy independent analyses

OTUs vs Taxonomy

•  Novel organisms

•  Many unnamed organisms

•  Some clades only defined to phyla or class

•  Many species names based on phenotype rather than genotype

•  Do not lump together all 16S “unknowns” or diverse, partially classified sequences.

Clustering Algorithms

Different clustering algorithms can have very different effects on the size and number of OTUs created…

Clustering Methods

De novo (open)

•  greedy clusters - test sequentially and incorporate sequence into first qualifying OTU. Dependent on input order.

•  average linkage - the average distance from a sequence to every other sequence in the OTU is less than the width. Dependent on input order. [complete and single linkage are other methods]

Reference (closed)

•  greedy - map each sequence to representative sequences defining prebuilt clusters
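The order dependence of greedy de novo clustering is easy to see in a small sketch. Assumptions: sequences are pre-aligned and of equal length, distance is the fraction of mismatched positions, and each OTU is defined by its first (seed) sequence; the function names and the width value are illustrative only.

```python
def distance(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, width=0.03):
    """Assign each sequence to the first OTU whose seed is within `width`;
    otherwise start a new OTU with this sequence as its seed."""
    otus = []  # list of (seed, members)
    for s in seqs:
        for seed, members in otus:
            if distance(s, seed) <= width:
                members.append(s)
                break
        else:
            otus.append((s, [s]))
    return otus


a, b, c = "AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC"
one_otu = greedy_cluster([b, a, c], width=0.1)   # b is the seed; a and c both join it
two_otus = greedy_cluster([a, b, c], width=0.1)  # a is the seed; c is too far and starts a new OTU
```

The same three sequences yield one OTU or two depending only on which is read first.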

The Problem of OTU Inflation

De novo clustering algorithms return more OTUs than predicted for mock communities.

OTU inflation leads to:

•  alpha diversity inflation

•  beta diversity inflation

Where does this inflation come from?

•  residual sequencing errors

•  chimeras

•  multiple sequence alignments

•  clustering algorithms

[Figure: Rarefaction, Sample Size under OTU Inflation. M2FN Rarefaction - PML (PML MS-CL). x-axis: Number of Sequences Sampled (0 to 120,000); y-axis: OTUs (0 to 7,000); curves for sample sizes 5K, 10K, 15K, 20K, 50K, 100K.]

[Figure: Rarefaction, Sample Size with minimal OTU Inflation (PML SLP-PW-AL).]

Cluster to Reference

1.  Create a comprehensive set of Cluster Representatives (e.g., new Greengenes) representing the breadth of Bacteria

2.  Assign each sequence to a ClusterRep within distance <= W

3.  If a sequence is not a member of any cluster, set it aside

4.  Cluster de novo the set of extra-cluster sequences
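Steps 2 and 3 can be sketched as a nearest-representative assignment. Assumptions: sequences and representatives are pre-aligned and of equal length, distance is the fraction of mismatched positions, and a sequence goes to its nearest ClusterRep when that distance is within W. The function and variable names are hypothetical.

```python
def distance(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster_to_reference(seqs, cluster_reps, w=0.03):
    """Map each sequence to its nearest ClusterRep if within `w`.

    Returns (assigned, leftovers): a dict of ClusterRep ID -> member sequences,
    and the list of sequences set aside for de novo clustering (step 4).
    """
    assigned, leftovers = {}, []
    for s in seqs:
        rep_id, d = min(((rid, distance(s, rep)) for rid, rep in cluster_reps.items()),
                        key=lambda pair: pair[1])
        if d <= w:
            assigned.setdefault(rep_id, []).append(s)
        else:
            leftovers.append(s)
    return assigned, leftovers
```

Because each sequence is compared only with the fixed representatives, the result does not depend on input order and the Cluster IDs stay the same as new data are added.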

Advantages of clustering to full-length reference

•  Not as prone to OTU inflation

•  Can add new data as available

•  Provides static Cluster IDs

–  Can be used to compare short reads from different regions (V3-V5 and V6)

–  Can compare with other projects using same Ref Set

Oligotyping

•  Further differentiation within closely related organisms (e.g., genus)

•  Rather than blanket 3% clustering, select sequence positions with the most information (Shannon Entropy)
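Selecting positions by Shannon entropy can be sketched as below. This assumes reads are already aligned to equal length; the function names and the choice of `k` positions are illustrative, and a real oligotyping workflow involves further steps not shown here.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (bits) of the characters in one alignment column."""
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in Counter(column).values())

def most_informative_positions(aligned_reads, k=2):
    """Return the k highest-entropy alignment positions, in positional order."""
    entropies = [column_entropy(col) for col in zip(*aligned_reads)]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

def oligotype(read, positions):
    """Concatenate the bases found at the selected positions."""
    return "".join(read[i] for i in positions)


reads = ["ACGT", "ACGA", "ATGT", "ATGA"]
positions = most_informative_positions(reads, k=2)
types = Counter(oligotype(r, positions) for r in reads)
```

Columns where every read agrees have zero entropy and carry no information for separating closely related organisms, so only the variable columns define the oligotypes.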

[Figure: Fusobacterium oligotypes across oral sites: tonsils, subgingival plaque, supragingival plaque, saliva, tongue dorsum, throat, buccal mucosa, hard palate, keratinized gingiva]

“But I’m not interested in the rare biosphere, only the major players. Can’t I just remove the low abundance OTUs?”

Consistent community profile across samples and environments

[Figure: rank-abundance curves (x-axis: OTU Rank; y-axis: Count in OTU) for samples 700106784, 700038978, and 700023096]

A small number of highly abundant organisms

A large number of low abundance organisms: the Rare Biosphere

Sogin et al, 2006. Microbial diversity in the deep sea and the underexplored “rare biosphere” PNAS 103: 12115-12120

Distribution of OTU relative abundances across 210 HMP stool samples

Huse et al. (2012) PLoS ONE

[Figure: Distribution of OTU Absolute Abundances in English Channel Water Samples. x-axis: Frequency in PML Samples; y-axis: OTUs; abundance classes: Absent, Singleton, Doubleton, 3-5, 6-10, 11-50, 51-500, >500.]

Everything may not be everywhere,

but everything is rare somewhere!

If you feel you must remove low abundance OTUs, don’t do it until you have clustered ALL of your samples

Alpha and Beta Diversity:

Impacts of Sampling Depth and Diversity Algorithm

[Figure: Alpha Diversity - Richness. Axis ranges: 0 to 40,000 (x) and 0 to 1,800 (y); series: CL - ACE, SLP - ACE, CL - Chao, SLP - Chao, 1 in 5000, 1 in 2500, 1 in 1000, 1 in 500.]

Alpha diversity metrics are sensitive to cluster method, sequencing depth and rare OTUs
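The sensitivity to rare OTUs follows directly from how richness estimators are built. As an illustration, the bias-corrected Chao1 estimator depends only on the observed OTU count and the numbers of singletons and doubletons; which Chao variant was used for the figure is not stated, so this form is an assumption.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs."""
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


clean = [10, 5, 2, 1, 1, 1]       # 6 observed OTUs, 3 singletons, 1 doubleton
inflated = clean + [1, 1, 1]      # the same sample plus 3 spurious singleton OTUs
print(chao1(clean), chao1(inflated))  # 7.5 16.5
```

Three spurious singletons add three observed OTUs but raise the estimate by nine, which is why residual errors and clustering artifacts have an outsized effect on richness.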

[Figure: Sampling Depth and Alpha Diversity. x-axis: Sampling Depth (0 to 45,000); y-axis: Diversity (0 to 5); series: SLP - NPShannon, SLP - Simpson, CL - NPShannon, Simpson.]

Robust to both singletons and depth

Comparing Different Sampling Depths

The “population” is a set of 50,000 reads from one sample.

The “samples” are randomly selected subsets of sizes: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000.

Calculate diversity estimates across subsample depths that all represent the same population.
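The subsampling design can be sketched with the standard library. The depths come from the slide; the population below is a made-up stand-in in which each read is represented by an OTU label, and the function name `subsample` is hypothetical.

```python
import random
from collections import Counter

DEPTHS = [1000, 5000, 7500, 10000, 15000, 20000, 25000]  # depths from the slide

def subsample(reads, depth, seed=None):
    """Draw `depth` reads without replacement and tally them per OTU label."""
    rng = random.Random(seed)
    return Counter(rng.sample(reads, depth))


# Hypothetical 50,000-read population; each read is labeled with its OTU.
population = ["otu_a"] * 30000 + ["otu_b"] * 15000 + ["otu_c"] * 5000
subsamples = {d: subsample(population, d, seed=d) for d in DEPTHS}
```

Every subsample represents the same community, so any difference between diversity estimates at different depths reflects the metric and the depth rather than biology.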

[Figure: Community Distance of Subsamples. Subsamples of 1,000 and 5,000 reads from a sample of 50,000 reads; pairwise distances for replicates at a single depth. x-axis: Replicates; y-axis: Community Distance (0 to 0.12); series: Bray-Curtis (1K), Bray-Curtis (5K), Morisita Horn (1K), Morisita Horn (5K).]

[Figure: Effect of Sample Depth - Bray Curtis. Pairwise distances (scale 0.000 to 1.000) between subsample depths 1,000 to 25,000.]

Bray-Curtis uses absolute counts, so intra-community distances are high as depths diverge.

Nearly 100% Different

[Figure: Effect of Sample Depth - Morisita Horn. Pairwise distances (scale 0.000 to 0.009) between subsample depths 1,000 to 25,000.]

Morisita-Horn is a beta diversity metric that uses relative abundances and compensates for different sample sizes.

Distances are low across depths above the minimum sampling depth.

Nearly 0.5% Different
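The contrast between the two metrics can be reproduced with a small worked example. The sketch uses the standard count-based Bray-Curtis dissimilarity and one minus the Morisita-Horn similarity index; the two count vectors are made up to represent an identical community profile at depths of 1,000 and 25,000 reads, scaled exactly rather than randomly subsampled.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity on raw counts: sum|x_i - y_i| / sum(x_i + y_i)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))

def morisita_horn(x, y):
    """Morisita-Horn dissimilarity: 1 - 2*sum(x_i*y_i) / ((Dx + Dy) * X * Y),
    with Dx = sum(x_i^2) / X^2 and Dy = sum(y_i^2) / Y^2."""
    X, Y = sum(x), sum(y)
    dx = sum(a * a for a in x) / X ** 2
    dy = sum(b * b for b in y) / Y ** 2
    similarity = 2 * sum(a * b for a, b in zip(x, y)) / ((dx + dy) * X * Y)
    return 1 - similarity


shallow = [500, 300, 200]        # 1,000 reads
deep = [12500, 7500, 5000]       # 25,000 reads, same proportions
bc = bray_curtis(shallow, deep)      # 24000 / 26000, about 0.92
mh = morisita_horn(shallow, deep)    # 0 up to floating-point rounding
```

With identical proportions, Bray-Curtis reports the two samples as about 92% different purely because of depth, while Morisita-Horn reports essentially no difference.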

[Figure: PCoA, SLP Clustering and Bray-Curtis. Axes: PC 1 (-0.4 to 0.8) and PC 2 (-0.3 to 0.4); points by depth: 1,000; 2,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000; 30,000; 40,000.]

Bray-Curtis PCoA clusters entirely on depth (each point represents 10 atop one another)

[Figure: PCoA, SLP Clustering with Morisita Horn. Axes: PC 1 (-0.015 to 0.01) and PC 2 (-0.012 to 0.01); points by depth: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000.]

Minimum sample depth here of 10,000, but will be a function of the diversity of the sample

Acknowledgements

The Josephine Bay Paul Center for Comparative Molecular Biology and Evolution

Mitch Sogin

David Mark Welch

Hilary Morrison

Joe Vineis

A. Murat Eren

Anna Shipunova

Andy Voorhis

Sharon Grim

Why filter infrequent errors?

Ns          Average 454 Error Rate   Errors / 400nt   Percent of Reads
0 or more   0.40%                    1.6              100%
0           0.40%                    1.6              99.3%

If we include all reads with or without Ns, we have an overall error rate of 0.4%.

If, however, we remove the <1% of sequences with Ns, we still have an overall error rate of 0.4%.

Why bother??

454

Why filter infrequent errors?

Ns   Average Error Rate   Errors / 400nt   Percent of Reads
0    0.40%                1.6              99.3%
1    1.11%                3.1              0.57%
2    3.81%                8.7              0.1%
3    7.26%                16.5             0.0%
4    8.40%                19.2             0.0%
5    10.46%               25.1             0.0%

It’s not just improving the overall error rate, but removing spurious data

Low-quality reads can be interpreted as unique organisms: 0.7% of 500,000 reads = 3,500 “unique organisms”

454 Error Distribution

454 Errors are not evenly distributed among reads: Many reads have only a small number of errors, and a small number of reads have many errors

Distribution of errors in short reads (<100nt)

454

Most reads contain no errors at all

A good beginning can mask a bad end

If 450 nt read and first 350 nt average 35: if last 100 have an average of 25, avg qual = ((350*35) + (100*25)) / 450 ≈ 33

If 450 nt read and first 400 nt average 35: if last 50 have an average of 0, avg qual = ((400*35) + (50*0)) / 450 ≈ 31

Longer reads, pushing the limits

454 Filter Summary

Filter              Percent of Reads   Average Error Rate   Average Errors / 400 nt
N=0                 99%                0.40%                1.6
N>=1                1%                 0.91%                3.6
Exact Primer        95%                0.38%                1.5
Not Exact Primer    5%                 0.84%                3.4
Average Qual >=30   98%                0.90%                3.6
Average Qual <30    2%                 1.3%                 5.2

454

454 Filter Summary (cont)

Filter                        Percent of Reads   Average Error Rate   Average Errors / 400 nt
Read Length (500 - 600nt)     99+%               0.39%                1.6
Read Length (<500, >600 nt)   0.1%               1.8%                 7.2
Filtered                      93%                0.36%                1.4
Unfiltered                    7%                 0.64%                2.6

454

Evaluating Chimeras (USearch)

Query Parent A

Parent B

Diffs: A,B: Q matches expected P; a,b: Q matches other P; p: A=B!=Q

Votes: + for Model, 0 neutral, ! against Model

Model: shows extent of Parent A and Parent B; xxxx is overlap matching A&B

Click on the bar to see the alignment

Initial Length: 277

Extent of your sequence

Extent of your match

Check for left and right parents: BLAST the left (1 - 175), BLAST the right (175 - 277)

100% Match to Fusobacterium

100% Match to Pseudomonas

Taxonomic Names

•  Bergey’s Taxonomic Outline – manual of taxonomic names for bacteria

•  List of Prokaryotic names with Standing in Nomenclature (vetting process)

•  NCBI – similar taxonomy, but multiple “subs” (subclass, suborder, subfamily, tribe)

•  Archaea – a work in progress…

•  Fungi – another work in progress…

Cluster “Width”

Diameter: Sequences are never more than D apart. (CL)

Radius: Sequences are never more than R from seed. (SL, AL, Gr)

Average Linkage collapses errors

Cluster Count: 1

Clusters tend to be heavily dominated by their most abundant sequence, which strongly weights the average and smoothes the noise.

#1

Still lose outlier sequencing errors

Multiple sequencing errors still not clustered

Inflation in Action: Multiple Sequence Alignment and Complete Linkage clustering

1,042 is a few more than the expected 2

Example MSA

18,156 sequences and 392 positions

Regardless of clustering algorithm, an MSA cannot fully align tags whose sequences are too divergent

Relative Inflation

Absolute number of errant OTUs will increase with sample size. Relative number of errant OTUs will decrease with sample complexity.

The Magical 3%

NOT!

3% SSU OTUs = Species and

6% SSU OTUs = Genera

Clustering Questions

•  How meaningful are clusters functionally?

•  When is a rare sequence truly rare and when is it an error?

•  Should it be included in an existing cluster or start its own?

•  How to place sequences if OTUs overlap?

•  What is the effect of residual low quality data or chimeras?

•  How sensitive are alpha and beta diversity estimates to clustering results?
