Marker Gene Analysis Best Practices Susan Huse Marine Biological Laboratory / Brown University October 17, 2012

Marker Gene Analysis: Best Practices


Talk given by Susan Huse at the QIIME/VAMPS Workshop in Boulder, CO on October 17th, 2012.


Page 1: Marker Gene Analysis: Best Practices

Marker Gene Analysis Best Practices

Susan Huse

Marine Biological Laboratory / Brown University

October 17, 2012

Page 2: Marker Gene Analysis: Best Practices

Cleaning Data

Filtering: remove reads that are likely to be low-quality overall and have errors throughout the read.

Quality Trimming: trim off nucleotides from the end(s) of the read based on local quality values.

Denoising: Adjust nucleotides that are more likely to be an error in base-calling (noise) than a true low-frequency variation (signal)

Anchor Trimming: trim the end of long amplicons to a conserved location in the SSU alignment

Chimera Removal: remove hybrid sequences created during amplification

Page 3: Marker Gene Analysis: Best Practices

Recommended 454 Filtering

•  Exact match to barcode and proximal primer

•  Optional denoising (currently only 454)

•  Remove sequences

–  with Ns

–  that are too short

–  Below average or window quality threshold

•  Trim to distal primer or anchor

–  Remove sequences without anchor / primer
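The filtering steps above can be sketched in Python. This is a minimal illustration, not the pipeline used in the talk: the function name `passes_454_filter` and the default thresholds (`min_length=50`, `min_avg_qual=30`) are assumptions chosen for the example, and trimming to the distal primer or anchor is left out.

```python
from statistics import mean

def passes_454_filter(seq, quals, barcode, primer, min_length=50, min_avg_qual=30):
    """Return the read with barcode and primer removed if it passes, else None.

    All thresholds are illustrative assumptions, not recommended values.
    """
    prefix = barcode + primer
    # Require an exact match to the barcode and proximal primer.
    if not seq.startswith(prefix):
        return None
    insert = seq[len(prefix):]
    insert_quals = quals[len(prefix):]
    # Remove reads with ambiguous bases (Ns).
    if "N" in insert:
        return None
    # Remove reads that are too short (this also guards the mean below).
    if len(insert) < max(min_length, 1):
        return None
    # Remove reads below the average quality threshold.
    if mean(insert_quals) < min_avg_qual:
        return None
    return insert
```

A window-based quality check could replace the average here; the average is used only because it is the shortest to write.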

Page 4: Marker Gene Analysis: Best Practices

SSU rRNA Anchor Trimming

Next-gen sequences often do not reach to the distal primer, and reads may have a range of lengths.

De novo OTU clustering and other sequence comparisons are more consistent if all tags are trimmed to the same start and stop positions in the rRNA alignment.

Anchor trimming uses a highly conserved location situated within the read length and truncates all reads to that position. Be careful that the anchor is unique and present across all taxa.
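A minimal sketch of anchor trimming, assuming the anchor is an exact string and that reads are truncated at the end of the anchor. The function name `anchor_trim` is hypothetical, and a real anchor would usually need to tolerate some degeneracy.

```python
def anchor_trim(seq, anchor):
    """Truncate a read at the end of a conserved anchor sequence.

    Returns None when the anchor is absent or occurs more than once,
    since the anchor should be unique and present in every read kept.
    """
    if seq.count(anchor) != 1:
        return None
    end = seq.index(anchor) + len(anchor)
    return seq[:end]
```

Reads that return None correspond to "sequences without anchor" and would be removed.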

Page 5: Marker Gene Analysis: Best Practices

An Illumina HiSeq Error Distribution

Figure: Quality Scores for Error Positions (untrimmed data). X-axis: Quality Score (0 to 40); y-axis: Cumulative Percent of Errors (0% to 100%).

80% of error bases have a quality score <= 16.

Before trimming, most errors have low Q scores.

Page 6: Marker Gene Analysis: Best Practices

HiSeq Reads with Ns

NTAGCACCAAACATAAATCACCTCACTTAAGTGGCTGGAGACAAATAATCTCTTTAATAACCTGATTCAGCGAAACCAATCCGCGGCATTTAGTAGCGGTA
NTAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATG
NGCGCCAATATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTGATGAACTAAGTCAACCTCAGCACTAACCTTGCGAGTCATTTCTTTGATTTGGTCAT
NGTAAAAATGTCTACAGTAGAGTCAATAGCAAGGCCACGACGCAATGGAGAAAGACGGAGAGCGCCAACGGCGTCCATCTCGAAGGAGTCGCCAGCGATAA
NTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTC
CAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTNTNNNNNAATNNNNNNNNNNNNNNNNNNNNNNNCANNNNNTNGNNNNANNNNNTTGAGTGTGAGGT
CGGATTGTTCAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTNAACANNNNNNNNNNNNNATAGTAATCCACGCTCTTNTAANATGTCAACAAGAG
TATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGAATTTTACCAATGACCANNNCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAG
TAGAAGTCGTCATTTGGCGAGAAAGCTCAGTCTCAGGAGGAAGCGGAGCAGTCCAAANNNTTTTGAGATGGCAGCAACGGAAACCATAACGAGCATCATCT
TGCTGTTGAGTGGTCTCATGACAATAAAGTATGTCNCTGNNTTGAAGNNTNNNNNNNNNNNNNNNCTNATACAATCACGCNCANNNNNAAAAGTGTCGTGT
CTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGNCTTANNNNNNNNNNNNTGGCGACCCTGTTTTGTATGGCANCTTGCCGCCGCGT
CGGCAGAAGCCTGAATGAGCTTAATAGAGGCCAAAGCGGTCTGGAAACGTACGGATTNNNNAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTGAA
GTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGNTGGTNNCNNNNNNNNNAAATTGTTTGGAGGCGGTCAAAANGCCGCCTCCGGTG
ATATCAACCACACCAGAAGCAGCATCAGTGACGACATTAGAAATATCCTTTGNAGTNNNNNNNNTATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTG

Illumina

In this dataset, 68 reads contained at least 1 N. Of these:

•  14 (21%) could not be mapped to PhiX

•  7 of those 14 (50%) had only 1 N

•  24 (35%) contain more than 1 N

Page 7: Marker Gene Analysis: Best Practices

Minoche Filtering for Illumina

Minoche A, et al. 2011. Genome Biology 12: R112, using Beta vulgaris, Arabidopsis thaliana, and PhiX

Filters evaluated:

•  Illumina Chastity (ChF)

•  Low-Quality (B) tails

•  Ns

•  <1/3 of nt Q<30 in 1st half

•  avgQ < 30 in 1st 30% of nt

Table 2: Expected error rates based on Q-scores (% of bases lost), ranging from no filter to all filters

Page 8: Marker Gene Analysis: Best Practices

Remaining Errors

Figure: Quality Scores for Error Positions, trimmed data versus untrimmed data (Illumina). X-axis: Quality Score (0 to 40); y-axis: Pct of Errors (0% to 100%).

PCR errors?

Page 9: Marker Gene Analysis: Best Practices

QIIME Illumina Pipeline

•  Single mismatch to barcode

•  Trim read to last position above quality threshold q

•  Remove sequences less than length threshold p

•  Remove sequences with more than n Ns
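The four bullets above can be read literally as the following sketch. It is not QIIME's code: the function names and the default values for `q`, `p`, and `n` are assumptions made for the example.

```python
def barcode_matches(observed, expected, max_mismatches=1):
    """Accept a barcode with at most `max_mismatches` substitutions."""
    if len(observed) != len(expected):
        return False
    return sum(a != b for a, b in zip(observed, expected)) <= max_mismatches

def filter_illumina_read(seq, quals, q=20, p=75, n=0):
    """Trim and filter one read; return the trimmed read or None.

    q: quality threshold, p: minimum length, n: maximum number of Ns.
    The default values are illustrative assumptions.
    """
    # Find the last position whose quality is above q.
    last_good = -1
    for i, qual in enumerate(quals):
        if qual > q:
            last_good = i
    trimmed = seq[:last_good + 1]
    # Remove sequences shorter than the length threshold.
    if len(trimmed) < p:
        return None
    # Remove sequences with more than n Ns.
    if trimmed.count("N") > n:
        return None
    return trimmed
```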

Page 10: Marker Gene Analysis: Best Practices

Paired-End Filtering

Read 1 (forward)

Read 2 (reverse)

A small insert size allows for sequence overlap

Area of sequence overlap

Keep only reads that match exactly throughout the region of overlap. Amplicons designed to completely overlap (e.g., V6) ensure the highest quality sequences.
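A minimal sketch of the exact-overlap rule, assuming the overlap length is known in advance from the amplicon design. The function names are hypothetical; a real merger would search for the overlap rather than take it as a parameter.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement a DNA string (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_if_exact_overlap(read1, read2, overlap):
    """Merge a read pair only if the overlap region matches exactly.

    `overlap` is the expected overlap length in nucleotides (an assumption
    of this sketch). Returns the merged sequence, or None on any mismatch.
    """
    read2_rc = reverse_complement(read2)
    if overlap <= 0 or overlap > min(len(read1), len(read2_rc)):
        return None
    # Keep only pairs that agree at every position of the overlap.
    if read1[-overlap:] != read2_rc[:overlap]:
        return None
    return read1 + read2_rc[overlap:]
```

For a completely overlapping amplicon, `overlap` equals the read length and the merged sequence is simply the forward read.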

Page 11: Marker Gene Analysis: Best Practices

But Variation Still Exists

Is this:

1.  systematic bidirectional sequencing error (unlikely),

2.  PCR error, or

3.  natural variation?

Figure: sequence logo (weblogo.berkeley.edu) of positions 1 to 59, 5' to 3', for E. coli K-12 V6 paired-end reads with complete perfect overlap.

Page 12: Marker Gene Analysis: Best Practices

What are Chimeras and How do we find them?

Page 13: Marker Gene Analysis: Best Practices

The PCR primer anneals to the complementary target.

Extension creates a double-stranded amplicon.

But…

Premature dissociation terminates elongation.

The incomplete strand binds to a different template at a conserved region…

…then extends to create a chimera.

The chimera can act as a template during the next PCR round.

Page 14: Marker Gene Analysis: Best Practices

Chimera Detection

1.  Look for the best match to the left (left parent)

2.  Look for the best match to the right (right parent)

3.  Compare the distance between the two parents: are they really different, or multiple entries for the same organism?

Figure: a chimeric read aligned against Parent A and Parent B.
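The three steps can be sketched as follows. This is a toy illustration and not how UChime or Perseus work internally: it assumes the query and all candidate parents are pre-aligned and of equal length, and the names `find_chimera` and `min_parent_diffs` are hypothetical.

```python
def mismatches(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)

def find_chimera(query, references, min_parent_diffs=3):
    """Return (left_parent, right_parent, breakpoint) or None.

    `references` maps a name to an aligned sequence of the same length
    as `query` (an assumption of this sketch).
    """
    # Distance to the best single, non-chimeric reference.
    best_single = min(mismatches(query, ref) for ref in references.values())
    best = None
    for bp in range(1, len(query)):
        # Steps 1 and 2: best match to the left and to the right of bp.
        left = min(references, key=lambda n: mismatches(query[:bp], references[n][:bp]))
        right = min(references, key=lambda n: mismatches(query[bp:], references[n][bp:]))
        if left == right:
            continue
        score = (mismatches(query[:bp], references[left][:bp])
                 + mismatches(query[bp:], references[right][bp:]))
        if best is None or score < best[0]:
            best = (score, left, right, bp)
    if best is None:
        return None
    score, left, right, bp = best
    # Step 3: are the two parents really different from each other?
    if mismatches(references[left], references[right]) < min_parent_diffs:
        return None
    # The two-parent model must fit better than any single reference.
    if score >= best_single:
        return None
    return left, right, bp
```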

Page 15: Marker Gene Analysis: Best Practices

Detection methods differ by source of parents

1.  Reference Comparison: check against known reference sequences

2.  De novo detection: check all triplets in your amplification

Page 16: Marker Gene Analysis: Best Practices

Reference Comparison only as good as the Ref Set

•  Can only find parents if they are in the RefSet

•  Any chimeras in the Ref Set are deleterious!

•  Sparse RefSet may not detect chimeras from closely related organisms (intra-genera, intra-species)

•  Differential density of the Ref Set can create biases

•  Poor matches to the Ref Set can be mistaken for chimeras

•  Hard to detect if parents are similar, but may not matter

Page 17: Marker Gene Analysis: Best Practices

De Novo Pros and Cons

•  Can detect parents not in the RefSet: novel, close neighbors, PCR errors, unexpected amplifications

•  Must be run by amplification, i.e., by tube: all your parents, but only your parents

•  Abundance profile can be tricky with a long tail

•  Early False Positives (parent is lost to RefSet) and False Negatives (chimera is added to RefSet) will affect downstream calls

We use both de novo and ref

Page 18: Marker Gene Analysis: Best Practices

Rates of Chimera Formation in BPC Datasets

As a function of total reads, not unique sequences.

Figure: Percent Chimeric for Various Datasets, comparing V6V4 and V3V5. X-axis: Percent of Reads that are Chimeric (0% to 50%); y-axis: Percent of Datasets (0% to 70%).

Page 19: Marker Gene Analysis: Best Practices

Chimera detection programs optimized for short reads

•  UChime (in USearch, QIIME and VAMPS)

•  Perseus (in AmpliconNoise and mothur)

Page 20: Marker Gene Analysis: Best Practices

Aggregating

Downstream analytical techniques that compensate for inaccuracies in the remaining sequence data.

“Aggregating” is not accepted terminology in the field.

Taxonomic assignments will generally remain the same despite a few mismatches, more so at coarser taxonomic levels (class vs. genus).

OTU clustering can round out small percentages of errors, depending on the algorithm used. Clustering at 3% can (but does not always!) aggregate sequences with 1 – 2% errors.

Page 21: Marker Gene Analysis: Best Practices

Taxonomic Filtering

In addition to the knowledge base associated with taxonomic names:

•  Can filter many unintended PCR amplification products.

•  Reads too far from the tree can be classified as “Unknown” and examined further.

•  Important to map reads to all domains, not just Bacteria: primers can amplify across domains and organelles.

Page 22: Marker Gene Analysis: Best Practices

Amplification of other Domains

SSU region   Total Reads   Archaea   Bacteria   Organelle   Unknown
V6           529,359       0.02%     96%        4%          0.1%
V6-V4        3,437,855     0.3%      87%        8%          4%

Samples from Little Sippewissett Marsh. Organelles include mitochondria and chloroplasts

Page 23: Marker Gene Analysis: Best Practices

Non SSU rRNA Amplification

Thank you, Hilary

DNA binding transcriptional dual regulator, tyrosine-binding

Predicted antibiotic transporter

Putative transport system permease protein

Predicted major pilin subunit

16S rRNA

16S rRNA

Conserved inner membrane protein cardiolipin synthase

Page 24: Marker Gene Analysis: Best Practices

Taxonomy

GAST: Global Alignment for Sequence Taxonomy

Uses sequence alignment to compare against a RefSet. Distance = alignment distance to the nearest RefSet sequence. RefSets: SILVA, Greengenes, Stajich Refs, UNITE, HOMD, etc. Available in VAMPS.

RDP: Ribosomal Database Project

Uses k-mer matching to find the nearest genus. Bootstrap values reflect confidence in the assignment. Training sets: RDP Training set, Greengenes, etc. Available in QIIME and VAMPS.
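To make "k-mer matching with bootstrap values" concrete, here is a toy sketch. It is not the RDP Classifier: a simple shared k-mer count stands in for its scoring, the RefSet holds one sequence per genus, and the function names, `k=8`, the 1/8 subsample, and 100 replicates are assumptions of this example.

```python
import random

def kmers(seq, k=8):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def classify(query, refs, k=8, replicates=100, seed=0):
    """Return (genus, bootstrap_support) for a query sequence.

    `refs` maps a genus name to one reference sequence (a toy RefSet).
    """
    ref_kmers = {genus: set(kmers(seq, k)) for genus, seq in refs.items()}
    query_kmers = kmers(query, k)
    if not query_kmers:
        raise ValueError("query is shorter than k")

    def best_genus(sample):
        # Genus whose reference shares the most k-mers with the sample.
        return max(ref_kmers, key=lambda g: sum(km in ref_kmers[g] for km in sample))

    assignment = best_genus(query_kmers)
    # Bootstrap: re-classify random subsets of the query's k-mers and
    # report how often the subset agrees with the full assignment.
    rng = random.Random(seed)
    subset_size = max(1, len(query_kmers) // 8)
    hits = 0
    for _ in range(replicates):
        sample = [rng.choice(query_kmers) for _ in range(subset_size)]
        if best_genus(sample) == assignment:
            hits += 1
    return assignment, hits / replicates
```

A low bootstrap value signals that different parts of the read point to different genera, which is one reason to report the assignment at a coarser taxonomic level.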

Page 25: Marker Gene Analysis: Best Practices

Sources of Error in Taxonomic Analyses

•  Primer bias

•  Chimeras

•  Discovery of novel 16S

•  Unrepresented in reference database

•  Low-quality references

•  Taxonomy not available

•  Incorrect taxonomy in RefSet

•  Ambiguous hypervariable sequence (>1 hit)

•  RefSets often biased toward most studied

Page 26: Marker Gene Analysis: Best Practices

Creating OTUs:

Operational Taxonomic Units for taxonomy independent analyses

Page 27: Marker Gene Analysis: Best Practices

OTUs vs Taxonomy

•  Novel organisms

•  Many unnamed organisms

•  Some clades only defined to phyla or class

•  Many species names based on phenotype rather than genotype

•  Do not lump together all 16S “unknowns” or diverse, partially classified sequences.

Page 28: Marker Gene Analysis: Best Practices

Clustering Algorithms

Different clustering algorithms can have very different effects on the size and number of OTUs created…

Page 29: Marker Gene Analysis: Best Practices

Clustering Methods

De novo (open)

•  greedy clusters - test sequentially and incorporate sequence into first qualifying OTU. Dependent on input order.

•  average linkage - the average distance from a sequence to every other sequence in the OTU is less than the width. Dependent on input order. [complete and single linkage are other methods]

Reference (closed)

•  greedy - map each sequence to representative sequences defining prebuilt clusters
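A minimal sketch of greedy de novo clustering that also shows the dependence on input order. It assumes equal-length, pre-aligned tags and a simple mismatch fraction as the distance; the function names are hypothetical and this is not the code of any particular clustering tool.

```python
def fraction_different(a, b):
    """Fraction of mismatched positions; assumes equal-length, aligned tags."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, width=0.03):
    """Assign each sequence to the first OTU whose seed is within `width`."""
    otus = []  # each OTU is a list; its first member is the seed
    for seq in seqs:
        for otu in otus:
            if fraction_different(seq, otu[0]) <= width:
                otu.append(seq)
                break
        else:
            # No qualifying OTU: the sequence seeds a new one.
            otus.append([seq])
    return otus

a = "A" * 100
b = "C" * 3 + "A" * 97   # 3% from a
c = "C" * 6 + "A" * 94   # 6% from a, 3% from b

# The same three sequences give a different number of OTUs
# depending on which one is seen first.
otus_abc = greedy_cluster([a, b, c])
otus_bac = greedy_cluster([b, a, c])
```

With `a` first, `c` is too far from the seed and starts its own OTU; with `b` first, all three fall within 3% of the seed.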

Page 30: Marker Gene Analysis: Best Practices

The Problem of OTU Inflation

De novo clustering algorithms return more OTUs than predicted for mock communities.

OTU inflation leads to:

•  alpha diversity inflation

•  beta diversity inflation

Where does this inflation come from?

•  residual sequencing errors

•  chimeras

•  multiple sequence alignments

•  clustering algorithms

Page 31: Marker Gene Analysis: Best Practices

Rarefaction, Sample Size under OTU Inflation

Figure: M2FN Rarefaction - PML (PML MS-CL), with curves for subsamples of 5K, 10K, 15K, 20K, 50K, and 100K. X-axis: Number of Sequences Sampled (0 to 120,000); y-axis: OTUs (0 to 7,000).

Page 32: Marker Gene Analysis: Best Practices

Rarefaction, Sample Size with minimal OTU Inflation

PML SLP-PW-AL

Page 33: Marker Gene Analysis: Best Practices

Cluster to Reference

1.  Create a comprehensive set of Cluster Representatives (e.g., new Greengenes) representing the breadth of Bacteria

2.  Assign each sequence to a ClusterRep within distance W (distance <= W)

3.  If a sequence is not a member of any cluster, set it aside

4.  Cluster de novo the set of extra-cluster sequences
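Steps 2 and 3 can be sketched as below, with the leftovers handed to whatever de novo method is preferred for step 4. The sketch assumes equal-length, pre-aligned sequences and assigns each read to its nearest representative; the function names and the cluster IDs are hypothetical.

```python
def fraction_different(a, b):
    """Fraction of mismatched positions; assumes equal-length, aligned tags."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster_to_reference(seqs, cluster_reps, width=0.03):
    """Return (assignments, leftovers).

    `cluster_reps` maps a static cluster ID to its representative sequence.
    `assignments` maps each cluster ID to the sequences assigned to it;
    `leftovers` are the extra-cluster sequences to be clustered de novo.
    """
    assignments = {rep_id: [] for rep_id in cluster_reps}
    leftovers = []
    for seq in seqs:
        # Nearest representative (choosing the nearest is an assumption here).
        rep_id = min(cluster_reps, key=lambda r: fraction_different(seq, cluster_reps[r]))
        if fraction_different(seq, cluster_reps[rep_id]) <= width:
            assignments[rep_id].append(seq)
        else:
            leftovers.append(seq)
    return assignments, leftovers
```

Because the cluster IDs come from the reference set, they stay the same when new samples are added later.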

Page 34: Marker Gene Analysis: Best Practices

Advantages of clustering to full-length reference

•  Not as prone to OTU inflation

•  Can add new data as available

•  Provides static Cluster IDs

– Can be used to compare short reads from different regions (v3-v5 and v6)

– Can compare with other projects using same Ref Set

Page 35: Marker Gene Analysis: Best Practices

Oligotyping

•  Further differentiation within closely related organisms (e.g., genus)

•  Rather than blanket 3% clustering, select sequence positions with the most information (Shannon Entropy)

Fusobacterium oligotypes across oral sites

Figure: oral sites shown are tonsils, subgingival plaque, supragingival plaque, saliva, tongue dorsum, throat, buccal mucosa, hard palate, and keratinized gingiva.
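The idea of selecting positions by Shannon entropy can be sketched as follows. This is a toy illustration on pre-aligned reads, not the oligotyping software itself; the function names and the choice of two positions are assumptions.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the bases in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def most_informative_positions(aligned_reads, n_positions=2):
    """Return the indices of the highest-entropy columns, plus all entropies."""
    entropies = [column_entropy(col) for col in zip(*aligned_reads)]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:n_positions]), entropies

def oligotype(read, positions):
    """Concatenate the bases found at the selected positions."""
    return "".join(read[i] for i in positions)

reads = ["ACGT", "ACGA", "TCGT", "TCGA"]
positions, entropies = most_informative_positions(reads)
types = [oligotype(r, positions) for r in reads]
```

Here the first and last columns each split the reads evenly (1 bit of entropy), while the two middle columns are invariant and carry no information.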

Page 36: Marker Gene Analysis: Best Practices

“But I’m not interested in the rare biosphere, only the major players. Can’t I just remove the low abundance OTUs?”

Page 37: Marker Gene Analysis: Best Practices

Consistent community profile across samples and environments

Figure: rank-abundance curves for samples 700106784, 700038978, and 700023096. X-axis: OTU Rank; y-axis: Count in OTU.

A small number of highly abundant organisms

A large number of low-abundance organisms: the Rare Biosphere

Sogin et al., 2006. Microbial diversity in the deep sea and the underexplored “rare biosphere”. PNAS 103: 12115-12120

Page 38: Marker Gene Analysis: Best Practices

Distribution of OTU relative abundances across 210 HMP stool samples

Huse et al. (2012) PLoS ONE

Page 39: Marker Gene Analysis: Best Practices

Distribution of OTU Absolute Abundances in English Channel Water Samples

Figure: number of OTUs by frequency in PML samples. X-axis: Frequency in PML Samples; y-axis: OTUs. Abundance classes: Absent, Singleton, Doubleton, 3-5, 6-10, 11-50, 51-500, >500.

Page 40: Marker Gene Analysis: Best Practices

Everything may not be everywhere, but everything is rare somewhere!

If you feel you must remove low abundance OTUs, don’t do it until you have clustered ALL of your samples

Page 41: Marker Gene Analysis: Best Practices

Alpha and Beta Diversity:

Impacts of Sampling Depth and Diversity Algorithm

Page 42: Marker Gene Analysis: Best Practices

Figure: Alpha Diversity - Richness. Legend: CL - ACE, SLP - ACE, CL - Chao, SLP - Chao, 1 in 5000, 1 in 2500, 1 in 1000, 1 in 500. Horizontal axis ticks run from 0 to 40,000; vertical axis ticks run from 0 to 1,800.

Alpha diversity metrics are sensitive to cluster method, sequencing depth and rare OTUs
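The sensitivity to rare OTUs is easy to see in the Chao1 richness estimator, which depends directly on the number of singletons and doubletons. Below is a minimal sketch of the bias-corrected form; the example counts are made up for illustration.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from a list of OTU counts."""
    observed = sum(1 for c in otu_counts if c > 0)
    singletons = sum(1 for c in otu_counts if c == 1)
    doubletons = sum(1 for c in otu_counts if c == 2)
    # The bias-corrected form stays defined when there are no doubletons.
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

clean = [10, 5, 2, 1, 1, 1]            # 6 observed OTUs, 3 singletons
inflated = [10, 5, 2] + [1] * 10       # same community plus spurious singletons

estimate_clean = chao1(clean)          # 6 + 3*2/(2*2) = 7.5
estimate_inflated = chao1(inflated)    # 13 + 10*9/(2*2) = 35.5
```

Seven extra singletons, which could be residual errors or chimeras, raise the estimate from 7.5 to 35.5.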

Page 43: Marker Gene Analysis: Best Practices

Figure: Sampling Depth and Alpha Diversity. Legend: SLP - NPShannon, SLP - Simpson, CL - NPShannon, Simpson. X-axis: Sampling Depth (0 to 45,000); y-axis: Diversity (0 to 5).

Robust to both singletons and depth

Page 44: Marker Gene Analysis: Best Practices

Comparing Different Sampling Depths

The “population” is a set of 50,000 reads from one sample.

The “samples” are randomly selected subsets of sizes: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000.

Calculate subsample diversity estimates across subsample depths which are representing the same population.

Page 45: Marker Gene Analysis: Best Practices

Subsample 1,000 and 5,000 reads from sample of 50,000 reads, Pairwise distances for replicates at single depth

Figure: Community Distance of Subsamples. Legend: Bray-Curtis (1K), Bray-Curtis (5K), Morisita-Horn (1K), Morisita-Horn (5K). X-axis: Replicates; y-axis: Community Distance (0 to 0.12).

Page 46: Marker Gene Analysis: Best Practices

Effect of Sample Depth - Bray-Curtis

Figure: pairwise Bray-Curtis distances between subsamples at depths of 1,000, 5,000, 7,500, 10,000, 15,000, 20,000, and 25,000; distance scale 0.000 to 1.000.

Bray-Curtis uses absolute counts; intra-community distances are high as depths diverge.

Nearly 100% Different

Page 47: Marker Gene Analysis: Best Practices

Effect of Sample Depth - Morisita-Horn

Figure: pairwise Morisita-Horn distances between subsamples at depths of 1,000, 5,000, 7,500, 10,000, 15,000, 20,000, and 25,000; distance scale 0.000 to 0.009.

Beta diversity metric that uses relative abundances and compensates for different sample sizes.

Distances are low across depths above the minimum sampling depth.

Nearly 0.5% Different
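The contrast between the two metrics can be reproduced with a small sketch. The two count vectors are made up: they describe the same community profile sampled at depths of 100 and 1,000 reads, with no sampling noise.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity on absolute counts."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))

def morisita_horn_distance(x, y):
    """1 minus the Morisita-Horn similarity, which uses relative abundances."""
    total_x, total_y = sum(x), sum(y)
    dx = sum(a * a for a in x) / total_x ** 2
    dy = sum(b * b for b in y) / total_y ** 2
    similarity = 2 * sum(a * b for a, b in zip(x, y)) / ((dx + dy) * total_x * total_y)
    return 1 - similarity

shallow = [50, 30, 15, 5]       # 100 reads
deep = [500, 300, 150, 50]      # 1,000 reads, identical proportions

bc = bray_curtis(shallow, deep)             # 900 / 1100, about 0.82
mh = morisita_horn_distance(shallow, deep)  # about 0
```

Bray-Curtis reports the two subsamples as more than 80% different purely because of depth, while Morisita-Horn reports essentially no difference.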

Page 48: Marker Gene Analysis: Best Practices

Figure: SLP Clustering and Bray-Curtis (PCoA). Depths: 1,000; 2,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000; 30,000; 40,000. X-axis: PC 1 (-0.4 to 0.8); y-axis: PC 2 (-0.3 to 0.4).

Bray-Curtis PCoA clusters entirely on depth (each point represents 10 atop one another)

Page 49: Marker Gene Analysis: Best Practices

Figure: SLP Clustering with Morisita Horn (PCoA). Depths: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000. Axes: PC 1 and PC 2, with ticks spanning roughly -0.015 to 0.01.

Minimum sample depth here of 10,000, but will be a function of the diversity of the sample

Page 50: Marker Gene Analysis: Best Practices

Acknowledgements

The Josephine Bay Paul Center for Comparative Molecular Biology and Evolution

Mitch Sogin

David Mark Welch

Hilary Morrison

Joe Vineis

A. Murat Eren

Anna Shipunova

Andy Voorhis

Sharon Grim

Page 51: Marker Gene Analysis: Best Practices

Why filter infrequent errors?

Ns          Average 454 Error Rate   Errors / 400 nt   Percent of Reads
0 or more   0.40%                    1.6               100%
0           0.40%                    1.6               99.3%

If we include all reads, with or without Ns, we have an overall error rate of 0.4%.

If, however, we remove the <1% of sequences with Ns, we still have an overall error rate of 0.4%.

Why bother??

454

Page 52: Marker Gene Analysis: Best Practices

Why filter infrequent errors?

Ns   Average Error Rate   Errors / 400 nt   Percent of Reads
0    0.40%                1.6               99.3%
1    1.11%                3.1               0.57%
2    3.81%                8.7               0.1%
3    7.26%                16.5              0.0%
4    8.40%                19.2              0.0%
5    10.46%               25.1              0.0%

It’s not just improving the overall error rate, but removing spurious data

Low-quality reads can be interpreted as unique organisms: 0.7% of 500,000 reads = 3,500 “unique organisms”
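The arithmetic behind both points can be checked with a short sketch that uses the error rates and read percentages from the table above. The weighted average is an approximation, since the published percentages are rounded.

```python
# (number of Ns, average error rate in percent, percent of reads)
rows = [
    (0, 0.40, 99.3),
    (1, 1.11, 0.57),
    (2, 3.81, 0.1),
    (3, 7.26, 0.0),
    (4, 8.40, 0.0),
    (5, 10.46, 0.0),
]

def weighted_error_rate(rows):
    """Average error rate weighted by each row's share of reads."""
    total_share = sum(share for _, _, share in rows)
    return sum(rate * share for _, rate, share in rows) / total_share

overall = weighted_error_rate(rows)          # all reads: about 0.41%
clean_only = weighted_error_rate(rows[:1])   # reads without Ns: 0.40%

# Reads with Ns are rare, but in a large dataset they still add up.
spurious = round(500_000 * 0.7 / 100)        # 3,500 reads
```

Removing reads with Ns barely moves the overall error rate, yet it removes thousands of reads that could each be mistaken for a unique organism.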

Page 53: Marker Gene Analysis: Best Practices

454 Error Distribution

454 Errors are not evenly distributed among reads: Many reads have only a small number of errors, and a small number of reads have many errors

Distribution of errors in short reads (<100nt)

454

Most reads contain no errors at all

Page 54: Marker Gene Analysis: Best Practices

A good beginning can mask a bad end

If a 450 nt read has a first 350 nt that average 35 and a last 100 nt that average 25: avg qual = ((350*35) + (100*25)) / 450 ≈ 33

If a 450 nt read has a first 400 nt that average 35 and a last 50 nt that average 0: avg qual = ((400*35) + (50*0)) / 450 ≈ 31
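A minimal sketch of why a window check catches what an average misses, using the 450 nt example with a first 400 nt at quality 35 and a last 50 nt at quality 0. The function names and the 50 nt window are assumptions for illustration.

```python
def average_quality(quals):
    """Mean quality over the whole read."""
    return sum(quals) / len(quals)

def lowest_window_quality(quals, window=50):
    """Lowest mean quality over any run of `window` consecutive bases."""
    window = min(window, len(quals))
    running = sum(quals[:window])
    lowest = running
    # Slide the window one base at a time, updating the running sum.
    for i in range(window, len(quals)):
        running += quals[i] - quals[i - window]
        lowest = min(lowest, running)
    return lowest / window

quals = [35] * 400 + [0] * 50
read_average = average_quality(quals)        # about 31, passes a Q30 filter
worst_window = lowest_window_quality(quals)  # 0, fails any window threshold
```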

Page 55: Marker Gene Analysis: Best Practices

Longer reads, pushing the limits

Page 56: Marker Gene Analysis: Best Practices

454 Filter Summary

Filter               Percent of Reads   Average Error Rate   Average Errors / 400 nt
N=0                  99%                0.40%                1.6
N>=1                 1%                 0.91%                3.6
Exact Primer         95%                0.38%                1.5
Not Exact Primer     5%                 0.84%                3.4
Average Qual >=30    98%                0.90%                3.6
Average Qual <30     2%                 1.3%                 5.2

454

Page 57: Marker Gene Analysis: Best Practices

454 Filter Summary (cont)

Filter                         Percent of Reads   Average Error Rate   Average Errors / 400 nt
Read Length (500 - 600 nt)     99+%               0.39%                1.6
Read Length (<500, >600 nt)    0.1%               1.8%                 7.2
Filtered                       93%                0.36%                1.4
Unfiltered                     7%                 0.64%                2.6

454

Page 58: Marker Gene Analysis: Best Practices

Evaluating Chimeras (USearch)

Query Parent A

Parent B

Diffs: A,B: Q matches expected P; a,b: Q matches other P; p: A=B!=Q

Votes: + for Model, 0 neutral, ! against Model

Model: shows extent of Parent A and Parent B; xxxx is overlap matching A&B

Page 59: Marker Gene Analysis: Best Practices

Click on the bar to see the alignment

Initial Length: 277

Extent of your sequence

Extent of your match

Page 60: Marker Gene Analysis: Best Practices

Check for left and right parents:

BLAST the left (1 - 175)

BLAST the right (175 - 277)

Page 61: Marker Gene Analysis: Best Practices

Figure: BLAST results for the two segments (positions 1 - 175 and 175 - 277). One segment is a 100% match to Fusobacterium; the other is a 100% match to Pseudomonas.

Page 62: Marker Gene Analysis: Best Practices

Taxonomic Names

•  Bergey’s Taxonomic Outline – manual of taxonomic names for bacteria

•  List of Prokaryotic names with Standing in Nomenclature (vetting process)

•  NCBI – similar taxonomy, but multiple “subs” (subclass, suborder, subfamily, tribe)

•  Archaea – a work in progress…

•  Fungi – another work in progress…

Page 63: Marker Gene Analysis: Best Practices

Cluster “Width”

Diameter: sequences are never more than D apart. (CL)

Radius: sequences are never more than R from seed. (SL, AL, Gr)

Page 64: Marker Gene Analysis: Best Practices

Average Linkage collapses errors

Cluster Count: 1

Clusters tend to be heavily dominated by their most abundant sequence, which strongly weights the average and smooths the noise.

Page 65: Marker Gene Analysis: Best Practices

Still lose outlier sequencing errors

Multiple sequencing errors still not clustered

Page 66: Marker Gene Analysis: Best Practices

Inflation in Action: Multiple Sequence Alignment and Complete Linkage clustering

1,042 is a few more than the expected 2

Page 67: Marker Gene Analysis: Best Practices

Example MSA

18,156 sequences and 392 positions

Regardless of clustering algorithm, an MSA cannot fully align tags whose sequences are too divergent

Page 68: Marker Gene Analysis: Best Practices

Relative Inflation

The absolute number of errant OTUs will increase with sample size.

The relative number of errant OTUs will decrease with sample complexity.

Page 69: Marker Gene Analysis: Best Practices

The Magical 3%

NOT!

3% SSU OTUs = Species and

6% SSU OTUs = Genera

Page 70: Marker Gene Analysis: Best Practices

Clustering Questions

•  How meaningful are clusters functionally?

•  When is a sequence rare and when is it an error?

•  Should it be included in an existing cluster or start its own?

•  How to place sequences if OTUs overlap?

•  What is the effect of residual low quality data or chimeras?

•  How sensitive are alpha and beta diversity estimates to clustering results?