Amplicon Sequencing, De-Noising And Diversity Estimation

Anders Lanzén
Dept. of Biology & Centre for Geobiology
University of Bergen

Tuesday, 17 April 2012



Outline

• Introduction, sequencing technologies

• Different types of de-noising and filtering

• AmpliconNoise

• Chimera removal

• OTU Clustering

• Parametric diversity estimation


Introduction

• NGS has revolutionised microbial ecology

• Diversity - number of unique sequences / OTUs

• Taxonomical structure - what is in the sample?

• In prokaryotes, typically SSU rRNA (16S)

• First results from studies were striking (e.g. Sogin: 30k species from 600k reads), but vastly overestimated diversity of “the rare biosphere” [Kunin et al 2009, Quince et al 2009]


NGS Technologies

Name              ~Reads / run          ~Cost / run   Read length   AmpliconNoise
454 GS Titanium   1M                    €10,000       450 bp        Yes
IonTorrent        2M                    €1,000        240 bp        Soon?
Illumina HiSeq    80M (per lane, PE)    €2,000        100-150 bp    No


The noise problem

• No resequencing possible (no cloning step)

• 454: 0.25% error rate after filtering --> 3% of reads with > 4% incorrect bases

• --> Huge overprediction of OTUs without filtering (>50,000 per run)

• Plus PCR noise, incl. chimeric sequences


Denoising Pipelines

454 Sequencing

Filtering

Clustering (once or more)

Chimera removal

OTU clustering and taxonomic classification

Base-calling (not in AmpliconNoise)


Noise Removal Pipelines

Method | Distance metric | Algorithm | Source

PyroNoise (Quince et al. 2009) | Flowgram | Probabilistic, iterative |

Pyrotagger (Kunin et al. 2010) | Sequence | Minimum distance threshold, one-pass agglomerative | Pyrotagger website

Single-linkage preclustering (SLP) (Huse et al. 2010) | Sequence | Minimum distance threshold, one-pass agglomerative | VAMPS website

DeNoiser (Reeder & Knight 2010) | Flowgram | Minimum distance threshold, agglomerative | QIIME

AmpliconNoise (Quince et al., BMC Bioinformatics 2011) | PyroNoise: flowgram; SeqNoise: sequence | Probabilistic, iterative | Google Code: http://code.google.com/p/ampliconnoise/; Erick Matsen: https://github.com/fhcrc/ampliconnoise; MOTHUR


Flowgrams

The pyrosequencing raw data

• Sequence is read as light intensity when bases are washed over the 454 plate in the order T, C, A, G

Example: 0.03 1.03 0.09 0.12 0.89 0.09 0.09 1.01 0.11 1.03 0.12 0.12 2.00 0.12 1.92

Translation: CTGCTTAA
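
The translation above can be reproduced with naive base-calling: round each flow intensity to the nearest whole number and emit that many copies of the base for that flow. This is a minimal sketch, not the 454 base-caller itself; the function name call_bases and the round-half-up rule are assumptions made for illustration.

```python
FLOW_ORDER = "TCAG"  # nucleotide flow order stated on the slide


def call_bases(flowgram, flow_order=FLOW_ORDER):
    """Translate flow intensities into a sequence by rounding each flow."""
    bases = []
    for i, intensity in enumerate(flowgram):
        base = flow_order[i % len(flow_order)]
        # Round half up to the nearest homopolymer length (never negative).
        n = max(0, int(intensity + 0.5))
        bases.append(base * n)
    return "".join(bases)


example = [0.03, 1.03, 0.09, 0.12, 0.89, 0.09, 0.09, 1.01,
           0.11, 1.03, 0.12, 0.12, 2.00, 0.12, 1.92]
print(call_bases(example))  # CTGCTTAA
```

Rounding discards how close each signal was to the decision boundary, which is why intensities near x.5 are fragile under this scheme.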


Flowgrams

• Since intensity values are inexact, similar flowgrams can lead to very different sequences after base-calling

Read #1 TGGGGGCAAAAA
        ||||||  ||||
Read #2 TGGGGGGCAAAA

Similarity = 10/12 = 83%

[Figure: flow intensities of Read #1 and Read #2; axis labels A, G, T, C, A; plotted values 4.47, 0.96, 0.04, 5.52, 0.01 and 4.51, 1.01, 0.05, 5.49, 0.02]

• Solution: Flowgram-to-sequence clustering. This needs a distance metric!
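
The 83% figure above is a position-by-position comparison of the two base-called reads. A minimal sketch of that calculation follows; ungapped_identity is a hypothetical helper name, and no alignment with gaps is attempted.

```python
def ungapped_identity(a, b):
    """Fraction of positions at which two equal-length reads agree."""
    if len(a) != len(b):
        raise ValueError("reads must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


read1 = "TGGGGGCAAAAA"
read2 = "TGGGGGGCAAAA"
print(f"{ungapped_identity(read1, read2):.0%}")  # 83%
```

Two homopolymer miscalls of one base each are enough to drop the identity well below a 97% OTU cutoff, although the underlying flowgrams differ only slightly.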


• Calculation of probability density of P(I|n) where I = signal intensity, n= homopolymer length
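
A flowgram distance can be built from P(I|n) by summing negative log-probabilities over flows. The slide does not give the form of the density, so the sketch below assumes a normal distribution centred on n with a made-up standard deviation (sigma=0.15); both are illustrative assumptions, as are the function names.

```python
import math


def log_p_intensity(intensity, n, sigma=0.15):
    """Log density of a normal distribution centred on homopolymer length n.

    The normal shape and the value of sigma are illustrative assumptions.
    """
    z = (intensity - n) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def flowgram_distance(flowgram, perfect):
    """Mean negative log-likelihood of a flowgram given a perfect flowgram."""
    if len(flowgram) != len(perfect):
        raise ValueError("flowgrams must have the same length")
    total = sum(-log_p_intensity(i, n) for i, n in zip(flowgram, perfect))
    return total / len(flowgram)


observed = [0.03, 1.03, 0.09, 0.12, 0.89, 0.09, 0.09, 1.01]
candidate_a = [0, 1, 0, 0, 1, 0, 0, 1]  # matches the rounded signal
candidate_b = [0, 1, 0, 0, 2, 0, 0, 1]  # one extra T in the fifth flow
print(flowgram_distance(observed, candidate_a)
      < flowgram_distance(observed, candidate_b))  # True
```

Unlike sequence identity, this distance stays small when an intensity sits close to a rounding boundary, which is the motivation for clustering flowgrams rather than base-called reads.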


AmpliconNoise

• Samples in 454 sff file format

• Split on barcodes

• Filter flowgrams

• Extract flowgrams

• PyroNoise: probabilistic flowgram clusterer (removes pyrosequencing errors)

• SeqNoise: probabilistic sequence clusterer (removes PCR single base errors)

• Perseus: chimera classifier (removes PCR chimeras)

• OTU construction: using complete linkage and exact pairwise alignments (use best available OTU construction algorithm)


AmpliconNoise Algorithm

1. Pre-filtering: Truncates at base 360 or first noisy flow

2. PyroDist: Calculates the distance matrix between all flowgrams and perfect flowgrams (in the first step taken from hierarchical clustering of flowgrams to base-called sequences)

3. PyroNoise: Bayesian probability of each flowgram coming from a particular perfect flowgram --> reassignment of flowgrams to the most likely sequence they come from, until convergence. Iterative ML algorithm


AmpliconNoise Algorithm

4. Base calling to unique sequences, cleaned of sequencing noise

5. SeqDist and SeqNoise: Cleaning of PCR point mutation artefacts

6. Perseus (or PerseusM): Probabilistic self-referenced chimera removal


Chimera removal - Perseus

• Assumes that a chimera’s parents will be in the dataset with equal or greater frequency than the chimera

• For each sequence, directly search for possible parents and break points that give a close match

• Only search for parents amongst more abundant sequences

• Define chimera index – reflecting the probability that sequence was generated by evolution
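
The parent and break-point search described above can be sketched as a brute-force scan. This is a toy illustration, not Perseus: it assumes pre-aligned sequences of equal length, uses a plain mismatch count instead of a chimera index, and only accepts strictly more abundant parents. The name find_parents and the max_mismatches parameter are hypothetical.

```python
def find_parents(query, candidates, max_mismatches=0):
    """Return (left_parent, right_parent, breakpoint) or None.

    query is (sequence, abundance); candidates is a list of such pairs.
    Sequences are assumed to be pre-aligned and of equal length.
    """
    seq, abundance = query
    # Assumption: only strictly more abundant sequences may act as parents.
    parents = [s for s, a in candidates
               if a > abundance and len(s) == len(seq)]
    for left in parents:
        for right in parents:
            if left == right:
                continue
            for bp in range(1, len(seq)):
                # Left part is compared to one parent, right part to the other.
                mism = sum(x != y for x, y in zip(seq[:bp], left[:bp]))
                mism += sum(x != y for x, y in zip(seq[bp:], right[bp:]))
                if mism <= max_mismatches:
                    return left, right, bp
    return None


parent_a = "ACGTACGTAC" + "GGGGGTTTTT"
parent_b = "TTGCATTGCA" + "CCCCCAAAAA"
chimera = parent_a[:10] + parent_b[10:]
pool = [(parent_a, 100), (parent_b, 80), (chimera, 3)]
print(find_parents((chimera, 3), pool))
```

A real classifier also has to weigh the chimeric explanation against the sequence being a genuine variant of a single parent, which is the role of the chimera index on the slide.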


Results

• Pyrosequencing (FLX) of artificial community:

• 90 clones with “known sequence”, i.e. Sanger-sequenced to depth

• V5 + V6 hypervariable regions

• Different concentrations

• Multiple alignment to determine OTUs at different cutoffs.

• The correct result is known!


Benchmarking


No more “Rare Biosphere”?

• Some environments still appear very diverse

• Rift Valley soda lakes – 11,835 filtered reads gave 863 3% OTUs without noise removal

• Following noise removal, 585 OTUs are obtained: a 1/3 reduction in observed diversity

• Abundance distribution still highly skewed


How to Use

• http://code.google.com/p/ampliconnoise/

• Manual with installation instructions and detailed description of scripts, programs etc.

• Requires MPI and the GNU Scientific Library (GSL) to run on a cluster or SMP machine (for all but small datasets)

• Big datasets can be pre-clustered to reduce time.


Convenience Scripts

• RunPreSplit: Normal pipeline for de-noising and OTU-clustering a one-barcode sample from GS Titanium

• RunPreSplitXL: For bigger datasets that require pre-splitting step

• RunTitanium: Normal pipeline for de-noising SFF file with all barcodes in it

• ampliconflow.jar: Java library for processing of SFF files and OTUs

• Also gives statistics and some diversity indices


Concepts

Operational Taxonomic Unit (OTU)

• Used instead of species concept

• Typically 97% similarity cluster of 16S gene

• Meaningful?

• Ecotype studies suggest 99%

• Multiple 16S genes

• Subunit cloning


OTU Clustering

• Purpose: construct clusters with similar sequences, e.g. to estimate diversity more robustly

• Agglomerative hierarchical clustering or “linkage clustering”

• Distance matrix from multiple alignment used for distance between sequences

• Produces OTUs (Operational Taxonomic Units), typically with 3% maximum distance limit


Linkage Clustering

Start by linking together sequences with the least distance.

Then group together clusters, as long as distance is <3%.


Linkage Clustering

• 3 Types of Linkage clustering, using different ways to measure distance between clusters

• 1) Maximum / Complete Linkage Clustering

• 2) Minimum / Single Linkage Clustering

• 3) Average Linkage Clustering (UPGMA)
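
The three linkage types differ only in how the distance between two clusters is derived from the pairwise sequence distances. A minimal pure-Python sketch follows, assuming a precomputed symmetric distance matrix with distances as fractions (0.03 = 3%); linkage_cluster is a hypothetical name and the quadratic search is for illustration, not for large datasets.

```python
def linkage_cluster(dist, threshold=0.03, linkage="complete"):
    """Agglomerative clustering of items 0..n-1 from a distance matrix.

    linkage: "complete" (maximum), "single" (minimum) or "average" (UPGMA).
    Clusters are merged while the closest pair is below the threshold.
    """
    combine = {
        "complete": max,
        "single": min,
        "average": lambda ds: sum(ds) / len(ds),
    }[linkage]

    def cluster_distance(a, b):
        return combine([dist[i][j] for i in a for j in b])

    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        d, x, y = min(
            (cluster_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d >= threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Sequences 0-1 and 1-2 are 2% apart, 0-2 are 5% apart, 3 is distant.
dist = [
    [0.00, 0.02, 0.05, 0.10],
    [0.02, 0.00, 0.02, 0.10],
    [0.05, 0.02, 0.00, 0.10],
    [0.10, 0.10, 0.10, 0.00],
]
print(linkage_cluster(dist, linkage="single"))    # [[0, 1, 2], [3]]
print(linkage_cluster(dist, linkage="complete"))  # [[0, 1], [2], [3]]
```

Single linkage chains sequence 2 into the first OTU through sequence 1, while complete linkage keeps it out because sequences 0 and 2 are more than 3% apart.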


Chao-1

• Commonly used, non-parametric heuristic (Chao, 1987)

• Based on #OTUs, #singletons and #doubletons

• Gives lower-bound of diversity

• Increases with sequencing depth and very sensitive to un-removed noisy sequences
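
The usual form of the estimator is S_chao1 = S_obs + F1^2 / (2 F2), where S_obs is the number of observed OTUs, F1 the number of singletons and F2 the number of doubletons. A minimal sketch follows; the function name chao1 and the fallback to the bias-corrected form when there are no doubletons are choices made here for illustration.

```python
from collections import Counter


def chao1(otu_counts):
    """Chao-1 richness estimate from a list of per-OTU read counts."""
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]  # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used here only to avoid division by zero.
    return s_obs + f1 * (f1 - 1) / 2.0


clean = [10, 5, 2, 2, 1, 1, 1, 1]
noisy = clean + [1, 1, 1, 1]  # four spurious singleton OTUs
print(chao1(clean))  # 12.0
print(chao1(noisy))  # 28.0
```

Because the singleton count enters squared, four noisy singletons more than double the estimate in this toy example, which illustrates the sensitivity to un-removed noise noted above.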


[Figure from Gihring et al (2009), “Pyrosequencing exacerbates sample size bias”, Environmental Microbiology. © 2011 Society for Applied Microbiology and Blackwell Publishing Ltd]


A Novel Bayesian Approach

• Quince et al (2008), “The rational exploration of microbial diversity”, ISME Journal (to appear)

• Bayes’ theorem: P(parameters|data) ∝ P(data|parameters) × P(parameters), i.e. posterior ∝ likelihood × prior

• In this case: find the best fit of the observed taxa-abundance distribution and diversity to an underlying parametric distribution and sample diversity


Parametric Bayesian Approach (Quince et al, 2009)

• 3 different distributions tried:

  • Log-normal

  • Inverse Gaussian

  • Sichel distribution

• Some others shown to be poorer fit to samples (Exponential, gamma and mixtures)

• Maximise posterior probability using Markov chain Monte-Carlo sampling of parameters

• Verified on census rainforest tree dataset (Barro Colorado Island complete census of 222,655 trees identified to 303 species)


Application to Arctic soils

Name                Soil type     Age (yrs)   Sample size   Clean no.   3% OTUs   3% Chao
Midtre Lovenbre 7   Tundra        2000        33,445        23,105      1,936     2,751
Midtre Lovenbre 1   Rock          10          35,569        24,004      1,578     2,339
Knutsenheia         Desert        n.a.        21,474        14,338      1,599     2,459
Storholmen          Island        1000        23,679        14,586      1,554     2,349
14th July no. 2     Bird cliffs   n.a.        32,661        23,264      1,763     2,405
14th July no. 3     Bird cliffs   n.a.        19,796        11,846      1,398     2,056


Arctic soils

Name                Log-normal       Inverse Gaussian   Sichel
Midtre Lovenbre 7   4542:5369:6546   3615:4006:4520     2994:3234:3565
Midtre Lovenbre 1   3787:4628:5994   2939:3307:3803     2421:2654:2990
Knutsenheia         3892:4728:6032   3179:3616:4275     2752:3117:3934
Storholmen          4099:5126:6758   3208:3705:4438     2746:3151:3905
14th July no. 2     3815:4476:5427   2979:3257:3624     2698:2933:3282
14th July no. 3     2990:3552:4420   2521:2827:3281     2176:2416:2777


90% sampling efforts

Name                Log-normal   Inverse Gaussian   Sichel
Midtre Lovenbre 7   1.92e+06     2.58e+05           1.16e+05
Midtre Lovenbre 1   2.94e+06     2.84e+05           1.26e+05
Knutsenheia         1.37e+06     1.97e+05           1.16e+05
Storholmen          2.25e+06     2.32e+05           1.32e+05
14th July no. 2     1.61e+06     1.88e+05           1.25e+05
14th July no. 3     5.98e+05     1.19e+05           6.63e+04
