9/15/2008

Sequencing the World of Possibilities for Energy & Environment

Processing of metagenomic data. Data reduction.

Konstantinos Mavrommatis
Genome Biology Program
KMavrommatis@lbl.gov

Metagenomic analysis

Sample → Extract DNA → Shotgun clone (3, 8, 40 kb) or pyrosequencing → High throughput sequence → Assemble reads → Call genes → Bin fragments

NO FINISHING!


• Large datasets
  – Large number of genes difficult to analyze manually.
  – Redundant sequences (used for copy estimates).
  – Loss of gene context information.
• Improvement of gene calls.
• Data reduction (database storage, computation efficiency).
• Preservation of gene context information.

Overview

• Processing of data produced using Sanger sequencing technology.

• Handling of 454 data.


Abundance comparisons

Gene prediction

• Quality of assembly affects gene prediction.
  – “Greedy” Phrap allows a larger number of predicted genes in contigs… but with a high error rate.
• Effect of gene prediction algorithm.
  – Fgenes/Genemark predict genes more accurately.
  – Fgenes/Genemark better represent the real content (function comparisons).
  – Metagene has a higher rate of wrong predictions.

[Figure panels: isolate; metagenome]

Mavromatis K. et al. Nat Methods. 2007 Jun;4(6).


Improvement of gene calls

• Compare genes against the nucleotide sequence of contigs/reads.

• Identify – correct gene calling errors.

• Identify – cluster similar sequences (>85% id.).

• Protein based assembly.

Gene correction

Select the gene with a hit to a COG/Pfam/TIGRfam, or the longest one.

Merge two genes that have hits to the same COG.
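The two rules above can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline's actual code: the `Gene` record, its field names, and both function names are hypothetical, conflicting gene calls are assumed to be grouped already, and genes to be merged are assumed to be neighbours on the same contig.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gene:
    """Hypothetical gene-call record; the field names are assumptions."""
    gene_id: str
    start: int
    end: int
    family_hit: Optional[str] = None  # COG/Pfam/TIGRfam accession, if any

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def select_gene(candidates: List[Gene]) -> Gene:
    """Among conflicting calls, prefer one with a family hit, else the longest."""
    with_hit = [g for g in candidates if g.family_hit is not None]
    pool = with_hit if with_hit else candidates
    return max(pool, key=lambda g: g.length)


def merge_same_family(genes: List[Gene]) -> List[Gene]:
    """Merge neighbouring genes (ordered by start) that hit the same family."""
    merged: List[Gene] = []
    for gene in sorted(genes, key=lambda g: g.start):
        prev = merged[-1] if merged else None
        if prev and prev.family_hit and prev.family_hit == gene.family_hit:
            # Extend the previous gene so it spans both fragments.
            merged[-1] = Gene(prev.gene_id, prev.start,
                              max(prev.end, gene.end), prev.family_hit)
        else:
            merged.append(gene)
    return merged
```

For example, `select_gene` applied to a 300 bp call without a hit and a 150 bp call with a COG hit would return the shorter call, because a family hit takes priority over length in this sketch.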


Gene clustering

• Find highly similar sequences (85% identity) and create gene clusters.
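A greedy clustering at an 85% identity cutoff could look like the sketch below. The slides do not say which clustering tool or identity measure was used, so everything here is an assumption: `ungapped_identity` is a crude position-wise stand-in for a real alignment identity, and `greedy_cluster` is a hypothetical helper that uses the longest sequence of each cluster as its representative.

```python
from typing import Callable, List


def ungapped_identity(a: str, b: str) -> float:
    """Crude stand-in for alignment identity: position-wise matches
    divided by the length of the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: List[str], threshold: float = 0.85,
                   identity: Callable[[str, str], float] = ungapped_identity
                   ) -> List[List[str]]:
    """Greedy incremental clustering; longer sequences are processed first,
    and the first member of each cluster acts as its representative."""
    clusters: List[List[str]] = []
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters.append([seq])
    return clusters
```

Swapping in a proper alignment-based identity only requires passing a different `identity` callable.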


Scaffold protein assembly

Use the order of genes to merge scaffolds.
• Reduced number of genes/scaffolds.
• Preservation of gene context.
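One possible reading of merging scaffolds by gene order is shown below. It assumes each scaffold is represented as an ordered list of gene-cluster identifiers and that two scaffolds can be joined when the last genes of one match the first genes of the other. The representation and the function name `merge_by_gene_order` are hypothetical; the slides do not describe the actual procedure in this detail.

```python
from typing import List, Optional


def merge_by_gene_order(left: List[str], right: List[str],
                        min_overlap: int = 1) -> Optional[List[str]]:
    """Join two scaffolds if the last k genes of `left` equal the first k
    genes of `right` (largest k wins); return None if they do not overlap."""
    max_k = min(len(left), len(right))
    # Require at least one shared gene, even if min_overlap is set lower.
    for k in range(max_k, max(min_overlap, 1) - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None
```

With `left = ["c1", "c2", "c3"]` and `right = ["c2", "c3", "c4"]`, the shared genes `c2, c3` are kept once and the merged scaffold preserves the gene context on both sides.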


Results

• Reduced size of metagenome.
• Increased average scaffold size.
  – 1094 → 1210
• Increased average gene size.
  – 627 → 692

Functional content

• No loss of functional content.
  – 78 genes have different or no Pfam hits.
  – Annotation is more precise. Pfams with non-significant hits were removed.

                      Pfam   COGs
Termite Gut #3        2138   2262
Processed T.gut 3c    2079   2233


Phylogenetic content


Super scaffolds and scaffold groups

• 1784 can be connected linearly.
• 846 super scaffolds.
• 5933 contigs can be connected.
• 1076 groups of scaffolds.


Population variation.


Overview

• Processing of data produced using Sanger sequencing technology.

• Handling of 454 data.


454

Facts
• Large size of sequenced data.
• Short reads.
  – minimal assembly.
  – no ab initio gene prediction.

Questions
• Gene prediction?
• Metabolic reconstruction?
• Binning?
• Storage in databases?
• Do we get a good representation of the community?

454

[Diagram labels: COGs, Pfam; rpsblast, HMMer; tblastx; proxy genes. Sensitivity? Accuracy?]


Proxy gene clusters

[Figure: members of a proxy gene cluster, each labeled “function A”]

454 simulated datasets

• 3 datasets.
• 4 different sequencing coverages (0.1X, 1X, 2X, 4X).

Dataset / Organism

Clostridium phytofermentans ISDg

M1

Prochlorococcus marinus NATL2A

Lactobacillus reuteri 100-23

Caldicellulosiruptor saccharolyticus DSM 8903

Clostridium sp. OhILAs

Herpetosiphon aurantiacus ATCC 23779

Bacillus weihenstephanensis KBAB4

Halothermothrix orenii H168

Clostridium cellulolyticum H10

M2

Geobacter sp. FRC-32

Burkholderia multivorans ATCC 17616

Delftia acidovorans SPH-1

Comamonas testosteroni KF-1

Geobacter lovleyi SZ

M3

Shewanella putrefaciens CN-32

Shewanella loihica PV-4

Halorhodospira halophila SL1

Pseudomonas putida F1

Shewanella baltica OS195

Bifidobacterium longum bv. infantis ATCC 15697

Stenotrophomonas maltophilia R551-3

Parvibaculum lavamentivorans DS-1


Accuracy

• Accuracy of assignment is not affected by e-value threshold.
• The number of assigned reads increases with looser thresholds.

• Sampling does not mix the datasets.
• Coverage effect. Higher coverage = more accurate representation.
• Proxy-cluster effect. Proxy clusters are very similar to the proxy genes.


Phylogenetic assignment

Data reduction

[Figure: line chart against coverage (x-axis: Coverage, 0 to 4; y-axis ticks from 1000 to 351000). Series: M1 - BBH, M1 - Clustering, M2 - BBH, M2 - Clustering, M3 - BBH, M3 - Clustering]


Thanks to…

Genome Biology Group
• Nikos Kyrpides
• Natalia Ivanova
• Thanos Lykidis
• Iain Anderson
• Sean Hooper
• Natalia Mikhailova

BTMTC group
• Victor Markowitz
• Ernest Szeto
• Krishna Palaniappan
• Yuri Grechkin
• Galina Ovchinnikova
• Amy Chen
• Ken Chu
• Daniel Dalevi

Overview

• Methods to evaluate the quality of algorithms used for metagenome analyses.

• Handling of 454 data.

• Reduction of redundancy in metagenomic dataset.


Metagenomic datasets

Low complexity (EBPR sludge): Garcia Martin et al., Nature Biotech 2006
Medium complexity (Acid mine drainage): Tyson et al., Nature 2004
High complexity (Farm soil): Tringe et al., Science 2005

Metagenomic analysis

• Methods for assembly and gene prediction were developed for isolate genome analysis.
• How accurate are they in metagenomes?
• Binning methods have not been compared to each other.

• Create simulated metagenomes from isolate genomes by combining random paired reads.
• Assemble, predict genes, bin the “metagenomes”.
• Track reads of individual organisms in the assembled contigs and bins.
• Evaluate each assembly method.
• Compare gene predictions against isolate genomes.
• Evaluate binning methods.
• Provide datasets for benchmarking of new methods.
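The first step, combining random paired reads from isolate genomes while keeping track of their origin, might be sketched as follows. This is a minimal illustration, not the procedure actually used for the simulated datasets: the function name, the `read_pools` and `pairs_per_organism` arguments, and the in-memory representation of reads are all assumptions.

```python
import random
from typing import Dict, List, Tuple

ReadPair = Tuple[str, str]  # (forward read, reverse read)


def simulate_metagenome(read_pools: Dict[str, List[ReadPair]],
                        pairs_per_organism: Dict[str, int],
                        seed: int = 0) -> List[Tuple[str, ReadPair]]:
    """Draw read pairs at random (without replacement) from each isolate and
    mix them; every pair keeps its source label so it can be tracked later
    in contigs and bins."""
    rng = random.Random(seed)
    mixed: List[Tuple[str, ReadPair]] = []
    for organism, pool in read_pools.items():
        # Never request more pairs than the isolate's pool contains.
        n = min(pairs_per_organism.get(organism, 0), len(pool))
        mixed.extend((organism, pair) for pair in rng.sample(pool, n))
    rng.shuffle(mixed)
    return mixed
```

Varying the number of pairs drawn per organism is one way to mimic communities in which a few members dominate or in which no member is well covered.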


Simulated datasets composition

[Figure: three bar charts, SimLC (e.g. sludge), SimMC (e.g. AMD) and SimHC (e.g. soil); y-axis: coverage (x), 0 to 6; x-axis: isolate genomes]

All 3 simsets comprise ~100,000 reads.

Assembly overview

• Assembly:
  – Phrap (Phil Green, Univ. Washington).
  – JAZZ (JGI).
  – Arachne (Broad Institute, MIT).


Assembly fidelity

Homogeneous - one isolate genome only: fidelity = 100%

Chimera of two strains of same species: fidelity = 70% (but same species)

Chimera of multiple species: fidelity = 90%
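The three examples are consistent with reading fidelity as the percentage of a contig's reads that come from its most represented source genome. That definition is an assumption inferred from the numbers on this slide, and the function name below is hypothetical.

```python
from collections import Counter
from typing import Iterable


def contig_fidelity(read_sources: Iterable[str]) -> float:
    """Percentage of reads in a contig that come from its dominant source.

    `read_sources` holds one source label (e.g. an isolate name) per read.
    """
    counts = Counter(read_sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("contig has no reads")
    return 100.0 * max(counts.values()) / total
```

Under this reading, a contig built from 7 reads of one strain and 3 reads of another scores 70%, and a contig with 9 of 10 reads from a single species scores 90%.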

Assembly comparison

[Figure: % chimeric sequence and degree of chimerism]


Assembly summary

• Approximately 80-90% of the contigs are homogeneous.
• Small contigs (<8 Kb) are more likely to be chimeric (simLC).
• The presence of closely related species creates chimeric contigs by mixing reads from different strains in simMC.
• SimHC is practically unassembled.
• Phrap is the “greediest” assembler. It produces the highest number of chimeric contigs.
• Arachne (and JAZZ) is more conservative and produces better quality contigs.

Common sequences between reads of different organisms result in “good” alignments… and chimeric contigs.
• Bad trimming.
  – Vector sequences at the ends.
  – Poor quality sequence.
• Common sequences (phages, transposases, ABC transporters).
• Low thresholds for similarity/length of alignment.

Gene prediction overview

• Gene prediction:
  – Evidence based/ab initio prediction:
    • ORNL pipeline (Critica, GLIMMER).
  – ab initio prediction:
    • fgenesb (SoftBerry).
    • metagene (Noguchi et al., NAR 2006).


Errors in gene prediction

• Errors usually at the end of reads/contigs.
• Bad trimming. Vector sequences at the ends are “interpreted” as fragments of genes.
• Poor quality sequence (frameshifts, N). Produces fragments on the wrong strand.
• Regions of chimeric assembly.

Binning overview

• Pattern discovery [domain - family] (PhyloPythia).
• Sequence composition [family] (kmer).
  – 7mer NNNNNNN, 8mer NNxNNxNN.
• Assignment based on blast hits [class] (BLAST distr).
  – Databases with 130 and ~600 genomes.
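The composition patterns can be illustrated with a small word counter. The sketch assumes that `N` marks positions that contribute to the word and `x` marks positions that are ignored, so `NNxNNxNN` spans 8 bases but records 6 of them. That interpretation and the function name are assumptions, and the sketch does not reproduce how the kmer binning method itself uses these counts.

```python
from collections import Counter


def spaced_kmer_counts(seq: str, pattern: str = "NNxNNxNN") -> Counter:
    """Count spaced words along a sequence: 'N' positions in the pattern
    are kept, 'x' positions are skipped."""
    keep = [i for i, c in enumerate(pattern) if c == "N"]
    span = len(pattern)
    seq = seq.upper()
    counts: Counter = Counter()
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        word = "".join(window[i] for i in keep)
        if set(word) <= set("ACGT"):  # skip words with ambiguous bases
            counts[word] += 1
    return counts
```

Calling `spaced_kmer_counts(contig, "NNNNNNN")` gives plain 7mer counts; normalizing the resulting counts per contig would give a composition vector that can be compared between contigs.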

[Figure: Ideal scenario: bins correspond to isolates (preferably strains). In reality: bins correspond to groups of isolates (such as families, domains, rarely species).]


Binning comparison (8 Kb contigs)

[Figure: two bar charts, Specificity (8 Kb) and Sensitivity (8 Kb), each on a 0 to 1 scale, with series simLC, simMC and simHC. Methods compared, each on Arachne, JAZZ and Phrap assemblies: BLAST distr (v2) 1, BLAST distr (v2) 2, BLAST distr 1, BLAST distr 2, gen PhyloPythia (p:0.85), ssp PhyloPythia (p:0.85), kmer (8mer).]


Binning results

• Strain sequences are typically assigned to multiple bins (unavoidable?).

• Individual bins typically contain more than one isolate (even different phyla).

• PhyloPythia is more consistent.

• Improvement of the protein databases increases the binning accuracy significantly.

• Methods still need improvement.


http://fames.jgi-psf.org

• New methods are tested.
• Users download the data.
• Apply their method.
• Download the results for comparison with the existing methods.

K. Mavrommatis, A. Greiner


Conclusions

• There is no “gold standard” method for metagenomic analysis.

• Each tool/combination of tools has different advantages and disadvantages.

• Each dataset poses different challenges and requires different handling.

• Iterative process or use of multiple methods could be the key for better processing.

• Best combination … so far: Arachne / fgenesB-genemark / PhyloPythia.