Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
9/15/2008
1
Sequencing the World of Possibilities for Energy & Environment
Processing of metagenomicdata. Data reduction.
Konstantinos MavrommatisGenome Biology Program
KMa rommatis@lbl [email protected]
Sequencing the World of Possibilities for Energy & Environment
Metagenomic analysis
S l E t t DNA Sh t l Hi h th h t
pyrosequencing
Sample Extract DNA Shotgun clone High throughput sequence
3, 8, 40 kb
Assemble reads Call genes Bin fragmentsNO FINISHING!
9/15/2008
2
Sequencing the World of Possibilities for Energy & Environment
• Large datasetsLarge number of genes– Large number of genes difficult to analyze manually.
– Redundant sequences (used for copy estimates).
– Loss of gene context information.
• Improvement of gene calls.• Data reduction (database
storage comp tationstorage, computation efficiency).
• Preservation of gene context information.
Sequencing the World of Possibilities for Energy & Environment
Overview
• Processing of data produced using Sanger sequencing technology.
• Handling of 454 data.
9/15/2008
3
Sequencing the World of Possibilities for Energy & Environment
Abundance comparisons
Sequencing the World of Possibilities for Energy & Environment
Gene prediction
• Quality of assembly affects gene prediction.– “Greedy” Phrap allows larger number of
predicted genes in contigs… but high error rate.
• Effect of gene prediction algorithm.– Fgenes/Genemark predict genes more
accurately.– Fgenes/Genemark better represent the real
content (function comparisons).– Metagene has higher rate of wrong
predictions.
isolateisolate
metagenomemetagenome
Mavromatis K. etal.Nat Method Jun;4(6)2007.
9/15/2008
4
Sequencing the World of Possibilities for Energy & Environment
Improvement of gene calls
• Compare genes against the nucleotide sequence of contigs/reads.
• Identify – correct gene calling errors.
• Identify – cluster similar sequences (>85% id.).
• Protein based assembly.Protein based assembly.
Sequencing the World of Possibilities for Energy & Environment
Gene correction
Select the gene with hit to a COG/ pfam/TIGRfam or the longest one.
Merge two genes that have hits to the same COG.
9/15/2008
5
Sequencing the World of Possibilities for Energy & Environment
Gene clustering
• Find highly similar g ysequences and create gene clusters.
85% identity
Sequencing the World of Possibilities for Energy & Environment
Scaffold protein assembly
Use the order of genes to merge scaffolds.•Reduced number of genes/scaffolds.•Preservation of Gene context
9/15/2008
6
Sequencing the World of Possibilities for Energy & Environment
Results
• Reduced size of metagenome.• Increased average scaffold size.
– 1094 1210• Increased average gene size.
– 627 692
Sequencing the World of Possibilities for Energy & Environment
Functional content
• No loss of functional content– 78 genes have different or no Pfam hits.– Annotation is more precise. Pfams with not
significant hits were removed.
Pfam COGsTermite Gut #3 2138 2262Termite Gut #3Processed T.gut 3c 2079 2233
9/15/2008
7
Sequencing the World of Possibilities for Energy & Environment
Phylogenetic content
Sequencing the World of Possibilities for Energy & Environment
Super scaffolds and scaffold groups
• 1784 can be connected linearly.• 846 super scaffolds• 846 super scaffolds.
• 5933 contigs can be connected.• 1076 groups of scaffolds.
9/15/2008
8
Sequencing the World of Possibilities for Energy & Environment
Population variation.
Sequencing the World of Possibilities for Energy & Environment
Overview
• Processing of data produced using Sanger sequencing technology.
• Handling of 454 data.
9/15/2008
9
Sequencing the World of Possibilities for Energy & Environment
454
Facts• Large size of sequenced data.• Short reads.
– minimal assembly.– no ab initio gene prediction.
QuestionsG di ti ?• Gene prediction?
• Metabolic reconstruction?• Binning?• Storage in databases?• Do we get a good representation of the community?
Sequencing the World of Possibilities for Energy & Environment
454
COGsPfam
.
.
rpsblastHMMer
.tblastx rpsblast
HMMer
Sensitivity?Accuracy?
Proxygenes
9/15/2008
10
Sequencing the World of Possibilities for Energy & Environment
Proxy gene clusters
function A
function A
f ti Afunction A
function A
Sequencing the World of Possibilities for Energy & Environment
454 simulated datasets
• 3 datasetsDataset Organism
Clostridium phytofermentans ISDg
P hl i NATL2A• 3 datasets.• 4 different
sequencing coverages (0.1X, 1X, 2X, 4X)
M1
Prochlorococcus marinus NATL2A
Lactobacillus reuteri 100-23
Caldicellulosiruptor saccharolyticus DSM 8903
Clostridium sp. OhILAs
Herpetosiphon aurantiacus ATCC 23779
Bacillus weihenstephanensis KBAB4
Halothermothrix orenii H168
Clostridium cellulolyticum H10
M2
Geobacter sp. FRC-32
Burkholderia multivorans ATCC 17616
Delftia acidovorans SPH-14X). M2 Delftia acidovorans SPH 1
Comamonas testosteroni KF-1
Geobacter lovleyi SZ
M3
Shewanella putrefaciens CN-32
Shewanella loihica PV-4
Halorhodospira halophila SL1
Pseudomonas putida F1
Shewanella baltica OS195
Bifidobacterium longum bv. Infantis ATCC 15697
Stenotrophomonas maltophilia R551-3
Parvibaculum lavamentivorans DS-1
9/15/2008
11
Sequencing the World of Possibilities for Energy & Environment
Accuracy
• Accuracy of assignment is notAccuracy of assignment is not affected by e-value threshold.
• The number of assigned reads increases with loser thresholds.
Sequencing the World of Possibilities for Energy & Environment
• Sampling does not mixSampling does not mix the datasets.
• Coverage effect. Higher coverage = more accurate representation.
• Proxy-cluster effect Proxy• Proxy-cluster effect. Proxy clusters are very similar to the proxy genes.
9/15/2008
12
Sequencing the World of Possibilities for Energy & Environment
Phylogenetic assignment
Sequencing the World of Possibilities for Energy & Environment
Data reduction
101000
151000
201000
251000
301000
351000
M1 - BBHM1 - ClusteringM2 - BBHM2 - ClusteringM3 - BBHM3 - Clustering
1000
51000
0 1 2 3 4
Coverage
9/15/2008
13
Sequencing the World of Possibilities for Energy & Environment
Thanks to…
G Bi l GGenome Biology Group• Nikos Kyrpides• Natalia Ivanova• Thanos Lykidis• Iain Anderson• Sean Hooper• Natalia Mikhailova
BTMTC group•Victor Markowitz•Ernest Szeto•Krishna Palaniappan•Yuri Grechkin• Galina Ovchinnikova •Yuri Grechkin•Amy Chen•Ken Chu•Daniel Dalevi
Sequencing the World of Possibilities for Energy & Environment
Overview
• Methods to evaluate the quality of algorithms used for metagenome analyses.
• Handling of 454 data.
• Reduction of redundancy in metagenomic dataset.
9/15/2008
14
Sequencing the World of Possibilities for Energy & Environment
Metagenomic datasets
High complexityHigh complexityLow complexityLow complexity Medium complexityMedium complexity High complexityHigh complexity(Farm soil)(Farm soil)
Tringe et al.Tringe et al.
Low complexityLow complexity(EBPR sludge)(EBPR sludge)
Garcia Martin et al.Garcia Martin et al. Tyson et al.Tyson et al.
Medium complexityMedium complexity(Acid mine drainage)(Acid mine drainage)
Science 2005Science 2005Nature Biotech 2006Nature Biotech 2006 Nature 2004Nature 2004
Sequencing the World of Possibilities for Energy & Environment
Metagenomic analysis
• Methods for assembly and gene prediction were developed for i l l iisolate genomes analysis.
• How accurate are they in metagenomes?• Binning methods have not been compared to each other.
• Create simulated metagenomes from isolate genomes by combining random paired reads.
• Assemble, predict genes, bin the “metagenomes”.• Track reads of individual organisms in the assembled contigs and bins.• Evaluate each assembly method.• Compare gene predictions against isolate genomes.• Evaluate binning methods.• Provide datasets for benchmarking of new methods.
9/15/2008
15
Sequencing the World of Possibilities for Energy & Environment
4
5
6
SimLC(e.g. sludge)
Simulated Datasets composition
0
1
2
3
2
3
4
5
6
SimMC(e.g. AMD)
vera
ge (
x)ve
rage
(x)
All 3 simsets comprise ~100,000 readsAll 3 simsets comprise ~100,000 reads
0
1
2
3
4
5
60
1
SimHC(e.g. soil)
Cov
Cov
isolate genomesisolate genomes
Sequencing the World of Possibilities for Energy & Environment
Assembly overview
• Assembly:– Phrap (Phil Green, Univ. Washington).– JAZZ (JGI).– Arachne (Broad Institute, MIT).
9/15/2008
16
Sequencing the World of Possibilities for Energy & Environment
Assembly fidelity
Homogeneous - one isolate genome only: fidelity = 100%
Chimera of two strains of same species: fidelity = 70% (but same species)
Chimera of multiple species: fidelity = 90%
Sequencing the World of Possibilities for Energy & Environment
Assembly comparison
eric
seq
uenc
e
of c
him
eris
m
% c
him
e
Deg
ree
o
9/15/2008
17
Sequencing the World of Possibilities for Energy & Environment
Assembly summary
• Approximately 80 -90 % of the contigs are homogeneous.• Small contigs (<8 Kb) are more likely to be chimeric (simLC).• The presence of closely related species creates chimeric contigs by mixing reads from
different strains in simMC.• SimHC is practically unassembled.• Phrap is the “greediest” assembler. Produces the highest number of chimeric contigs.• Arachne (and JAZZ) is more conservative and produces better quality contigs.
Common sequences between reads of different organisms result in “good” alignments…and chimeric contigs… and chimeric contigs.
• Bad trimming. – Vector sequences at the ends. – Poor quality sequence.
• Common sequences (phages, transposases, ABC transporters).• Low thresholds for similarity/length of alignment.
Sequencing the World of Possibilities for Energy & Environment
Gene prediction overview
• Gene prediction:Ge e p ed ct o– Evidence based/ab initio prediction:
• ORNL pipeline (Critica, GLIMMER).
– ab initio prediction:fgenesb (SoftBerry).
metagene (Nog chi et al NAR 2006)metagene (Noguchi et al., NAR 2006).
9/15/2008
18
Sequencing the World of Possibilities for Energy & Environment
Errors in gene prediction
• Errors usually at the end of reads/contigs.• Bad trimming. Vector sequences at the ends are “interpreted” as g q p
fragments of genes.• Poor quality sequence (frameshifts, N). Produces fragments on the
wrong strand.• Regions of chimeric assembly.
Sequencing the World of Possibilities for Energy & Environment
Binning overview
• Pattern discovery [domain - family] (PhyloPythia). • Sequence composition [family](kmer).
7 NNNNNNN 8 NN NN NN– 7mer NNNNNNN, 8mer NNxNNxNN.• Assignment based on blast hits [class] (BLAST distr).
– Databases with 130 and ~600 genomes.
Ideal scenario
stra
ins)
In reality
h as
m
ains
,es
)
isol
ates
(p
refe
rabl
y s
Bins
Bins
isol
ates
(g
roup
s su
cfa
mili
es, d
o mra
rely
spe
cie
9/15/2008
19
Sequencing the World of Possibilities for Energy & Environment
Binning comparison (8 Kb contigs)
Specificity (8 Kb)
0 9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
BLAS
T di
str (
v2) 1
- Ar
achn
e
BLAS
T di
str (
v2) 1
- JA
ZZ
BLAS
T di
str (
v2) 1
- Ph
rap
BLAS
T di
str (
v2) 2
- Ar
achn
e
BLAS
T di
str (
v2) 2
- JA
ZZ
BLAS
T di
str (
v2) 2
- Ph
rap
BLAS
T di
str 1
- Ar
achn
e
BLAS
T di
str 1
- JA
ZZ
BLAS
T di
str 1
- Ph
rap
BLAS
T di
str 2
- Ar
achn
e
BLAS
T di
str 2
- JA
ZZ
BLAS
T di
str 2
- Ph
rap
hylo
pyth
ia (p
:0.8
5) -
Arac
hne
n Ph
ylop
ythi
a (p
:0.8
5) -
JAZZ
n Ph
ylop
ythi
a (p
:0.8
5) -
Phra
p
hylo
pyth
ia (p
:0.8
5) -
Arac
hne
p Ph
ylop
ythi
a (p
:0.8
5) -
JAZZ
p Ph
ylop
ythi
a (p
:0.8
5) -
Phra
p
kmer
(8m
er) -
Ara
chne
kmer
(8m
er) -
JAZ
Z
kmer
(8m
er) -
Phr
ap
spec
ifici
tysimLCsimMC
simHC
Sensitivity (8 Kb)
0.8
0.9
1
B B
gen
Ph gen
gen
ssp
Ph ssp
ssp
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
BLAS
T di
str (
v2) 1
- Ar
achn
e
BLAS
T di
str (
v2) 1
- JA
ZZ
BLAS
T di
str (
v2) 1
- Ph
rap
BLAS
T di
str (
v2) 2
- Ar
achn
e
BLAS
T di
str (
v2) 2
- JA
ZZ
BLAS
T di
str (
v2) 2
- Ph
rap
BLAS
T di
str 1
- Ar
achn
e
BLAS
T di
str 1
- JA
ZZ
BLAS
T di
str 1
- Ph
rap
BLAS
T di
str 2
- Ar
achn
e
BLAS
T di
str 2
- JA
ZZ
BLAS
T di
str 2
- Ph
rap
gen
Phyl
opyt
hia
(p:0
.85)
- Ar
achn
e
gen
Phyl
opyt
hia
(p:0
.85)
- JA
ZZ
gen
Phyl
opyt
hia
(p:0
.85)
- Ph
rap
ssp
Phyl
opyt
hia
(p:0
.85)
- Ar
achn
e
ssp
Phyl
opyt
hia
(p:0
.85)
- JA
ZZ
ssp
Phyl
opyt
hia
(p:0
.85)
- Ph
rap
kmer
(8m
er) -
Ara
chne
kmer
(8m
er) -
JAZ
Z
kmer
(8m
er) -
Phr
ap
sens
itivi
ty
simLC
simMC
simHC
Sequencing the World of Possibilities for Energy & Environment
Binning results
• Strain sequences are typically assigned to multiple bins (unavoidable?).
• Individual bins typically contain more than one isolate (even different phyla).
• PhyloPythia is more consistent.
• Improvement of the protein databases increases the binning accuracy significantly.accuracy significantly.
• Methods still need improvement.
9/15/2008
20
Sequencing the World of Possibilities for Energy & Environment
http://fames.jgi-psf.org
•New methods are tested
•Users download the data.
•Apply their method.
•Download the results for comparison with the existing methods.
K. Mavrommatis, A. Greiner
Sequencing the World of Possibilities for Energy & Environment
Conclusions
• There is no “gold standard” method for metagenomic analysis.
E h t l/ bi ti f t l h diff t d t d• Each tool/ combination of tools has different advantages and disadvantages.
• Each dataset poses different challenges and requires different handling.
• Iterative process or use of multiple methods could be the key for better processing.
• Best combination so far: Arachne / fgenesB-genemark / PhyloPythiaBest combination … so far: Arachne / fgenesB genemark / PhyloPythia.