
27626 - Next Generation Sequencing Analysis

Preprocessing and SNP calling Natasja S. Ehlers, PhD student Center for Biological Sequence Analysis Functional Human Variation Group

Metagenomic assembly

Josef K Vogt

Slides by: Simon Rasmussen

2017


Metagenomics

• Investigate all organisms (and genomic elements)

• Known/unknown

• Culturable / non-culturable (a large fraction is non-culturable)

Bacteria

plasmids

phages

genomic islands


Metagenomic assembly ...is even harder than single genome assembly


Why?

Can you think of why?

Groups of 2-3, 3 mins

• Shared/similar regions between genomes work as repeats

• Different abundances

• High diversity, very large meta-genomes

• Which contigs belong to which organism?

Figure: from sample to assembly.

• Single genome: bacteria → genomic DNA → fragmented DNA → digital DNA fragments → assembly

• Metagenome: env. sample → metagenomic DNA → fragmented DNA → digital DNA fragments → assembly

• Our view: env. sample → fragmented DNA → digital DNA fragments → DNA → assembly


Example with k-mers

Figure: histograms of k-mer counts (x-axis: counts of a k-mer; y-axis: no. of k-mers), one panel for a single genome and one for a metagenome.
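The histogram idea can be tried on toy data. This is a minimal sketch, not the tool used to make the plots: the reads, the value of k and the function names are made up for illustration. A sequence sampled at high depth and one sampled at low depth give two separate peaks in the histogram.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def count_histogram(counts):
    """Map 'count of a k-mer' to 'number of distinct k-mers with that count'."""
    return Counter(counts.values())

# Hypothetical toy data: one sequence sampled 4 times, another sampled once.
abundant = "ACGTTGCA"
rare = "GGATCCTA"
reads = [abundant] * 4 + [rare]

hist = count_histogram(kmer_counts(reads, k=4))
for count in sorted(hist):
    print(count, hist[count])
```

With real data the same two functions would be applied to all reads in a sample; dedicated k-mer counters are much faster for that.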


Which assembler then?

• Assemblers: most people use standard assemblers (e.g. SOAPdenovo / Newbler)

• A couple of metagenomic assemblers:

• MEGAHIT, IDBA-UD (Illumina)

• MetaVelvet (Illumina, SOLiD, 454)


Meta-approach

• Differences in coverage between genomes

• Use of mate-pair / paired-end reads to disentangle the graph

• After assembly: Sequence composition binning

• Re-assembly: Decompose the data into smaller genome bins
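One possible way to look at sequence composition is a tetranucleotide frequency profile per contig. This is only a sketch under that assumption: the slides do not say which composition measure is used, and the function names are hypothetical. Contigs from the same genome are expected to have similar profiles, so a distance between profiles can feed a clustering step.

```python
from itertools import product

def tetranucleotide_profile(seq):
    """Return relative frequencies of all 256 tetranucleotides in seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in kmers]

def profile_distance(a, b):
    """Euclidean distance between two composition profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical contigs, far too short for real binning.
p1 = tetranucleotide_profile("ACGTACGTACGTACGT")
p2 = tetranucleotide_profile("GGGCCCGGGCCCGGGC")
print(profile_distance(p1, p1), profile_distance(p1, p2))
```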


MetaVelvet/IDBA

[2] Detection of multiple peaks on k-mer frequencies:

2. Calculate the empirical distribution of ‘length-weighted frequencies’ of node coverages, where a node coverage is assigned to each node by Velvet on the construction of the de Bruijn graph (Figure 5).

3. Approximate the empirical distribution by a mixture of Poisson distributions and detect multiple peaks in the Poisson mixture. Then, the highest peak of expected coverage is chosen as the ‘primary expected coverage’, and the next highest is chosen as the ‘secondary expected coverage’.

4. Classify every node into one distribution of the Poisson mixture by calculating its posterior probability for the node coverage value.

[3] Decomposition of the de Bruijn graph:

5. (Decomposition by connectivity) Decompose the initial de Bruijn graph into connected subgraphs.

6. (Decomposition by coverage value) If the coverage of a node belongs to the primary expected coverage, the node is classified as a ‘primary node’. Subsequently, the primary nodes are labeled as ‘ON’ and the other nodes are labeled as ‘OFF’. Then, a chimeric node is detected as having two incoming edges whose origin nodes are labeled ON and OFF, and two outgoing edges whose destination nodes are labeled ON and OFF, and having a coverage value mostly equal (within 5% difference by default) to the average between the sum of the coverage values of the two origin nodes and the sum of the two destination nodes. Second, check the consistency of the ON and OFF labeling for the two origin nodes and two destination nodes using paired-end information. If the consistency is satisfied, resolve every chimeric node by separating the node into two nodes with only one incoming edge and one outgoing edge, whose origin and destination nodes have the same label, as shown in Figure 4. After separating the chimeric nodes, further decompose the resulting graph into connected subgraphs.

7. If a connected subgraph consists of more than x% (a predefined parameter, the default is set to 100%) of nodes labeled ‘ON’, the subgraph is unmasked. All other subgraphs are masked.

[4] Assembly of contigs and scaffolds:

8. Apply the Velvet functions to the unmasked subgraphs to build contigs and then apply Pebble and Rock Band functions to build scaffolds.

9. Remove the unmasked subgraphs and recursively apply Step 2–8 to the remaining de Bruijn graph until no node remains.

It might be thought that in Substep 3 above, a chimeric node could have the highest expected coverage. However, the contigs of chimeric nodes are very short compared with the unique nodes; therefore, the length-weighted frequencies of coverage values for the chimeric nodes do not form any significant peaks.
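The classification in step 4 can be sketched as a posterior calculation under a two-component Poisson mixture. This is an illustration, not MetaVelvet's code: the component means (60 and 30) are taken from the Figure 4 example, the equal mixture weights and the rounding of coverage to an integer are assumptions, and the fitting of the mixture itself (step 3) is not shown.

```python
import math

def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing count k with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def classify_coverage(coverage, means, weights):
    """Return (index of the most probable component, posterior probabilities)."""
    k = round(coverage)  # assumption: treat node coverage as an integer count
    log_joint = [math.log(w) + poisson_log_pmf(k, m)
                 for m, w in zip(means, weights)]
    top = max(log_joint)
    unnorm = [math.exp(v - top) for v in log_joint]  # subtract max for stability
    total = sum(unnorm)
    posterior = [u / total for u in unnorm]
    return posterior.index(max(posterior)), posterior

means = [60.0, 30.0]    # primary and secondary expected coverage (Figure 4)
weights = [0.5, 0.5]    # assumed equal mixture weights
labels = {cov: ("ON" if classify_coverage(cov, means, weights)[0] == 0 else "OFF")
          for cov in (28.5, 59.6, 31.3, 62.7)}
print(labels)
```

Nodes assigned to the primary component get the label ON and all others OFF, which is the labeling used in step 6.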

EXPERIMENTAL RESULTS

The performance of the MetaVelvet assembler was tested on simulated datasets and on real metagenome datasets obtained from human gut microbiome. The method was compared with the naive use of two single-genome assemblers, Velvet (15) and SOAPdenovo (22), and the recently proposed metagenome assembler Meta-IDBA (6). Furthermore, for the simulated datasets, we compared our results with those of a single-genome assembly from pure sequence reads of each single-isolate genome. We compared the following standard statistical measures to evaluate the performance of the assemblers for short read assembly and metagenome assembly: the number of scaffolds, the total length of scaffolds and N50, where N50 indicates the scaffold length such that 50% of the de novo assembled sequences lie in scaffolds of this size or larger. The precise definition of N50 is as follows. Let |A| denote the length of a sequence (contig, scaffold or genome) A. Let S1, S2, . . . , Sn denote the list of scaffolds in descending order of length as output by an assembler. Let L denote the total length of all scaffolds,
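The excerpt stops before the precise definition is complete, so the sketch below follows only the one-sentence description: walk through the scaffolds from longest to shortest and report the length at which the running total reaches half of the total length. The function name and the example lengths are made up.

```python
def n50(lengths):
    """Length of the scaffold at which the cumulative length, counted from
    the longest scaffold downwards, first reaches 50% of the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no scaffold lengths given")

# Hypothetical scaffold lengths: total 290, half 145, reached at the second scaffold.
example = [80, 70, 50, 40, 30, 20]
print(n50(example))
```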

Figure 4. An example of a chimeric node and its resolution by separating the node. The node is chimeric because [(28.5 + 59.6) + (31.3 + 62.7)]/2 ≈ 90.8. (Diagram: a chimeric node with coverage 90.8 whose neighbouring nodes have coverages 28.5, 59.6, 31.3 and 62.7, labeled OFF, ON, OFF and ON; expected coverage: primary 60, secondary 30.)
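The coverage part of the chimeric-node test can be checked with the numbers from Figure 4. This is a sketch of the arithmetic only: the function name is hypothetical, measuring the 5% tolerance relative to the expected value is an assumption, and the ON/OFF label and paired-end consistency checks are left out.

```python
def is_chimeric(node_cov, origin_covs, dest_covs, tolerance=0.05):
    """True if node_cov is within `tolerance` (relative) of the average of
    the summed origin coverages and the summed destination coverages."""
    expected = (sum(origin_covs) + sum(dest_covs)) / 2
    return abs(node_cov - expected) <= tolerance * expected

# Values from Figure 4: (28.5 + 59.6 + 31.3 + 62.7) / 2 = 91.05, close to 90.8.
origins = (28.5, 59.6)
destinations = (31.3, 62.7)
print(is_chimeric(90.8, origins, destinations))
print(is_chimeric(60.0, origins, destinations))
```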

Figure 5. Detection of multiple peaks in the histogram of coverage values of the nodes. (Plot: x-axis node coverages with ticks at 10, 30 and 60; y-axis length-weighted frequency; peaks labeled Primary and Secondary.)

PAGE 5 OF 12 Nucleic Acids Research, 2012, Vol. 40, No. 20 e155


1. Perform standard assembly (Velvet)

2. Identify coverage peaks


3. Decompose into subgraphs

4. Assemble the subgraphs into contigs and scaffolds
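Decomposing a graph into connected subgraphs can be sketched with a breadth-first search. This is an illustration, not Velvet or MetaVelvet code: the node names are hypothetical, and edges are treated as undirected for the purpose of connectivity, which is an assumption.

```python
from collections import deque

def connected_subgraphs(edges, nodes=()):
    """Return the connected components of a graph as sorted lists of nodes."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components

# Hypothetical graph: nodes 1-3 form one subgraph, 4-5 another, 6 is isolated.
parts = connected_subgraphs([(1, 2), (2, 3), (4, 5)], nodes=[6])
print(parts)
```

Each component can then be assembled on its own, which is what step 4 does with the unmasked subgraphs.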


Count matrix

• Even with metagenome assemblers it is very hard!

• If we can't assemble all the genomes, we can get genes

• Predict genes -> map reads back and create a “count matrix”

• These genes can be annotated and used for analysis

          Sample A   Sample B
Gene 1    10         9
Gene 2    20         20
Gene 3    10         20
Gene 4    5          0
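A count matrix like the one above can be built from a list of read assignments. This is a minimal sketch: it assumes the mapping step has already produced one (sample, gene) pair per mapped read, and the function names are hypothetical. Dividing by the per-sample total is one simple normalisation; real analyses usually also correct for gene length.

```python
from collections import Counter

def build_count_matrix(assignments):
    """assignments: iterable of (sample, gene) pairs, one per mapped read.
    Returns the sorted sample names and a dict gene -> counts per sample."""
    counts = Counter(assignments)
    samples = sorted({sample for sample, _ in counts})
    genes = sorted({gene for _, gene in counts})
    matrix = {g: [counts[(s, g)] for s in samples] for g in genes}
    return samples, matrix

def relative_abundance(matrix):
    """Divide every count by the total of its sample (column)."""
    if not matrix:
        return {}
    n_samples = len(next(iter(matrix.values())))
    totals = [sum(row[j] for row in matrix.values()) for j in range(n_samples)]
    return {g: [c / t if t else 0.0 for c, t in zip(row, totals)]
            for g, row in matrix.items()}

# The table from the slide, entered directly as a matrix.
slide_matrix = {"Gene 1": [10, 9], "Gene 2": [20, 20],
                "Gene 3": [10, 20], "Gene 4": [5, 0]}
rel = relative_abundance(slide_matrix)

# Hypothetical read assignments for a tiny example.
samples, small = build_count_matrix(
    [("A", "g1"), ("A", "g1"), ("B", "g1"), ("A", "g2")])
print(samples, small)
```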

Species abundance

Functional abundance

Re-assembly

Analogous to RNA-seq

Gene co-abundance

geneA geneB

The genes travel together - similar abundance!

Genome
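Co-abundance can be measured as the correlation between the abundance profiles of two genes across samples. This is a sketch under assumptions: Pearson correlation and the 0.9 threshold are illustrative choices, and the profiles and function names are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-abundance signal
    return cov / (sd_x * sd_y)

def co_abundant(profiles, seed, threshold=0.9):
    """Genes whose profile correlates with the seed gene at or above threshold."""
    return [gene for gene, profile in profiles.items()
            if gene != seed and pearson(profiles[seed], profile) >= threshold]

# Hypothetical abundances over four samples: geneA and geneB travel together.
profiles = {
    "geneA": [1, 5, 0, 8],
    "geneB": [2, 10, 0, 16],
    "geneC": [9, 1, 7, 0],
}
print(co_abundant(profiles, "geneA"))
```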

The setup

Gene abundance matrix (depth): 3.9M genes × 396 samples

• 396 human stool samples (MetaHIT)

• Very complex community

• Initial assembly, gene calling, abundance

Abundance for one gene

Figure: abundance (y-axis) across samples 1 to 396 (x-axis).

Abundance profile for 4,358 genes

Figure: abundance (y-axis) across samples 1 to 396 (x-axis).

MetaGenomic Species (MGS)

MetaGenomic Species

Bacteriophages, plasmids, CRISPR

~bacterial genome size

741

> 7000
