38
High-throughput sequencing Genome assembly problem Contig assembly Gap filling End Towards More Effective Formulations of the Genome Assembly Problem Alexandru Tomescu Department of Computer Science University of Helsinki, Finland DACS June 26, 2015 1 / 25

Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

Towards More Effective Formulations of the GenomeAssembly Problem

Alexandru TomescuDepartment of Computer ScienceUniversity of Helsinki, Finland

DACSJune 26, 2015

1 / 25

Page 2: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

2 / 25

Page 3: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CENTRAL DOGMA OF BIOLOGY

DNA

transcription

binding

promoterenhancer

silencer

protein-protein

interaction

gene 1

pre-mRNA

mature mRNA transcripts

intron exon

alternative splicing

translation

proteins

TFBS

gene 2

...

Image taken fromGenome-Scale Algorithm Design, Cambridge University Press, 2015

3 / 25

Page 4: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

SEQUENCING ATLAS

DNA

gene

mature mRNA transcripts

proteins

RNA sequencing

ChIP sequencing

methylation

Bisulfite sequencing

binding

primer primer

Targeted resequencing

De novo sequencing / Whole genome resequencing

TFBS

Image taken fromGenome-Scale Algorithm Design, Cambridge University Press, 2015

4 / 25

Page 5: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

HIGH-THROUGHPUT SEQUENCING

)

amplification

)

break apart

)

size selection

)sequencing

length ~450 100 100~250

paired-end reads

DNA

We assume here that DNA is a single stranded, single chromosome

5 / 25

Page 6: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

HIGH-THROUGHPUT SEQUENCING

)

amplification

)

break apart

)

size selection

)sequencing

length ~450 100 100~250

paired-end reads

DNA

We assume here that DNA is a single stranded, single chromosome

5 / 25

Page 7: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

HIGH-THROUGHPUT SEQUENCING

)

amplification

)

break apart

)

size selection

)sequencing

length ~450 100 100~250

paired-end reads

DNA

We assume here that DNA is a single stranded, single chromosome

5 / 25

Page 8: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

ILLUMINA HISEQX

6 / 25

Page 9: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

GENOME ASSEMBLY PROBLEME.coli 4.6 · 106

Human 3.2 · 109

Spurce 25 · 109

INPUT: A collection of paired-end readsOUTPUT: The genome

Initial formulations:I Shortest superstring problem (NP-hard)I Build a graph with reads as nodes, and significant overlaps between

reads as directed edges:I Find a walk that passes through every node exactly once (NP-complete)I Find a walk that passes through every node at least once

Unrealistic:I Longer repeated regions are collapsedI Genome coverage is not uniformI We cannot choose between multiple solutions

7 / 25

Page 10: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

GENOME ASSEMBLY PROBLEME.coli 4.6 · 106

Human 3.2 · 109

Spurce 25 · 109

INPUT: A collection of paired-end readsOUTPUT: The genome

Initial formulations:I Shortest superstring problem (NP-hard)I Build a graph with reads as nodes, and significant overlaps between

reads as directed edges:I Find a walk that passes through every node exactly once (NP-complete)I Find a walk that passes through every node at least once

Unrealistic:I Longer repeated regions are collapsedI Genome coverage is not uniformI We cannot choose between multiple solutions

7 / 25

Page 11: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

GENOME ASSEMBLY PROBLEME.coli 4.6 · 106

Human 3.2 · 109

Spurce 25 · 109

INPUT: A collection of paired-end readsOUTPUT: The genome

Initial formulations:I Shortest superstring problem (NP-hard)I Build a graph with reads as nodes, and significant overlaps between

reads as directed edges:I Find a walk that passes through every node exactly once (NP-complete)I Find a walk that passes through every node at least once

Unrealistic:I Longer repeated regions are collapsedI Genome coverage is not uniformI We cannot choose between multiple solutions

7 / 25

Page 12: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PRACTICAL FORMULATIONS / PIPELINE

1. Contig assembly: assemble the reads into strings (contigs) that areguaranteed to occur in the genome

GTACGATAACGTACG

GATATCTACTAGTACCC

CTAATTCGA

ACGTACGATATCTAcontig:

2. Scaffolding: using paired-end reads, chain the contigs into scaffolds thatare guaranteed to occur in the genome

8 / 25

Page 13: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PRACTICAL FORMULATIONS / PIPELINE

1. Contig assembly: assemble the reads into strings (contigs) that areguaranteed to occur in the genome

GTACGATAACGTACG

GATATCTACTAGTACCC

CTAATTCGA

ACGTACGATATCTAcontig:

2. Scaffolding: using paired-end reads, chain the contigs into scaffolds thatare guaranteed to occur in the genome

8 / 25

Page 14: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PRACTICAL FORMULATIONS / PIPELINE (2)

3. Gap filling: fill the gaps in the scaffolds

Tens of genome assembly programs available: ABySS, Velvet, Allpaths-LG,Bambus2, MSR-CA, SGA, Cortex, SOAPdenovo, Opera-LG, SPADES, ...

9 / 25

Page 15: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

DE BRUIJN GRAPHS

DEFINITION

Given a set R of strings, the de Bruijn graph of order k of R is the directedgraph DBk(R) with

I node set: the set of k-mers of RI edge set: the set of k + 1-mers of the strings of R

Also edges occur in the strings of R!

ATGCGTGGCAATGCG

TGGCACGTG AT TG GC

CGGT

GG

CA

k = 2

10 / 25

Page 16: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

DE BRUIJN GRAPHS

DEFINITION

Given a set R of strings, the de Bruijn graph of order k of R is the directedgraph DBk(R) with

I node set: the set of k-mers of RI edge set: the set of k + 1-mers of the strings of R

Also edges occur in the strings of R!

ATGCGTGGCAATGCG

TGGCACGTG AT TG GC

CGGT

GG

CA

k = 2

10 / 25

Page 17: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY

joint work with Paul Medvedev

11 / 25

Page 18: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY

I No previous formal definition of contigI Usually, contigs are maximal, unary paths (i.e., whose internal nodes

have in-degree and out-degree 1, aka unitigs)

v0 v1 v2 vk vk+1 vt vt+1

Given a dBG G:I a genomic walk of G is a circular edge-covering walk of GI a walk is safe if it is a sub-walk of all genomic walks of G

We now assume that the dBG admits a genomic walk (i.e., is stronglyconnected) and is not a single cycle.

12 / 25

Page 19: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY

I No previous formal definition of contigI Usually, contigs are maximal, unary paths (i.e., whose internal nodes

have in-degree and out-degree 1, aka unitigs)

v0 v1 v2 vk vk+1 vt vt+1

Given a dBG G:I a genomic walk of G is a circular edge-covering walk of GI a walk is safe if it is a sub-walk of all genomic walks of G

We now assume that the dBG admits a genomic walk (i.e., is stronglyconnected) and is not a single cycle.

12 / 25

Page 20: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY

I No previous formal definition of contigI Usually, contigs are maximal, unary paths (i.e., whose internal nodes

have in-degree and out-degree 1, aka unitigs)

v0 v1 v2 vk vk+1 vt vt+1

Given a dBG G:I a genomic walk of G is a circular edge-covering walk of GI a walk is safe if it is a sub-walk of all genomic walks of G

We now assume that the dBG admits a genomic walk (i.e., is stronglyconnected) and is not a single cycle.

12 / 25

Page 21: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY (2)

We say that a contig assembly algorithm isI sound: if every output walk is safeI complete: if every safe walk is in the output

The unitig algorithm is:I outputting all maximal unitigsI soundI not complete

Is there a sound and completecontig assembly algorithm?

13 / 25

Page 22: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY (2)

We say that a contig assembly algorithm isI sound: if every output walk is safeI complete: if every safe walk is in the output

The unitig algorithm is:I outputting all maximal unitigsI soundI not complete

Is there a sound and completecontig assembly algorithm?

13 / 25

Page 23: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY (2)

We say that a contig assembly algorithm isI sound: if every output walk is safeI complete: if every safe walk is in the output

The unitig algorithm is:I outputting all maximal unitigsI soundI not complete

Is there a sound and completecontig assembly algorithm?

13 / 25

Page 24: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

NON-SWITCHING CONTIGS

v0 v1 v2 vk vk+1 vt vt+1

I A path with all out-branching nodes before all in-branching nodes

I Related to transformation-based algorithms ofI Kingsford, Schatz, Pop 2010I Jackson 2009I Medvedev, Georgiou, Myers, Brudno 2007

THEOREM

There is an O(|G|+ |output|)-time algorithm to output all maximal non-switchingcontigs of G.

THEOREM

The non-switching contig assembly algorithm is sound but not complete.

14 / 25

Page 25: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

NON-SWITCHING CONTIGS

v0 v1 v2 vk vk+1 vt vt+1

I A path with all out-branching nodes before all in-branching nodesI Related to transformation-based algorithms of

I Kingsford, Schatz, Pop 2010I Jackson 2009I Medvedev, Georgiou, Myers, Brudno 2007

THEOREM

There is an O(|G|+ |output|)-time algorithm to output all maximal non-switchingcontigs of G.

THEOREM

The non-switching contig assembly algorithm is sound but not complete.

14 / 25

Page 26: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

NON-SWITCHING CONTIGS

v0 v1 v2 vk vk+1 vt vt+1

I A path with all out-branching nodes before all in-branching nodesI Related to transformation-based algorithms of

I Kingsford, Schatz, Pop 2010I Jackson 2009I Medvedev, Georgiou, Myers, Brudno 2007

THEOREM

There is an O(|G|+ |output|)-time algorithm to output all maximal non-switchingcontigs of G.

THEOREM

The non-switching contig assembly algorithm is sound but not complete.

14 / 25

Page 27: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

OmniTIGS

v0 vi vj vt+1e0 ei−1 ei ej−1 ej et

We say that a walk w = (v0, e0, v1, e1, . . . , vt, et, vt+1) is an omnitig if for all1 ≤ i ≤ j ≤ t, there is no proper vj-vi path with first edge different from ej,and last edge different from ei−1.

THEOREM

A walk w is safe⇔ w is an omnitig.

THEOREM

There is a polynomial time algorithm for outputting all maximal omnitigs.

COROLLARY

The omnitig algorithm is sound and complete.

15 / 25

Page 28: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

OmniTIGS

v0 vi vj vt+1e0 ei−1 ei ej−1 ej et

We say that a walk w = (v0, e0, v1, e1, . . . , vt, et, vt+1) is an omnitig if for all1 ≤ i ≤ j ≤ t, there is no proper vj-vi path with first edge different from ej,and last edge different from ei−1.

THEOREM

A walk w is safe⇔ w is an omnitig.

THEOREM

There is a polynomial time algorithm for outputting all maximal omnitigs.

COROLLARY

The omnitig algorithm is sound and complete.

15 / 25

Page 29: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

EXPERIMENTAL RESULTS

Algorithm #strings E-size AVG lengthunitig 41,524 7,806 893non-switching contigs 32,589 7,822 1,136omnitigs 24,949 7,850 1,479

I genome: circularized human chr21 (length 48 · 106)I graph: dBGk(chr21) for k = 55I e-size: given a set of substrings of genome, their e-size is the average,

over all genomic positions i, of the mean length of the strings spanningposition i

16 / 25

Page 30: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

GAP FILLING

joint work with Leena Salmela, Kristoffer Sahlin and Veli Makinen

17 / 25

Page 31: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PROBLEM FORMULATIONPrevious formulations (GapCloser 2012, GapFiller 2012):

s t

We formulate it as Exact Path Length problem. Given:I G = dBGk(R), for some kI s, t ∈ V(G), the two k-mers flanking the gapI [d′..d] an estimate on the gap length

For every x ∈ [d′..d] find an s-t path spelling a string of length x (i.e. a path oflength x− k).

s

ts

t

18 / 25

Page 32: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PROBLEM FORMULATIONPrevious formulations (GapCloser 2012, GapFiller 2012):

s t

We formulate it as Exact Path Length problem. Given:I G = dBGk(R), for some kI s, t ∈ V(G), the two k-mers flanking the gapI [d′..d] an estimate on the gap length

For every x ∈ [d′..d] find an s-t path spelling a string of length x (i.e. a path oflength x− k).

s

ts

t

18 / 25

Page 33: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

DYNAMIC PROGRAMMING (DP)

Usually, d < |V(G)|.

Can be solved by DP in time O(d · |E(G)|):I for every node v store:

a(v, i) :=

{1 if there exists an s-v path of length i,0 otherwise.

I initialize a(s, 0) = 1, and compute

a(v, i) :=∨

u∈N−(v)

a(u, i− 1).s

ts

t

vu a(v,i)a(u,i-1)

I back-tracking in the DP matrix from a(t, x− k) gives a path (if exists)

19 / 25

Page 34: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

ENGINEERED IMPLEMENTATION

I k-mers flanking the gaps can have errorsI allow paths to start/end at up to t flanking k-mers

I we should not explore the entire graph!I meet in the middle optimization

I DP matrix rows are sparseI store only the non-zero entries in each row

I parallelization on scaffold level

I limit the memory usage of the DP matrixI abandon the search on a gap if limit exceeded

20 / 25

Page 35: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

EXPERIMENTAL RESULTS (Gap2Seq)

RemainingGap Length

ErroneousLength

0

1 · 105

2 · 105

3 · 105

4 · 105OriginalGapCloserGapFiller-bowtieGapFiller-bwaGap2Seq

I genome: Staphylococcus aureus (length 2.8 · 106)I graph: dBG with k = 31, for R = a collection of real readsI results are totals over all scaffolds from a benchmark dataset (ABySS,

Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet)

21 / 25

Page 36: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

EXPERIMENTAL RESULTS (Gap2Seq)

Runtime (min)0

20

40

60

80

100

120

Peak Memory (GB)0

0.2

0.4

0.6

0.8GapCloserGapFiller-bowtieGapFiller-bwaGap2Seq

22 / 25

Page 37: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONCLUSIONS

Potential uses for omnitigs:I longer contigs = better starting point for scaffolding and gap fillingI more flanking information around loci of interest

Directions for omnitigs:I robustness to errors, coverage gaps, reverse complements?I faster omnitig algorithmI a sound and complete algorithm when the genomic walk is linear?

————————

Gap Filling:

I performance on human data is ‘so and so’ (but all tools have problems)I we need to improve the runtime and memory usageI we need a way to choose between multiple solutions (contig assembly?)

23 / 25

Page 38: Towards More Effective Formulations of the Genome Assembly ... · High-throughput sequencing Genome assembly problem Contig assembly Gap filling End GENOME ASSEMBLY PROBLEM E.coli

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

MULTUMESC

24 / 25