De-Novo Genome Assembly and its Current State

Anne-Katrin Emde April 17, 2013

Freie Universität Berlin, Algorithmische Bioinformatik
Max Planck Institut für Molekulare Genetik, Computational Molecular Biology

What is genome assembly and why is it hard?

1) Original genome sequence (in multiple copies)
2) Sequence short fragments (reads)
3) Assembler: reconstruct the original sequence from the reads

Difficulties:
- Genomes are very long and repeat-rich
- Reads are very short and may contain errors and biases

Assembly algorithms - introduction

Assemblers use overlaps between reads to first produce contigs, which are then used to build scaffolds.

Read pairs contribute long-range linking information, especially in the scaffolding phase.

[Figure: contigs joined by gaps of unknown bases (NNNN...NNNN) form a scaffold]

Two paradigms:

1) Overlap – Layout – Consensus: focus is on sequence reads

2) De Bruijn graph: focus is on k-tuples occurring in sequence reads

NGS algorithms, March 20 2012

Algorithms – Overlap Graph

Reads: ACGTAATT GTAATTCA ATTCAGTC GTCCATGT CATGTTGA TGTTGACT

Contig: ACGTAATTCAGTCCATGTTGACT

(Kececioglu and Myers, 1995)

1) Overlap phase: pairwise overlap alignments

2) Layout phase: overlap graph construction and finding relative placement of reads

3) Consensus phase: Produce multiple read alignment and compute contig consensus sequences

Algorithms – Overlap Graph

1) Overlap phase: pairwise overlap alignments for all pairs of reads

Computationally expensive!
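A minimal sketch of the overlap phase on the six example reads. It assumes exact suffix-prefix matching instead of true overlap alignments. The names `overlap_length`, `all_overlaps` and the cutoff `min_len` are illustrative and not taken from any assembler.

```python
from itertools import permutations

def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Exact matching only; returns 0 if no overlap of at least `min_len` exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads, i.e. a quadratic number of pairs."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        n = overlap_length(reads[i], reads[j], min_len)
        if n:
            overlaps[(i, j)] = n
    return overlaps

reads = ["ACGTAATT", "GTAATTCA", "ATTCAGTC", "GTCCATGT", "CATGTTGA", "TGTTGACT"]
overlaps = all_overlaps(reads)
print(overlaps[(4, 5)])  # overlap between r5 and r6: prints 6
```

The double loop over all read pairs is what makes this phase expensive for millions of reads.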

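2) Layout phase: overlap graph with nodes = reads and edges = overlaps

[Figure: overlap graph; for example, the edge between r5 and r6 is labeled -6]

A path through the graph gives the read layout.

In theory this is a Hamiltonian path problem (NP-complete); in practice heuristics are used.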
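One common heuristic is greedy merging: repeatedly join the two sequences with the longest overlap. The sketch below assumes exact suffix-prefix overlaps and error-free reads. `greedy_layout` and `min_len` are illustrative names, and a greedy strategy can misassemble repeats.

```python
def overlap_length(a, b, min_len=3):
    """Longest exact suffix-of-a / prefix-of-b match of at least `min_len`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_layout(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    seqs = list(reads)
    while len(seqs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap_length(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge: several contigs remain
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)
    return seqs

reads = ["ACGTAATT", "GTAATTCA", "ATTCAGTC", "GTCCATGT", "CATGTTGA", "TGTTGACT"]
contigs = greedy_layout(reads)
print(contigs)  # ['ACGTAATTCAGTCCATGTTGACT']
```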


3) Consensus phase: multiple read alignment

r1      ACGTAATT
r2        GTAATTCA
r3           ATTCAGTC
r4                GTCCATGT
r5                   CATGTTGA
r6                     TGTTGACT

contig  ACGTAATTCAGTCCATGTTGACT

Consensus phase with errors in the reads: multiple read alignment

r1      ACGTAATT
r2        GTAATTCA
r3           AT-CAGTC
r4                GTCCATGT
r5                   CATATTGA
r6                     TGTTGACT

contig  ACGTAATTCAGTCCATGTTGACT

Robust to alignment errors
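A sketch of why the consensus tolerates such errors: a majority vote per alignment column. It assumes the layout phase already supplied each read's start offset and gap placement. The `layout` structure and the function name `consensus` are illustrative.

```python
from collections import Counter

def consensus(layout):
    """Majority vote per column; `layout` holds (offset, aligned_read) pairs.

    '-' marks a gap in a read. Columns where the gap wins are dropped,
    uncovered columns are reported as 'N'.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    out = []
    for column in columns:
        if not column:
            out.append("N")
            continue
        base, _ = column.most_common(1)[0]
        if base != "-":
            out.append(base)
    return "".join(out)

# Offsets follow the layout above; r3 has a deletion, r5 a substitution.
layout = [
    (0, "ACGTAATT"), (2, "GTAATTCA"), (5, "AT-CAGTC"),
    (10, "GTCCATGT"), (13, "CATATTGA"), (15, "TGTTGACT"),
]
contig = consensus(layout)
print(contig)  # ACGTAATTCAGTCCATGTTGACT
```

Both errors are outvoted because each affected column is covered by two correct reads.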


Overlap Graph

There are not only simple paths…

[Figure: genome with segments A, B, C and two copies of a repeat R, and the approximate ordering in the overlap graph]

Signs that R is a repeat:
- Coverage is high
- Path branches


Overlap Graph

Now: R1 and R2 are nearly identical

[Figure: genome with segments A, B, C and repeat copies R1 and R2; reads from the two copies differ in one column (A vs. T)]

Overlap strictness: tradeoff between error tolerance and “natural” repeat separation


Algorithms - de Bruijn graph (Euler assembler)

- No overlap phase, no consensus phase, basically just a layout phase
- Nodes = k-mers, edges = (k+1)-mers

Given k = 4 and three read sequences: TACGTAAT CGTAATTC GTAATTCA

Nodes: TACG ACGT CGTA GTAA TAAT AATT ATTC TTCA

contig: TACGTAATTCA

(de Bruijn, 1946; Pevzner, 2001; Medvedev 2007)

In theory this is the "de Bruijn Superwalk Problem" (NP-hard); in practice heuristics are used.
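A minimal sketch of the construction for the k = 4 example, with nodes as k-mers and edges as (k+1)-mers. The function names are illustrative. The walk assumes the graph is one unbranched path, which holds here but not in general.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are the k-mers of the reads, edges their (k+1)-mers."""
    nodes = set()
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
        for i in range(len(read) - k):
            edge = read[i:i + k + 1]
            edges[edge[:k]].add(edge[1:])
    return nodes, edges

def walk_unique_path(nodes, edges):
    """Spell the contig of a graph that is a single unbranched path."""
    indegree = {node: 0 for node in nodes}
    for source in edges:
        for target in edges[source]:
            indegree[target] += 1
    starts = [node for node in nodes if indegree[node] == 0]
    if len(starts) != 1:
        raise ValueError("graph is not a single simple path")
    node = starts[0]
    contig, visited = node, {node}
    while len(edges.get(node, ())) == 1:
        (node,) = edges[node]
        if node in visited:
            break  # cycle: stop instead of looping forever
        visited.add(node)
        contig += node[-1]
    return contig

reads = ["CGTAATTC", "GTAATTCA", "TACGTAAT"]
nodes, edges = de_bruijn(reads, k=4)
contig = walk_unique_path(nodes, edges)
print(len(nodes), contig)  # 8 TACGTAATTCA
```

No pairwise read comparison is needed: reads that share a k-mer meet at the same node automatically.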

de Bruijn graph

Again, there are not only simple paths…

Genome: ...GACGTACGTCA...
Reads: GACGTACG CGTACGTC GTACGTCA

Repetitive k-mers introduce cycles (here the 4-mer ACGT occurs twice)

GACGTCA is a read-incoherent path

de Bruijn graph

What if we increase k to 5?

Genome: ...GACGTACGTCA...
Reads: GACGTACG CGTACGTC GTACGTCA

Nodes: ACGTA CGTAC GTACG TACGT ACGTC CGTCA

Back to a linear graph structure

Increasing k leads to better repeat resolution
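The effect of k can be checked directly on the example sequence GACGTACGTCA: a k-mer that occurs more than once becomes a shared node and creates a cycle. The helper name `repeated_kmers` is illustrative.

```python
from collections import Counter

def repeated_kmers(sequence, k):
    """Return the k-mers that occur more than once in `sequence`."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}

genome = "GACGTACGTCA"
print(repeated_kmers(genome, 4))  # {'ACGT': 2}
print(repeated_kmers(genome, 5))  # {}
```

With k = 5 every k-mer is unique, so the graph is a simple path again.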

de Bruijn graph

What if we have sequencing errors?

Reads: CGTACGTC GTACGGCA

Nodes: GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA

Additional nodes: TACGG ACGGC CGGCA
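A short sketch that reproduces the additional nodes. It assumes the error-free read set is the one from the previous example (GACGTACG, CGTACGTC, GTACGTCA). The helper names are illustrative.

```python
def kmers(read, k):
    """Set of all k-mers of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def extra_nodes(correct_reads, noisy_read, k):
    """k-mers of `noisy_read` that no correct read contains."""
    known = set().union(*(kmers(read, k) for read in correct_reads))
    return kmers(noisy_read, k) - known

correct_reads = ["GACGTACG", "CGTACGTC", "GTACGTCA"]
new_nodes = extra_nodes(correct_reads, "GTACGGCA", k=5)
print(sorted(new_nodes))  # ['ACGGC', 'CGGCA', 'TACGG']
```

A single wrong base can affect up to k k-mers, namely all those that span the erroneous position.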

A quick example

One read: GTCGAGG

GTCG (1x), TCGA (1x), CGAG (1x), GAGG (1x)

A quick example

All the others…

AGAT (8x), ATCC (7x), TCCG (7x), CCGA (7x), CGAT (6x), GATG (5x)
ATGA (8x), TGAG (9x), GATC (8x), GATT (1x), TAGT (3x), AGTC (7x)
GTCG (9x), TCGA (10x), GGCT (11x), TAGA (16x), AGAG (9x), GAGA (12x)
GACA (8x), ACAG (5x), GCTT (8x), GCTC (2x), CTTT (8x), CTCT (1x)
TTTA (8x), TCTA (2x), TTAG (12x), CTAG (2x), AGAC (9x), AGAA (1x)
CGAG (8x), CGAC (1x), GAGG (16x), GACG (1x), AGGC (16x), ACGC (1x)


A quick example

After simplification…

Nodes: TAGTCGA, AGAGA, TAGA, GCTTTAG, GCTCTAG, AGACAG, CGACGC, GAGGCT, GATCCGATGAG

Error removal

Tips removed…

Nodes: TAGTCGA, AGAGA, TAGA, GCTTTAG, GCTCTAG, AGACAG, GAGGCT, GATCCGATGAG
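A sketch of tip clipping on a small part of the example, using the coverage values from the k-mer list (TCGA 10x, CGAG 8x, GAGG 16x, CGAC 1x, GACG 1x, ACGC 1x). Several things here are assumptions: nodes are connected whenever they overlap by k-1 bases, only dead ends without successors are handled, and the cutoff `max_len` and the rule "clip if weaker than every competing branch" are illustrative choices.

```python
def successors(kmer, nodes):
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in nodes]

def predecessors(kmer, nodes):
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in nodes]

def find_tip(end, counts, max_len):
    """Walk back from dead end `end` to the junction it branches off from.

    Returns (junction, chain) or None if no junction is found within max_len.
    """
    chain = [end]
    while len(chain) <= max_len:
        preds = predecessors(chain[-1], counts)
        if len(preds) != 1:
            return None
        if len(successors(preds[0], counts)) > 1:
            return preds[0], chain
        chain.append(preds[0])
    return None

def clip_tips(counts, max_len):
    """Remove short dead-end chains with lower coverage than their rivals."""
    counts = dict(counts)
    changed = True
    while changed:
        changed = False
        for node in list(counts):
            if node not in counts or successors(node, counts):
                continue  # only dead ends can be the end of a tip
            tip = find_tip(node, counts, max_len)
            if tip is None:
                continue
            junction, chain = tip
            rivals = [s for s in successors(junction, counts) if s != chain[-1]]
            if all(counts[chain[-1]] < counts[r] for r in rivals):
                for member in chain:
                    del counts[member]
                changed = True
    return counts

counts = {"TCGA": 10, "CGAG": 8, "GAGG": 16, "CGAC": 1, "GACG": 1, "ACGC": 1}
cleaned = clip_tips(counts, max_len=8)
print(sorted(cleaned))  # ['CGAG', 'GAGG', 'TCGA']
```

The low-coverage branch CGAC, GACG, ACGC is clipped, which corresponds to the node CGACGC disappearing from the list above.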

Error removal

Bubbles removed…

Nodes: TAGTCGA, AGAGA, TAGA, GCTTTAG, AGACAG, GAGGCT, GATCCGATGAG
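A sketch of bubble popping on the part of the example that contains GCTTTAG and GCTCTAG, with coverage values from the k-mer list. It assumes nodes are connected whenever they overlap by k-1 bases, handles only simple two-branch bubbles, and keeps the branch with the higher mean coverage. `max_len` and all function names are illustrative.

```python
def successors(kmer, nodes):
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in nodes]

def predecessors(kmer, nodes):
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in nodes]

def follow_branch(start, counts, max_len):
    """Follow an unbranched path until it merges into a shared node.

    Returns (merge_node, path) or None if the path ends or branches first.
    """
    path = [start]
    while len(path) <= max_len:
        succ = successors(path[-1], counts)
        if len(succ) != 1:
            return None
        if len(predecessors(succ[0], counts)) > 1:
            return succ[0], path
        path.append(succ[0])
    return None

def mean_coverage(path, counts):
    return sum(counts[node] for node in path) / len(path)

def pop_bubbles(counts, max_len):
    """Where two branches reconverge, delete the weaker branch."""
    counts = dict(counts)
    for node in list(counts):
        if node not in counts:
            continue
        succ = successors(node, counts)
        if len(succ) != 2:
            continue
        first = follow_branch(succ[0], counts, max_len)
        second = follow_branch(succ[1], counts, max_len)
        if first is None or second is None or first[0] != second[0]:
            continue  # the branches do not form a closed bubble
        if mean_coverage(first[1], counts) < mean_coverage(second[1], counts):
            weaker = first[1]
        else:
            weaker = second[1]
        for member in weaker:
            del counts[member]
    return counts

counts = {
    "GGCT": 11, "GCTT": 8, "CTTT": 8, "TTTA": 8, "TTAG": 12,
    "GCTC": 2, "CTCT": 1, "TCTA": 2, "CTAG": 2, "TAGA": 16,
}
cleaned = pop_bubbles(counts, max_len=8)
print(sorted(cleaned))
```

The branch GCTC, CTCT, TCTA, CTAG (mean coverage 1.75) is dropped in favor of GCTT, CTTT, TTTA, TTAG (mean coverage 9).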

Error removal

Final simplification…

Nodes: TAGTCGAG, AGAGACAG, AGATCCGATGAG, GAGGCTTTAGA

de Bruijn graph

What if we have sequencing errors?

Reads: CGTACGTC GTACCTCA

Nodes: GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA

Additional nodes: GTACC TACCT ACCTC CCTCA, and the graph is disconnected

Choice of k: tradeoff between error tolerance and repeat resolution
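The tradeoff can be made concrete by counting connected components for different k. This sketch assumes the read set GACGTACG, CGTACGTC plus the erroneous read GTACCTCA, and connects two k-mers only if a read contains the corresponding (k+1)-mer. The function name `components` is illustrative.

```python
def components(reads, k):
    """Number of connected components of the de Bruijn graph of `reads`."""
    adjacency = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            adjacency.setdefault(read[i:i + k], set())
        for i in range(len(read) - k):
            a, b = read[i:i + k], read[i + 1:i + k + 1]
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen = set()
    count = 0
    for start in adjacency:
        if start in seen:
            continue
        count += 1
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return count

reads = ["GACGTACG", "CGTACGTC", "GTACCTCA"]
print(components(reads, 5))  # 2: the erroneous read forms its own component
print(components(reads, 4))  # 1: connected again, but ACGT is repeated
```

With k = 5 the error cuts the graph apart; with k = 4 the graph stays connected at the price of the repeat-induced cycle seen earlier.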

What is a good assembly?

Some assembly evaluation metrics:

- High N50 contig size (most commonly used metric)
  [Figure: contigs sorted by size, with the N50 contig size marked]

- Low number of assembly errors (sequence errors, structural misjoins). Only if a reference sequence is known!
  [Figure: original sequence and what it was assembled as]
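For reference, N50 is the length of the contig at which, going through the contigs from longest to shortest, at least half of the total assembled bases have been covered. A minimal sketch follows; the contig lengths are made up for illustration.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

contig_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical assembly
print(n50(contig_lengths))  # 70
```

N50 rewards contiguity only: wrongly joining two contigs raises it, which is why it has to be read together with the error counts.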

Conclusions

- Most assemblers use either the OLC or the de Bruijn graph paradigm; both lead to NP-hard assembly models.

- However, assembler performance is independent of the underlying paradigm and mainly depends on heuristics for repeat resolution and handling noise.

- How to measure assembly accuracy is another important aspect; in general there is a tradeoff between assembly contiguity and correctness.

- Evaluations show that assembly is far from solved; assembler performance is still quite inconsistent.

“For large genomes, the choice of assemblers is often limited to those that will run without crashing” (GAGE paper, 2012)

References
- GAGE: a critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg et al., Genome Res. 2012
- Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Bradnam et al., not yet published
- Assembly algorithms for next-generation sequencing data. Jason R. Miller, Sergey Koren, Granger Sutton, Genomics, 2010
- Fragment assembly string graph. Myers, 2005

Figures:
- geneed.nlm.nih.gov
- http://paper-shredding-services-review.toptenreviews.com/
- www.illumina.com
- A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Zhang et al., PLOS one, 2011
