Torsten Seemann - de novo genome assembly

De novo genome assembly

Dr Torsten Seemann

IMB Winter School - Brisbane – Mon 7 July 2014

Introduction

Ideal world

I would not need to give this talk!

AGTCTAGGATTCGCTACAGATTCAGGCTCTGAAGCTAGATCGCTATGCTATGATCTAGATCTCGAGATTCGTATAAGTCTAGGATTCGCTATAGATTCAGGCTCTGATATAT

Human DNA iSequencer™

46 complete haplotype chromosome sequences

Real world

•  Can’t sequence full-length native DNA
– no instrument exists (yet)

•  But we can sequence short fragments
– 100 at a time (Sanger)
– 100,000 at a time (Roche 454)
– 1,000,000 at a time (PGM)
– 10,000,000 at a time (Proton, MiSeq)
– 100,000,000 at a time (HiSeq)

De novo assembly

The process of reconstructing the original DNA sequence from the fragment reads alone.

•  Instinctively like a jigsaw puzzle

– Find reads which “fit together” (overlap)
– Could be missing pieces (sequencing bias)
– Some pieces will be dirty (sequencing errors)

An example

A small “genome”

Friends, Romans, countrymen, lend me your ears;

I’ll return them tomorrow!

Shakespearomics

•  Reads
– ds, Romans, count
– ns, countrymen, le
– Friends, Rom
– send me your ears;
– crymen, lend me

Oops! I dropped them.

•  Overlaps
Friends, Rom
     ds, Romans, count
             ns, countrymen, le
                     crymen, lend me
                             send me your ears;

I’m good with words.

•  Majority consensus
Friends, Romans, countrymen, lend me your ears;

We have a consensus!

So far, so good.
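The majority vote on the Shakespeare reads can be sketched in a few lines of Python. This is only an illustration: the offsets in LAYOUT are assumed to be known already, whereas a real assembler has to work them out from the overlap alignments, and the function name is made up for this example.

```python
from collections import Counter

# Reads from the slide, each paired with the position where it sits in the
# layout. The offsets are an assumption; an assembler would derive them
# from the overlaps.
LAYOUT = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),
    (29, "send me your ears;"),
]


def majority_consensus(layout):
    """Vote on every column of the layout and join the winning characters.

    Assumes every column is covered by at least one read.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, char in enumerate(read):
            columns[offset + i][char] += 1
    # most_common(1) returns the most frequent character in a column.
    return "".join(col.most_common(1)[0][0] for col in columns)


print(majority_consensus(LAYOUT))
# Friends, Romans, countrymen, lend me your ears;
```

The two "dirty" reads ("crymen" and "send") are each outvoted 2 to 1 by the clean reads covering the same column.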

The awful truth

“Genome assembly is impossible.”

A/Prof. Mihai Pop World leader in de novo assembly research.

He wears glasses so he must be smart :-P

Methods

Approaches

•  greedy assembly
•  overlap :: layout :: consensus
•  de Bruijn graphs
•  string graphs
•  seed and extend

… all essentially doing the same thing, but taking different short cuts.

Assembly recipe

•  Find all overlaps between reads – hmm, sounds like a lot of work…

•  Build a graph – a picture of read connections

•  Simplify the graph – sequencing errors will mess it up a lot

•  Traverse the graph –  trace a sensible path to produce a consensus

Clean graph

Find read overlaps

•  If we have N reads of length L
– we have to do ½N(N-1) ~ O(N²) comparisons
– each comparison is an ~ O(L²) alignment
– use special tricks/heuristics to reduce these!

•  What counts as “overlapping”?
– minimum overlap length, e.g. 20 bp
– minimum % identity across overlap, e.g. 95%
– choice depends on L and expected error rate
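A rough sketch of the overlap test, assuming the simplest possible case: an ungapped suffix-to-prefix comparison in one direction only. Real assemblers use gapped alignment, check both read orders and reverse complements, and avoid the all-against-all loop. The function names and the short test reads are made up for this example; the default thresholds are the 20 bp and 95% from the slide.

```python
def best_overlap(a, b, min_len=20, min_identity=0.95):
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Ungapped comparison only (no insertions or deletions). Returns 0 when
    no overlap satisfies both thresholds.
    """
    shortest_allowed = max(min_len, 1)  # avoid a zero-length comparison
    for length in range(min(len(a), len(b)), shortest_allowed - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        matches = sum(x == y for x, y in zip(suffix, prefix))
        if matches / length >= min_identity:
            return length
    return 0


def pair_count(n):
    """Number of unordered read pairs to compare: N(N-1)/2."""
    return n * (n - 1) // 2


print(best_overlap("ACGTTGCA", "TGCAGGTC", min_len=3, min_identity=1.0))  # 4
print(pair_count(1_000_000))  # 499999500000
```

Even a modest million reads means roughly half a trillion pairs, which is why the tricks and heuristics matter.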

What we are up against!

What ruins the graph?

•  Read errors
– introduce false edges and nodes

•  Non-haploid organisms
– heterozygosity causes lots of detours

•  Repeats
– if longer than read length
– cause nodes to be shared, locality confusion

Graph simplification

•  Squash small bubbles – collapse small errors (or minor heterozygosity)

•  Remove spurs

– short “dead end” hairs on the graph

•  Join unambiguously connected nodes –  reliable stretches of unique DNA
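The last simplification step, joining unambiguously connected nodes, might look something like this. It is a sketch under assumptions: the graph is given as a plain list of directed edges, node names are hypothetical, and a component that is one clean cycle with no fork is not handled.

```python
from collections import defaultdict


def unambiguous_paths(edges):
    """Merge chains of nodes that have exactly one way in and one way out.

    `edges` is a list of (from_node, to_node) tuples. Returns a list of
    paths; each path ends where the graph forks, merges, or stops.
    """
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def joins(u, v):
        # u -> v is unambiguous if u has one exit and v has one entrance.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    paths = []
    for node in sorted(nodes):
        # Skip nodes that sit inside a chain; they are reached from its start.
        if len(pred[node]) == 1 and joins(pred[node][0], node):
            continue
        path = [node]
        while len(succ[path[-1]]) == 1 and joins(path[-1], succ[path[-1]][0]):
            path.append(succ[path[-1]][0])
        paths.append(path)
    return paths


edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
print(unambiguous_paths(edges))
# [['A', 'B', 'C'], ['D'], ['E']]
```

A, B and C collapse into one reliable stretch; the fork after C is a decision point, so D and E stay separate.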

Graph traversal

•  For each disconnected subgraph
– at least one per replicon in original sample

•  Find a path which visits each node once
– Hamiltonian path/cycle is NP-hard (this is bad)
– solution will be a set of paths which terminate at decision points

•  Form consensus sequences from paths
– use all the overlap alignments
– each of these collapsed paths is a contig

Contigs

Contiguous, unambiguous stretches of assembled DNA sequence

•  Contig ends correspond to
– Real ends (for linear DNA molecules)
– Dead ends (missing sequence)
– Decision points (forks in the road)

Repeats

What is a repeat?

A segment of DNA which occurs more than once in the genome sequence

•  Very common
– Transposons (self replicating genes)
– Satellites (repetitive adjacent patterns)
– Gene duplications (paralogs)

Effect on assembly

The repeated element is collapsed into a single contig

Repeat mis-assembly

Figure: three kinds of repeat mis-assembly (collapsed tandem, excision, rearrangement)

The law of repeats

•  It is impossible to resolve repeats of length S unless you have reads longer than S.
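A toy demonstration of why this holds, assuming perfect error-free reads taken from every position. The 7 bp repeat and the flanking sequences are invented for the example. Two genomes that differ only in the order of the unique pieces between repeat copies give exactly the same reads until a read is long enough to cover the whole repeat plus a unique base on each side.

```python
from collections import Counter


def all_reads(genome, length):
    """Every substring of the given length: perfect, error-free reads."""
    return Counter(genome[i:i + length] for i in range(len(genome) - length + 1))


REPEAT = "GATTACA"  # hypothetical repeat, S = 7
a, b, c, d = "CC", "TT", "GG", "AA"  # hypothetical unique sequence

genome_1 = a + REPEAT + b + REPEAT + c + REPEAT + d
genome_2 = a + REPEAT + c + REPEAT + b + REPEAT + d  # b and c swapped

for length in (5, 8, 9):
    same = all_reads(genome_1, length) == all_reads(genome_2, length)
    print(length, "indistinguishable" if same else "distinguishable")
# 5 indistinguishable
# 8 indistinguishable
# 9 distinguishable
```

With reads of 8 bp or less no assembler can tell the two genomes apart, however clever it is; at 9 bp a read finally spans the repeat and anchors it in unique sequence on both sides.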


Scaffolding

Beyond contigs

Contig sizes are limited by:

•  the length of repeats in your genome
– can’t change this!

•  the length (or “span”) of the reads
– wait for new technology
– use “tricks” with existing technology

Paired reads

•  DNA fragment (200-800 bp)
==============================

•  Single end
-------->=====================

•  Paired end (up to 800 bp span)
----->==================<-----

•  Mate pair (up to 20 kbp span)
---->========/+/=========<----

Scaffolding

•  Paired-end reads
– known sequences at either end
– roughly known distance between ends
– unknown sequence between ends

•  Most ends will occur in same contig
– if our contigs are longer than pair distance

•  Some ends will be in different contigs
– evidence that these contigs are linked!
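Collecting that linking evidence could be sketched as below. The input format, the contig names and the `min_pairs` threshold are all assumptions for illustration; a real scaffolder also uses read orientation and the expected pair distance to order the contigs and size the gaps.

```python
from collections import Counter


def contig_links(pair_hits, min_pairs=2):
    """Count read pairs whose two ends map to different contigs.

    `pair_hits` is a list of (contig_of_read_1, contig_of_read_2) tuples.
    Returns {(contig_x, contig_y): pair_count} for links seen at least
    `min_pairs` times.
    """
    links = Counter()
    for first, second in pair_hits:
        if first != second:  # both ends in one contig links nothing
            links[tuple(sorted((first, second)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_pairs}


hits = [
    ("contig1", "contig1"),
    ("contig1", "contig2"),
    ("contig2", "contig1"),
    ("contig2", "contig3"),
    ("contig3", "contig3"),
]
print(contig_links(hits))
# {('contig1', 'contig2'): 2}
```

Requiring more than one supporting pair is a cheap guard against a single chimeric or mis-mapped pair joining two unrelated contigs.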

Contigs to scaffolds

Figure: contigs linked by paired-end reads into a scaffold, with gaps between the contigs

Assessment

Assessing assemblies

•  We desire
– Total length similar to genome size
– Fewer, larger contigs
– No mistakes (mis-assemblies)

•  Metrics
– No generally useful objective measure
– Longest contig, total bp, N50, …

The “N50”

The length of the contig at which the running sum of contig lengths, shortest first, reaches 50% of the total bases

•  Imagine we got 7 contigs with lengths:
– 1, 1, 3, 5, 8, 12, 20

•  Total
– 1+1+3+5+8+12+20 = 50

•  The “halfway sum” is 25
– 1+1+3+5+8 = 18 (< 25)
– 1+1+3+5+8+12 = 30 (≥ 25) so N50 is 12

N50 concerns

•  Optimizing for N50
– encourages mis-assemblies!

•  An aggressive assembler may over-join:
– 1, 1, 3, 5, 8, 12, 20 (previous)
– 1, 1, 3, 5, 20, 20 (now)
– 1+1+3+5+20+20 = 50 (unchanged)

•  The “halfway sum” is still 25
– 1+1+3+5+20 = 30 (≥ 25) so N50 is 20 (was 12)
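A minimal N50 calculation for the two contig sets above. One assumption to flag: this sketch sums from the longest contig down, which is the usual convention, whereas the slides sum from the shortest up. The two directions can only disagree when a running sum lands exactly on the halfway mark, and for both of these contig sets they give the same answer.

```python
def n50(lengths):
    """Length of the contig at which the running total, longest first,
    reaches half of all assembled bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


print(n50([1, 1, 3, 5, 8, 12, 20]))  # 12
print(n50([1, 1, 3, 5, 20, 20]))     # 20
```

Same reads, same total of 50 bases, but the over-joined assembly scores 20 instead of 12. The metric cannot tell whether the join was correct.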

Validation

•  Self consistency
– Align reads back to contigs
– Check for errors or discordant pairs

•  Second opinion
– Use two complementary sequencing methods
– Target troublesome areas for PCR
– Use a genome wide “optical map”

How can I play?

Considerations

•  Size of genome
– bacteria, eukaryote, meta-genome

•  Hardware
– phone, laptop, desktop, server, cloud
– RAM is more limiting than CPU

•  Operating system
– Linux, Mac, Windows

•  Software budget
– commercial, free, open-source

Recommendations

•  SPAdes
– Unix command-line (Mac, Linux)

•  VAGUE (Velvet)
– Unix GUI (Mac, Linux)

•  CLC Genomics Workbench
– Java GUI (Windows, Mac, Linux)
– Commercial product

Online tutorial

•  The GVL
– Genomics Virtual Laboratory
– http://genome.edu.au

•  Protocols
– Microbial de novo assembly for Illumina data
– Written by Simon Gladman (VBC/LSCC)
– https://genome.edu.au/wiki/Protocols

Contact

•  Email
– [email protected]

•  Blog
– TheGenomeFactory.blogspot.com

•  Web
– vicbioinformatics.com
– vlsci.org.au/lscc
– genome.edu.au

