Assembling Genome Timothee Cezard EBI NGS workshop 16/10/2012



Assembly Algorithms

• Goal: find the shortest common superstring of a set of reads.

• This is an NP-hard problem, so we need to use approximation algorithms.

Main algorithms used:
• Overlap-Layout-Consensus
• De Bruijn graphs
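
To make the goal concrete, here is a minimal sketch of one classic approximation: greedily merge the two reads with the longest suffix/prefix overlap until one string is left. This is an illustration only, not what any of the assemblers in these slides actually do; the function names are made up, and the two reads are the ones shown on the overlap graph slide.

```python
from itertools import permutations


def overlap_len(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads):
    """Greedy approximation: repeatedly merge the best-overlapping pair."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_n, best_a, best_b = -1, None, None
        for a, b in permutations(reads, 2):
            n = overlap_len(a, b)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_n:])  # merge, overlap kept once
    return reads[0] if reads else ""


print(greedy_superstring(["CGTAGTGGCAT", "ATTCACGTAG"]))  # ATTCACGTAGTGGCAT
```

The result is a common superstring, but the greedy choice does not guarantee the shortest one, and every round compares all pairs of reads, which is why real tools need smarter strategies.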


Overlap-layout-consensus
Step 1: Find Overlapping Reads

• Need an efficient alignment algorithm
• Doesn’t scale well when the number of reads is high
• Use seed-based alignment with extension

TACATAGATTACACAGATTACTGA

|| ||||||||||||||||||||

TAGTTAGATTACACAGATTACTAGA
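
A rough sketch of the seed-and-extend idea, assuming exact seeds of a fixed length and a simple mismatch count for the extension (real aligners are far more elaborate). The names `seed_index` and `find_overlaps` and the parameter values are hypothetical. Only read pairs that share a seed are ever compared, which is what avoids the all-vs-all alignment.

```python
from collections import defaultdict


def seed_index(reads, seed_len):
    """Map each substring of length seed_len to the (read id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for i in range(len(read) - seed_len + 1):
            index[read[i:i + seed_len]].append((rid, i))
    return index


def find_overlaps(reads, seed_len=4, max_mismatches=1):
    """Return {(a, b): (shift, overlap length, mismatches)}; read b starts `shift` bases into read a."""
    overlaps = {}
    for positions in seed_index(reads, seed_len).values():
        for ra, ia in positions:
            for rb, ib in positions:
                shift = ia - ib
                if ra == rb or shift <= 0:
                    continue  # keep only layouts where a starts before b
                a, b = reads[ra], reads[rb]
                # Extension: compare the whole region implied by the seed hit.
                length = min(len(a) - shift, len(b))
                mismatches = sum(x != y for x, y in zip(a[shift:], b))
                if mismatches <= max_mismatches:
                    # A later seed hit for the same pair overwrites an earlier one.
                    overlaps[(ra, rb)] = (shift, length, mismatches)
    return overlaps


reads = ["ATTCACGTAG", "CGTAGTGGCAT"]
print(find_overlaps(reads))  # {(0, 1): (5, 5, 0)}
```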


Overlap-layout-consensus
Step 2: Construct overlap graph

• A graph is constructed:
(1) Nodes are reads
(2) Edges represent overlapping reads

CGTAGTGGCAT

ATTCACGTAG

Overlap graph
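
A minimal sketch of such a graph as a dictionary of dictionaries, assuming exact suffix/prefix overlaps and a hypothetical minimum overlap length `min_len`. Edge weights are overlap lengths.

```python
def overlap_len(a, b, min_len=3):
    """Longest suffix of a matching a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Nodes are reads; graph[a][b] = n means a suffix of a overlaps a prefix of b by n bases."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap_len(a, b, min_len)
                if n:
                    graph[a][b] = n
    return graph


graph = build_overlap_graph(["CGTAGTGGCAT", "ATTCACGTAG"])
print(graph)
```

Here the only edge goes from ATTCACGTAG to CGTAGTGGCAT (overlap CGTAG); the 2-base overlap in the other direction is below `min_len` and is ignored.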


Overlap-layout-consensus
Step 3: Find Contigs

Try to find a Hamiltonian path:
• a path in the graph that contains each node exactly once
• Computationally expensive

CGTAGTGGCAT

ATTCACGTAG
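
To show why this is expensive, here is a brute-force sketch that tries every ordering of the reads and keeps the valid path with the largest total overlap. It assumes the graph is given as `graph[a][b] = overlap length`; the third read `TGGCATTA` is made up for the example. The number of orderings grows factorially, so this only works for a handful of reads.

```python
from itertools import permutations


def best_hamiltonian_path(graph):
    """Try every node ordering; return (path, total overlap) or (None, -1) if no path exists."""
    best_path, best_score = None, -1
    for order in permutations(graph):
        score = 0
        for a, b in zip(order, order[1:]):
            if b not in graph[a]:
                break  # no edge a -> b, so this ordering is not a path
            score += graph[a][b]
        else:
            if score > best_score:
                best_path, best_score = list(order), score
    return best_path, best_score


def path_to_contig(path, graph):
    """Lay the reads out along the path, writing each overlap only once."""
    contig = path[0]
    for a, b in zip(path, path[1:]):
        contig += b[graph[a][b]:]
    return contig


graph = {
    "CGTAGTGGCAT": {"TGGCATTA": 6},
    "ATTCACGTAG": {"CGTAGTGGCAT": 5},
    "TGGCATTA": {},
}
path, score = best_hamiltonian_path(graph)
print(path_to_contig(path, graph))  # ATTCACGTAGTGGCATTA
```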


Overlap-layout-consensus

• This approach is used in Celera (CABOG), Newbler, Mira, SGA…

• It is mostly used with Sanger or 454 data.

• Cannot resolve repeats longer than the read length

• Could make a comeback if reads get longer.


De Bruijn Graphs example

“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, ...”

Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman & Hall

Velvet example courtesy of J. Leipzig 2010


De Bruijn Graphs example

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…

Generate random ‘reads’. How do we assemble?

Traditional all-vs-all comparisons of datasets this size require immense computational resources.

De Bruijn solution: Construct a graph efficiently

fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe theageofwi foolishnes incredulit ofbeliefit chofincred beliefitwa beliefitwa wisdomitwa eageoffool eoffoolish itwastheag mesitwasth epochofinc ssitwasthe itwastheep astheageof stheageoff sitwasthee thebestoft oolishness heepochofb ochofbelie wastheepoc bestoftime mesitwasth ebestoftim pochofincr

… etc., up to tens of millions of reads


De Bruijn Graphs
Step 1: “Kmerize” the data

Reads and their kmers (k=3):

theageofwi: the hea eag age geo eof ofw fwi
sthebestof: sth the heb ebe bes est sto tof
astheageof: ast sth the hea eag age geo eof
worstoftim: wor ors rst sto tof oft fti tim
imesitwast: ime mes esi sit itw twa was ast

… etc. for all reads in the dataset
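
The kmerizing step is just a sliding window of width k over each read. A minimal sketch (the function name `kmerize` is made up):

```python
def kmerize(read, k):
    """Return every substring of length k, in order of position in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmerize("theageofwi", 3))
# ['the', 'hea', 'eag', 'age', 'geo', 'eof', 'ofw', 'fwi']
```

A read of length L gives L - k + 1 kmers, so each 10-letter read above gives 8 kmers at k=3.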


De Bruijn Graphs
Step 2: Build the graph

Look for k-1 overlaps, given by the reads:

the → hea → eag → age → geo → eof → ofw → fwi
sth → the → heb → ebe → bes → est → sto → tof
ast → sth → the → hea → eag → age → geo → eof
wor → ors → rst → sto → tof → oft → fti → tim
ime → mes → esi → sit → itw → twa → was → ast

… etc. for all ‘kmers’ in the dataset
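
A minimal sketch of this construction, following the slides' convention that nodes are kmers and an edge links two kmers that are consecutive in some read (so they overlap by k-1 letters). The function name is made up, and real assemblers also handle reverse complements and kmer counts, which are left out here.

```python
def build_de_bruijn(reads, k):
    """Return {kmer: set of kmers that directly follow it in at least one read}."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # identical kmers share one node
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


graph = build_de_bruijn(["theageofwi", "sthebestof", "astheageof"], k=3)
print(sorted(graph["the"]))  # ['hea', 'heb']
```

No read-against-read comparison is needed: reads are linked simply because they contain the same kmers, and the node "the" already shows a branch between "theage..." and "thebest...".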


De Bruijn Graphs
Step 3: Simplify the graph


De Bruijn Graphs
Step 4: Create contigs

No single solution!
Break the graph to give the final assembly
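
One common way to simplify and then break the graph, sketched here under the assumption that the graph is stored as `{kmer: set of successor kmers}`: merge every chain of nodes with exactly one way in and one way out, and cut at every branching node. The function name `unitigs` is hypothetical, and error removal (tips, bubbles) is not shown.

```python
def unitigs(graph):
    """Return the sequences spelled by maximal non-branching paths of the graph."""
    preds = {node: set() for node in graph}
    for node, succs in graph.items():
        for succ in succs:
            preds[succ].add(node)

    def is_simple(node):
        # Exactly one incoming and one outgoing edge: safe to merge through.
        return len(preds[node]) == 1 and len(graph[node]) == 1

    contigs = []
    for node in graph:
        if is_simple(node):
            continue  # paths only start at ends or branching nodes
        for succ in graph[node]:
            path = [node, succ]
            while is_simple(path[-1]):
                path.append(next(iter(graph[path[-1]])))
            # Each extra kmer on the path adds one new letter.
            contigs.append(path[0] + "".join(kmer[-1] for kmer in path[1:]))
    return contigs  # note: isolated cycles of simple nodes are not reported


graph = {
    "sth": {"the"},
    "the": {"heb", "hea"},
    "heb": {"ebe"}, "ebe": set(),
    "hea": {"eag"}, "eag": set(),
}
print(sorted(unitigs(graph)))  # ['sthe', 'theag', 'thebe']
```

The branch at "the" cannot be resolved from the graph alone, so the assembly is broken there into three contigs.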


De Bruijn example

The final assembly (k=3):

wor times itwasthe foolishness

incredulity age epoch be

st wisdom

of belief

A better assembly (k=10):
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…

Repeat with a longer “kmer” length

Why not always use the longest ‘k’ possible?

Sequencing errors:

Read with one error: sthebentof (instead of sthebestof)

k=3: sth the heb ebe ben ent nto tof (mostly unaffected kmers)

k=10: sthebentof (100% wrong kmers)
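
The effect can be checked with a small sketch that compares the kmers of the erroneous read with those of the correct read, position by position. The function name is made up; the two reads are the ones from the slides.

```python
def wrong_kmer_fraction(true_read, observed_read, k):
    """Fraction of kmers in the observed read that differ from the true read at the same position."""
    true_kmers = [true_read[i:i + k] for i in range(len(true_read) - k + 1)]
    seen_kmers = [observed_read[i:i + k] for i in range(len(observed_read) - k + 1)]
    wrong = sum(t != s for t, s in zip(true_kmers, seen_kmers))
    return wrong / len(seen_kmers)


print(wrong_kmer_fraction("sthebestof", "sthebentof", 3))   # 0.375
print(wrong_kmer_fraction("sthebestof", "sthebentof", 10))  # 1.0
```

A single error spoils up to k kmers, so the longer the kmer, the larger the share of a read's kmers that is lost to each error.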


Strengths and problems of De Bruijn approach

Strengths:
• No need to calculate the overlaps
• Size of the final graph is a function of the genome size
• Repeats are collapsed

Problems:
• Can only resolve repeats up to k long
• Lose connectivity when creating the contigs


Resolving repeats through scaffolding

Align reads from short insert or long insert libraries

Join contigs using evidence from paired-end data

Contigs from assembly

Scaffold
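
A heavily simplified sketch of the joining step. It assumes the paired reads have already been aligned and reduced to a list of `(left_contig, right_contig)` links, one per read pair whose two mates land on different contigs; that input format, the function name and the `min_links` threshold are all assumptions for illustration. Contig orientation and gap size estimation, which real scaffolders must handle, are ignored.

```python
from collections import Counter


def scaffold(contigs, pair_links, min_links=2):
    """Chain contigs that are linked by at least min_links read pairs."""
    succ, pred = {}, {}
    # Consider the best-supported links first; each contig gets at most
    # one successor and one predecessor.
    for (a, b), n in Counter(pair_links).most_common():
        if n >= min_links and a != b and a not in succ and b not in pred:
            succ[a] = b
            pred[b] = a
    scaffolds = []
    for contig in contigs:
        if contig in pred:
            continue  # not the start of a chain
        chain = [contig]
        while chain[-1] in succ:
            chain.append(succ[chain[-1]])
        scaffolds.append(chain)
    return scaffolds  # note: contigs linked in a circle are not reported


links = [("c1", "c2")] * 3 + [("c2", "c3")] * 2 + [("c3", "c4")]
print(scaffold(["c1", "c2", "c3", "c4"], links))
# [['c1', 'c2', 'c3'], ['c4']]
```

The single pair linking c3 to c4 is below the threshold, so c4 stays on its own.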


De Bruijn assemblers

• Velvet: http://www.ebi.ac.uk/~zerbino/velvet/

• ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss

• SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html

• ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/

• IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/


What makes an assembly good?

• High coverage: 50 to 100X
• Different but precise insert size libraries
• Few or no sequencing errors
• Avoid a large number of variants

• Try different assemblers
• Need a machine with a lot of memory (from 16 GB to 1 TB)
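
For the coverage figure, the usual back-of-the-envelope estimate is number of reads times read length divided by genome size. A small sketch; the read count, read length and genome size below are made-up numbers.

```python
import math


def coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads reaching the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical 50 Mb genome sequenced with 100 bp reads.
print(coverage(25_000_000, 100, 50_000_000))  # 50.0
print(reads_needed(100, 100, 50_000_000))     # 50000000
```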


What makes your assembly better?

Error correction: correct the reads before assembly
http://bib.oxfordjournals.org/content/early/2012/04/06/bib.bbs015.full

• SOAPdenovo
• Reptile: http://aluru-sun.ece.iastate.edu/doku.php?id=reptile
• SGA: https://github.com/jts/sga

Joining overlapping reads:
• COPE: ftp://ftp.genomics.org.cn/pub/cope/

• FLASH: http://genomics.jhu.edu/software/FLASH/index.shtml


What makes your assembly better?

Gap filling: IMAGE (Tsai et al., Genome Biology 2010)


Assembly validation

N50 is the most commonly used metric:
Weighted median such that 50% of your assembly is contained in contigs of length >= N50
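
The definition above can be computed directly from the contig lengths. A minimal sketch; the lengths in the example are made up.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Total is 290, so half is 145: 80 + 70 = 150 reaches it at the 70 contig.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

N50 only measures contiguity: joining contigs wrongly also raises it, so it says nothing about correctness.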

CEGMA: Core Eukaryotic Genes Mapping Approach
• Looks in your assembly for genes that should be there
• Usually the best assemblies have the best CEGMA scores
http://korflab.ucdavis.edu/datasets/cegma/

There is no magic tool