Upload
others
View
11
Download
0
Embed Size (px)
Citation preview
WholegenomeshotgunsequencingDeNovoAssembly
2
StrategiestosequencelongDNAmolecules:ShotgunSequencing
TemplateDNA
1. RandomlybreaktemplateDNAintopieces
3
StrategiestosequencelongDNAmolecules:ShotgunSequencing
TemplateDNA
1. RandomlybreaktemplateDNAintopieces
4
StrategiestosequencelongDNAmolecules:ShotgunSequencing
1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentendsandsizeselect
5
StrategiestosequencelongDNAmolecules:ShotgunSequencing Sometimesadaptersequencesremain!
1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments
Identifyingthesesequencesissimplewhenweignorethecomplexityofthesearch
Theproblemis,whatsequence(s)arewelookingfor?
6
StrategiestosequencelongDNAmolecules:ShotgunSequencing Sometimesadaptersequencesremain!
1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments
Identifyingthesesequencesissimplewhenweignorethecomplexityofthesearch
Clipadaptersequences
7
StrategiestosequencelongDNAmolecules:ShotgunSequencing
ReadPair1ReadPair2ReadPair3ReadPair4ReadPair5ReadPair6ReadPair7ReadPair8ReadPair9ReadPair10ReadPair11
1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments4. Identifyandremoveadapterpartfromthedeterminedsequences5. Reconstructtemplatesequencefromthesequencereads
Reconstructtemplate
8
1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments4. Removeadapterpartfromthedeterminedsequences5. Reconstructtemplatesequencefromthesequencereads
1. Referenceguidedsequenceassembly:mapreadstoareferencesequence2. De-novosequenceassembly:determineoverlapbetweensequencereadsandassemble
overlappingsequencesintocontigs.
Contig1 Contig2 Contig3
collapse
collapse
collapse
StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereads
9
ReadPair11ReadPair2
ReadPair5
Contig1 Contig2 Contig3
Scaffold1NN
1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments4. Removeadapterpartfromthedeterminedsequences5. Reconstructtemplatesequencefromthesequencereads
1. Referenceguidedsequenceassembly:mapreadstoareferencesequence2. De-novosequenceassembly:determineoverlapbetweensequencereadsandassemble
overlappingsequencesintocontigs.Matepairinformationcanthenbeusedtobuildsuper-contigs(scaffolds)fromphysicallynon-overlappingcontigs.
Scaffold2NNNN NN
StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereads
10
1. Coverage:TheaveragenumberofreadscoveringapositioninthesequencedtemplateDNA.Lengthofgenomicsegment:LNumberofreads:nCoverage C=nl/LLengthofeachread:l
Howmuchcoverageisenough?Lander-Watermanmodel: Assuminguniformdistributionofreads,C=10resultsin1gappedregionper1,000,000nucleotides->Thisisnomorethanacruderuleofthumbandgreatlydependsonreadlength,repeatcompositionofthetemplateDNA,etc.
Contig1 Contig2 Contig3
collapse
collapse
collapse
StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereadsSomesummarystatisticstodescribeassemblies
Thehigherthecoveragethebetter(providedunlimitedcomputationalresources)!Themoreuniformthecoveragedistributionthebetter!
11
2. N50-size:Morethan50%ofthebasesinyourassemblyresideincontigswithatleastthesizedeterminedbytheN50value.NOTE:YoucanofcoursespecifyanyotherN-value.
Contig1 Contig2 Contig3
collapse
collapse
collapse
StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereadsSomesummarystatisticstodescribeassemblies
WhatnowtellsustheN50sizeexactly?Isitaqualitymeasureaspeoplefrequentlyuseit?
WhendoesitmakesensetomentiontheN50size(justconsiderRNAseqassemblies)?
12
3. Contiglengthdistribution
Contig1 Contig2 Contig3
collapse
collapse
collapse
StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereadsSomesummarystatisticstodescribeassemblies
Note:Allthesestatisticscanbeusedforscaffoldsaswell!
Literatureonde-novoassemblies
• J.R.Milleretal.Genomics95(2010)315–327• P.Compeauetal.NatureBiotechnology29(2011)987-991
• ZerbinoandBirney.GenomeRes18(2008)821-829Velvet
DeNovoAssembly
Theassemblyproblem…
JLFB
JLFB
Overlap-Layout-ApproachAssemblers:ARACHNE,PHRAP,CAP,TIGR,CELERA,MIRA
Overlap:findpotentiallyoverlappingreads
Layout:mergereadsintocontigsandcontigsintosupercontigs
Consensus:derivetheDNAsequenceandcorrectreaderrors ..ACGATTACAATAGGTT..
Overlap
• Findthebestmatchbetweenthesuffixofonereadandtheprefixofanother
• Duetosequencingerrors,needtousedynamicprogrammingtofindtheoptimaloverlapalignment
• Applyafiltertoremovepairsoffragmentsthatdonotshareasignificantlylongcommonsubstring
Differentapproachestothesequenceassemblyproblem
modfiedfromCompeauetal.(2011)NatureBiotechnology29(11)
Overlapbasedassembly➢ readidentityismaintained➢ intuitive➢ Readscanbeorganisedinanoverlapgraph➢ Graphcomplexityincreaseswithcoverage,
thusreadredundancyinflatesthegraph
Word-basedapproaches➢ readidentityis(temporarily)
lost…➢ ReadsareorganisedindeBruijn
graphs➢ Graphcomplexitydependson
wordsize➢ Graphcomplexityis(byand
large)independentfromcoverage,readredundancyisnaturallyhandled
➢ repeatsarerepresentedonlyonceinthegraphwithexplicitlinkstothedifferentstartandendpoints
De-NovoSequenceAssembly: CAP3
18
ShotgunSequenceReads
1234567
De-NovoSequenceAssembly: CAP3
19
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
1234567
1 2 3 4 5 6 7
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
20
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
234567
1 2 3 4 5 6 7
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
21
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
1
234567
1 2 3 4 5 6 7
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
22
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
1
234567
1 2 3 4 5 6 7
Removethetrivialsolution(alignmentagainstitself)
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
23
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
234567
1 2 3 4 5 6 7
Removethetrivialsolution(alignmentagainstitself)
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
24
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
1
234567
1 2 3 4 5 6 7
1
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
25
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
1
234567
1 2 3 4 5 6 7
1
1
3 5
1Candidatepairsforread1:
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
26
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
34567
1
3 5
1Candidatepairsforread1:
2
1 2 3 4 5 6 7
2
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
27
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
234567
1 2 3 4 5 6 7
2
1
3 5
1Candidatepairsforread1:
6
2Candidatepairsforread2:
2
3
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
28
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
4567
1 2 3 4 5 6 7
1
3 5
1Candidatepairsforread1:
6
2Candidatepairsforread2:
Candidatepairsforread3:
3
6
3
3
7
3
2
3
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments
29
ShotgunSequenceReads
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
4567
1 2 3 4 5 6 7
1
3 5
1Candidatepairsforread1:
6
2Candidatepairsforread2:
Candidatepairsforread3:6
3
7
3
Candidatepairsforread7:
.
.
.
2
3
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments:post-processing
30
1
3
5
1
6
2
6
3
7
3
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
3. Removepoorqualitysequenceends
4. Computeglobalalignmentforthehighqualitysequencepairstoverifyoverlaps.
2
3
4
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments:post-processing
31
1
3
5
1
6
2
6
3
7
3
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
3. Removepoorqualitysequenceends
4. Computeglobalalignmentforthehighqualitysequencepairstoverifyoverlaps.Evaluateaccordingtothefollowingcriteria:1. minimumlength2. minimumidentity3. minimumsimilarity4. numberofhigh-qualitymismatches
Removesequencepairsthatdonotmeetthethresholdsfor4.1to4.4
4
2
3
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments:post-processing
32
1
3
5
1
6
2
6
3
7
3
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
3. Removepoorqualitysequenceends
4. Computeglobalalignmentforthehighqualitysequencepairstoverifyoverlaps.Evaluateaccordingtothefollowingcriteria:1. minimumlength2. minimumidentity3. minimumsimilarity4. numberofhigh-qualitymismatches
Removesequencepairsthatdonotmeetthethresholdsfor4.1to4.4
4
2
3
De-NovoSequenceAssembly(CAP3) Searchforlocalalignments:post-processing
33
1
3
5
1
6
2
6
3
7
1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.
2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.
3. Removepoorqualitysequenceends
4. Computeglobalalignmentforthehighqualitysequencepairstoverifyoverlaps.Evaluateaccordingtothefollowingcriteria:1. minimumlength2. minimumidentity3. minimumsimilarity4. numberofhigh-qualitymismatches
Removesequencepairsthatdonotmeetthethresholdsfor4.1to4.4
4
2
3
1
3
16
2
6
3
2
3
CAP3:ContigBuilding
5
4 7
34
5
16
2
6
3
CAP3:ContigBuilding
1
3
2
4 7
35
5
16
2
6
CAP3:ContigBuilding
1
3
2
4 7
36
5
6
CAP3:ContigBuilding
13
2
6
2
4 7
37
5
6
CAP3:ContigBuilding
13
26
2
1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise
analysis(Greedyalgorithmindecreasingorderofoverlapscores).
2) Inasimpleview:Checkthelayoutforincompatibilities.
4 7
38
5
6
CAP3:ContigBuilding
13
26
2
1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise
analysis(Greedyalgorithmindecreasingorderofoverlapscores).
2) Inasimpleview:Checkthelayoutforincompatibilities.
1) sequenceread1and2areincompatiblesincetheycouldnotbe
aligned.
4 7
39
5
6
CAP3:ContigBuilding
13
6
2
1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise
analysis(Greedyalgorithmindecreasingorderofoverlapscores).
2) Inasimpleview:Checkthelayoutforincompatibilities.
1) sequenceread1and2areincompatiblesincetheycouldnotbe
aligned.
2) resolveincompatibility
3) checkfornewpossiblelayouts
4 7
40
5
6
CAP3:ContigBuilding
13 2
1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise
analysis(Greedyalgorithmindecreasingorderofoverlapscores).
2) Inasimpleview:Checkthelayoutforincompatibilities.
1) sequenceread1and2areincompatiblesincetheycouldnotbe
aligned.
2) resolveincompatibility
3) checkfornewpossiblelayouts
4 7
41
5
6
Furtherstepsingenomesequencing
13 2
1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise
analysis(Greedyalgorithmindecreasingorderofoverlapscores).
2) Inasimpleview:Checkthelayoutforincompatibilities,remove
incompatiblereadsandalign.
3) Buildaconsensussequenceforeachcontigs.
4) Orderandorientcontigsifpossibleusingadditionalinformation,e.g.,
pairedendreads.
4 7
42
DeriveConsensusSequence
Derivemultiplealignmentfrompairwisereadalignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAAACTATAG TTACACAGATTATTGACTTCATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting
Errorcorrectionbyweightedvoting
• Correcterrorsusingmultiplealignment
TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGA
C: 20C: 35T: 30C: 35C: 40
C: 20C: 35C: 0C: 35C: 40
• Scorealignments• Acceptalignmentswithgoodscores
A: 15A: 25A: 40A: 25-
A: 15A: 25A: 40A: 25A: 0
Consensus(cont’d)• Aconsensussequenceisderivedfromaprofileofthe
assembledfragments
• Asufficientnumberofreadsisrequiredtoensureastatisticallysignificantconsensus
• Readingerrorsarecorrected
SequenceassemblywithDeBruijngraphs
Sequenceassemblycanbeabstractedtotheshortestsuperstringproblem
(NicolasdeBruijn1946)
Problem:findtheshortest(circular)superstringthatcontainsallpossiblesubstringsoflengthKoveragivenalphabet.
forK=4andatwoletteralphabetA={0,1}wehave16differentwords:
0000,0001,0010,0100,1000,0011,0110,1100,1001,1010,0101,0111,1011,1101,1110,1111
Sequenceassemblycanbeabstractedtotheshortestsuperstringproblem
(NicolasdeBruijn1946)
Problem:findtheshortest(circular)superstringthatcontainsallpossiblesubstringsoflengthKoveragivenalphabet.
forK=4andatwoletteralphabetA={0,1}wehave16differentwords:
0000,0001,0010,0100,1000,0011,0110,1100,1001,1010,0101,0111,1011,1101,1110,1111
Tosolvethisproblem,deBruijnborrowedfromEulerwhosolved1735the‘Königsberg’problem,i.e.thequestionwhetheritispossibletovisiteachislandbycrossingeachbridgeexactlyonce(Euleriancycle)
EulerianCycleProblem
• Findacyclethatvisitseveryedgeexactlyonce(Lineartime)
• AnEulerianCycleexistsifthenumberof‘outgoing’edgesforanodeequalsthenumberof‘incoming’edges*.
• Thegraphmayhave2nodeswithanoddnumberofedgesconnectedtoit.InthiscaseanEulerianpathratherthananEuleriancyclecanbefound.
*ProofbyC.Hierholzer
deBruijnsolvedtheproblembyrepresentingK-1mersasnodesandKmersasedgesinadirectedgraph.
Bydoingso,herelatedtheproblemoffindingashortestcommonsuperstringtothealreadysolvedproblemoffindinganEuleriancyclein
agraph.
001 011
100 110
000 010 101 111
0011
0010
0101
10111010
1101
0111
1111 1110
1100
10010001
10000000
0110
0100
I
II
III
IV
V
VI
VII
VIII
IXX
XI
XIIXIII
XIV
XV
XVI
001 011
100 110
000 010 101 111
0011
0010
0101
10111010
1101
0111
1111 1110
1100
10010001
10000000
0110
0100
I
II
III
IV
V
VI
VII
VIII
IXX
XI
XIIXIII
XIV
XV
XVI
Passingthroughtheedgesbyfollowingtheromannumbersreconstructsthesuperstringusingeachwordexactlyonce!
I:0000,II:0001,III:0011;IV:0110;V:1100;VI:1001;VII:0010;VIII:0101;IX:1011;X:0111;XI:1111;XII:1110;XIII:1101;XIV:1010;XV:0100;XVI:1000
0000110010111101
AdvancingtoDNAsequenceassemblyisstraightforward
modfiedfromCompeauetal.(2011)NatureBiotechnology29(11)
Classicaloverlapassembly(readidentityismaintained)
Kmerapproaches(readidentityislost)
DeBruijnGraphExample Shredreadsintok-mers(k=3)
G G A C T A A
G G A
G A C
A C T
C T A
T A A
G A C C A A A
G A C
A C C
C C A
C A A
A A A
53
Read1 Read2
GG(1x)
GA(1x)
AC(1x)
CT(1x)
TA(1x)
AA(1x)
GGA GAC ACT CTA TAA
GA(1x)
AC(1x)
CC(1x)
CA(1x)
AA(1x)
AA(1x)
GAC ACC CCA CAA AAA
DeBruijnGraphExample Mergeverticeslabeledbyidentical(k-1)-mers
Read1:
Read2:
54
ResultingGraph:GG(1x
GA(2x)
AC(2x)
CT(1x)
TA(1x)
AA(2x)
CC(1x)
CA(1x)
AA(1x)
GG(1x)
GA(1x)
AC(1x)
CT(1x)
TA(1x)
AA(1x)
GA(1x)
AC(1x)
CC(1x)
CA(1x)
AA(1x
AA(1x)
AnotherExample Constructthegraph(k=5)
AGAT(8x)
ATCC(7x)
TCCG(7x)
CCGA(7x)
CGAT(6x)
GATG(5x)
ATGA(8x)
TGAG(9x)
GATC(8x)
AAGT(3x)
AGTC(7x)
GTCG(9x)
TCGA(10x)
GGCT(11x)
TAGA(16x)
AGAG(9x)
GAGA(12x)
GACA(8x)
ACAA(5x)
GCTT(8x)
GCTC(2x)
CTTT(8x)
CTCT(1x)
TTTA(8x)
TCTA(2x)
TTAG(12x)
CTAG(2x)
AGAC(9x)
CGAG(8x)
CGAC(1x)
GAGG(16x)
GACG(1x)
AGGC(16x)
ACGC(1x)
55
Abranchingvertexiscausedbyeitherarepeatintheoriginalsequenceorasequencingerror
Sequencingerrorsaretypicallydetectedbyacoveragecutoffthreshold
Condenseunbranchedrunsinthegraph
AAGTCGA
TAGAGCTTTAG
GCTCTAG
GAGACAA
CGAG
CGACGC
GAGGCT
AGATCCGATGAG
56
AGAG
Correctsequencingerrorsusingacoveragethreshold
AAGTCGA
TAGAGCTTTAG
GAGACAA
CGAG
GAGGCT
AGATCCGATGAG
57
AGAG
Afterrecondensation
AAGTCGAG GAGACAAGAGGCTTTAGA
AGATCCGATGAG
58
AGAG
Source:SerafimBatzoglou
Anynon-branchingpathinthisgraphcorrespondstoacontigintheoriginalsequence.
Takingtheriskoffollowingarbitrarybranchingpathsmaycreatechimericspecies
BasicconceptsofdeBruijngraphbasedassemblers
• ThesequenceistreatedasaconsecutivestringofwordsoflengthK
• Sequencereadsarenolongerconsideredtorepresentaconsecutivestringofnucleotides.Thusreadlengthaswellasreadoverlapbecome,inprinciple,irrelevant.
• SequencereadsareonlyusedtoidentifywordsoflengthKoccurringinthesequence.
• Givenperfectdata–error-freeK-mersprovidingfullcoverageandspanningeveryrepeat–theK-mergraphwouldbeadeBruijngraphanditwouldcontainanEulerianpath,thatis,apaththattraverseseachedgeexactlyonce.
Themagic‘Kmer’givesmostusersofgraphbasedassemblyalgorithmsaveryhardtimeastheyhavetodecideonthesizeofK.
TogiveaninformedstatementweneedtomakesuretounderstandwhatKshouldrepresentandwhatthealgorithmicrequirementsofdeBruijngraphassemblersare
• Kmustrepresentawordthatoccursonlyonceinthesequencethatshouldbeassembled.Thus,Kmustbesufficientlylarge.
• deBruijngraphbasedassemblersassumethateachwordoflengthKoccurringinthegenomeisalsorepresentedinthegraph.AsKmersarecollectedfromafinitesetofsequencereads,Kmustnotbetoolarge.
• consideraDNAwordofK=2,howoftendoesitonaverageoccurinastringof16bp?
• HowaboutawordofK=25*
DefaultforTrinity(Grabherretal.2011)
Takehomemessage:IfKisonlysufficientlylargethechanceforanyKmertooccurmorethanonceina
(repeat-free)genomeapproaches0.
WhynotusingsimplythereadlengthasK?
WhyKmustnotbetoolarge
AGACTAGAGAATTGCGATAG
Asequenceoflength20contains11differentwordsoflength10!
Now,considerthesequenceisspannedby2readsoflength13:
AGACTAGAGAATTGCGATAGAGACTAGAGAATT
AGAATTGCGATAG
T:R1:R2:
Itiseasytoseethatnotall11wordsoflength10canbereconstructedwiththetworeads.ThisviolatesthekeyassumptionofthedeBruijngraphsItisalsoeasytoseethatreducingKamelioratestheproblemandeventuallygetsridofit(justconsiderK=1…)
k = 151 N50: 57 kbp
Reference: 40%
Assembly parameter optimization using maximization of average contig length (N50) as objective can
preclude an entire genome from being assembled
k = 51 N50: 22 kbp
Reference: 99%
Fungal kmers
Algal kmers & seq errors
Fungal kmersAlgal kmers
seq errors