64
Whole genome shotgun sequencing DeNovo Assembly

Whole genome shotgun sequencing DeNovo Assembly...1. Reference guided sequence assembly: map reads to a reference sequence 2. De-novo sequence assembly: determine overlap between sequence

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

  • WholegenomeshotgunsequencingDeNovoAssembly

  • 2

    StrategiestosequencelongDNAmolecules:ShotgunSequencing

    TemplateDNA

    1. RandomlybreaktemplateDNAintopieces

  • 3

    StrategiestosequencelongDNAmolecules:ShotgunSequencing

    TemplateDNA

    1. RandomlybreaktemplateDNAintopieces

  • 4

    StrategiestosequencelongDNAmolecules:ShotgunSequencing

    1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentendsandsizeselect

  • 5

    StrategiestosequencelongDNAmolecules:ShotgunSequencing
Sometimesadaptersequencesremain!

    1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments

    Identifyingthesesequencesissimplewhenweignorethecomplexityofthesearch

    Theproblemis,whatsequence(s)arewelookingfor?

  • 6

    StrategiestosequencelongDNAmolecules:ShotgunSequencing
Sometimesadaptersequencesremain!

    1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments

    Identifyingthesesequencesissimplewhenweignorethecomplexityofthesearch

    Clipadaptersequences

  • 7

    StrategiestosequencelongDNAmolecules:ShotgunSequencing

    ReadPair1ReadPair2ReadPair3ReadPair4ReadPair5ReadPair6ReadPair7ReadPair8ReadPair9ReadPair10ReadPair11

    1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments4. Identifyandremoveadapterpartfromthedeterminedsequences5. Reconstructtemplatesequencefromthesequencereads

    Reconstructtemplate

  • 8

    1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments4. Removeadapterpartfromthedeterminedsequences5. Reconstructtemplatesequencefromthesequencereads

    1. Referenceguidedsequenceassembly:mapreadstoareferencesequence2. De-novosequenceassembly:determineoverlapbetweensequencereadsandassemble

    overlappingsequencesintocontigs.

    Contig1 Contig2 Contig3

    collapse

    collapse

    collapse

    StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereads

  • 9

    ReadPair11ReadPair2

    ReadPair5

    Contig1 Contig2 Contig3

    Scaffold1NN

    1. RandomlybreaktemplateDNAintopieces2. Addadaptersofknownsequencetothefragmentends3. Sequence(typically)theendsofthefragments4. Removeadapterpartfromthedeterminedsequences5. Reconstructtemplatesequencefromthesequencereads

    1. Referenceguidedsequenceassembly:mapreadstoareferencesequence2. De-novosequenceassembly:determineoverlapbetweensequencereadsandassemble

    overlappingsequencesintocontigs.Matepairinformationcanthenbeusedtobuildsuper-contigs(scaffolds)fromphysicallynon-overlappingcontigs.

    Scaffold2NNNN NN

    StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereads

  • 10

    1. Coverage:TheaveragenumberofreadscoveringapositioninthesequencedtemplateDNA.Lengthofgenomicsegment:LNumberofreads:nCoverage C=nl/LLengthofeachread:l

    Howmuchcoverageisenough?Lander-Watermanmodel: Assuminguniformdistributionofreads,C=10resultsin1gappedregionper1,000,000nucleotides->Thisisnomorethanacruderuleofthumbandgreatlydependsonreadlength,repeatcompositionofthetemplateDNA,etc.

    Contig1 Contig2 Contig3

    collapse

    collapse

    collapse

    StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereadsSomesummarystatisticstodescribeassemblies

    Thehigherthecoveragethebetter(providedunlimitedcomputationalresources)!Themoreuniformthecoveragedistributionthebetter!

  • 11

    2. N50-size:Morethan50%ofthebasesinyourassemblyresideincontigswithatleastthesizedeterminedbytheN50value.NOTE:YoucanofcoursespecifyanyotherN-value.

    Contig1 Contig2 Contig3

    collapse

    collapse

    collapse

    StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereadsSomesummarystatisticstodescribeassemblies

    WhatnowtellsustheN50sizeexactly?Isitaqualitymeasureaspeoplefrequentlyuseit?

    WhendoesitmakesensetomentiontheN50size(justconsiderRNAseqassemblies)?

  • 12

    3. Contiglengthdistribution

    Contig1 Contig2 Contig3

    collapse

    collapse

    collapse

    StrategiestosequencelongDNAmolecules:Shotgunsequencingandde-novoassemblyofthesequencereadsSomesummarystatisticstodescribeassemblies

    Note:Allthesestatisticscanbeusedforscaffoldsaswell!

  • Literatureonde-novoassemblies

    • J.R.Milleretal.Genomics95(2010)315–327• P.Compeauetal.NatureBiotechnology29(2011)987-991

    • ZerbinoandBirney.GenomeRes18(2008)821-829Velvet

  • DeNovoAssembly

    Theassemblyproblem…

    JLFB

    JLFB

  • Overlap-Layout-ApproachAssemblers:ARACHNE,PHRAP,CAP,TIGR,CELERA,MIRA

    Overlap:findpotentiallyoverlappingreads

    Layout:mergereadsintocontigsandcontigsintosupercontigs

    Consensus:derivetheDNAsequenceandcorrectreaderrors ..ACGATTACAATAGGTT..

  • Overlap

    • Findthebestmatchbetweenthesuffixofonereadandtheprefixofanother

    • Duetosequencingerrors,needtousedynamicprogrammingtofindtheoptimaloverlapalignment

    • Applyafiltertoremovepairsoffragmentsthatdonotshareasignificantlylongcommonsubstring

  • Differentapproachestothesequenceassemblyproblem

    modfiedfromCompeauetal.(2011)NatureBiotechnology29(11)

    Overlapbasedassembly➢ readidentityismaintained➢ intuitive➢ Readscanbeorganisedinanoverlapgraph➢ Graphcomplexityincreaseswithcoverage,

    thusreadredundancyinflatesthegraph

    Word-basedapproaches➢ readidentityis(temporarily)

    lost…➢ ReadsareorganisedindeBruijn

    graphs➢ Graphcomplexitydependson

    wordsize➢ Graphcomplexityis(byand

    large)independentfromcoverage,readredundancyisnaturallyhandled

    ➢ repeatsarerepresentedonlyonceinthegraphwithexplicitlinkstothedifferentstartandendpoints

  • De-NovoSequenceAssembly: 
CAP3

    18

    ShotgunSequenceReads

    1234567

  • De-NovoSequenceAssembly: 
CAP3

    19

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    1234567

    1 2 3 4 5 6 7

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    20

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    234567

    1 2 3 4 5 6 7

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    21

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    1

    234567

    1 2 3 4 5 6 7

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    22

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    1

    234567

    1 2 3 4 5 6 7

    Removethetrivialsolution(alignmentagainstitself)

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    23

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    234567

    1 2 3 4 5 6 7

    Removethetrivialsolution(alignmentagainstitself)

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    24

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    1

    234567

    1 2 3 4 5 6 7

    1

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    25

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    1

    234567

    1 2 3 4 5 6 7

    1

    1

    3 5

    1Candidatepairsforread1:

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    26

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    34567

    1

    3 5

    1Candidatepairsforread1:

    2

    1 2 3 4 5 6 7

    2

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    27

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    234567

    1 2 3 4 5 6 7

    2

    1

    3 5

    1Candidatepairsforread1:

    6

    2Candidatepairsforread2:

    2

    3

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    28

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    4567

    1 2 3 4 5 6 7

    1

    3 5

    1Candidatepairsforread1:

    6

    2Candidatepairsforread2:

    Candidatepairsforread3:

    3

    6

    3

    3

    7

    3

    2

    3

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments

    29

    ShotgunSequenceReads

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    4567

    1 2 3 4 5 6 7

    1

    3 5

    1Candidatepairsforread1:

    6

    2Candidatepairsforread2:

    Candidatepairsforread3:6

    3

    7

    3

    Candidatepairsforread7:

    .

    .

    .

    2

    3

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments:post-processing

    30

    1

    3

    5

    1

    6

    2

    6

    3

    7

    3

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    3. Removepoorqualitysequenceends

    4. Computeglobalalignmentforthehighqualitysequencepairstoverifyoverlaps.

    2

    3

    4

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments:post-processing

    31

    1

    3

    5

    1

    6

    2

    6

    3

    7

    3

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    3. Removepoorqualitysequenceends

    4. Computeglobalalignmentforthehighqualitysequencepairstoverifyoverlaps.Evaluateaccordingtothefollowingcriteria:1. minimumlength2. minimumidentity3. minimumsimilarity4. numberofhigh-qualitymismatches

    Removesequencepairsthatdonotmeetthethresholdsfor4.1to4.4

    4

    2

    3

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments:post-processing

    32

    1

    3

    5

    1

    6

    2

    6

    3

    7

    3

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    3. Removepoorqualitysequenceends

    4. Computeglobalalignmentforthehighqualitysequencepairstoverifyoverlaps.Evaluateaccordingtothefollowingcriteria:1. minimumlength2. minimumidentity3. minimumsimilarity4. numberofhigh-qualitymismatches

    Removesequencepairsthatdonotmeetthethresholdsfor4.1to4.4

    4

    2

    3

  • De-NovoSequenceAssembly(CAP3) 
Searchforlocalalignments:post-processing

    33

    1

    3

    5

    1

    6

    2

    6

    3

    7

    1. Concatenatesequencesintoacombinedsequence.Readsareseparatedbyaseparationcharacter.

    2. Computehighscoringchainsofsegmentsbetweeneachreadandthecombinedsequenceusinglocalalignmentsearchtools.Identifycandidatepairs.Everypairiscountedonlyonce.

    3. Removepoorqualitysequenceends

    4. Computeglobalalignmentforthehighqualitysequencepairstoverifyoverlaps.Evaluateaccordingtothefollowingcriteria:1. minimumlength2. minimumidentity3. minimumsimilarity4. numberofhigh-qualitymismatches

    Removesequencepairsthatdonotmeetthethresholdsfor4.1to4.4

    4

    2

    3

  • 1

    3

    16

    2

    6

    3

    2

    3

    CAP3:ContigBuilding

    5

    4 7

    34

  • 5

    16

    2

    6

    3

    CAP3:ContigBuilding

    1

    3

    2

    4 7

    35

  • 5

    16

    2

    6

    CAP3:ContigBuilding

    1

    3

    2

    4 7

    36

  • 5

    6

    CAP3:ContigBuilding

    13

    2

    6

    2

    4 7

    37

  • 5

    6

    CAP3:ContigBuilding

    13

    26

    2

    1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise

    analysis(Greedyalgorithmindecreasingorderofoverlapscores).

    2) Inasimpleview:Checkthelayoutforincompatibilities.

    4 7

    38

  • 5

    6

    CAP3:ContigBuilding

    13

    26

    2

    1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise

    analysis(Greedyalgorithmindecreasingorderofoverlapscores).

    2) Inasimpleview:Checkthelayoutforincompatibilities.

    1) sequenceread1and2areincompatiblesincetheycouldnotbe

    aligned.

    4 7

    39

  • 5

    6

    CAP3:ContigBuilding

    13

    6

    2

    1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise

    analysis(Greedyalgorithmindecreasingorderofoverlapscores).

    2) Inasimpleview:Checkthelayoutforincompatibilities.

    1) sequenceread1and2areincompatiblesincetheycouldnotbe

    aligned.

    2) resolveincompatibility

    3) checkfornewpossiblelayouts

    4 7

    40

  • 5

    6

    CAP3:ContigBuilding

    13 2

    1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise

    analysis(Greedyalgorithmindecreasingorderofoverlapscores).

    2) Inasimpleview:Checkthelayoutforincompatibilities.

    1) sequenceread1and2areincompatiblesincetheycouldnotbe

    aligned.

    2) resolveincompatibility

    3) checkfornewpossiblelayouts

    4 7

    41

  • 5

    6

    Furtherstepsingenomesequencing

    13 2

    1) Generateagenerallayoutusingtheoverlappingreadsfromthepair-wise

    analysis(Greedyalgorithmindecreasingorderofoverlapscores).

    2) Inasimpleview:Checkthelayoutforincompatibilities,remove

    incompatiblereadsandalign.

    3) Buildaconsensussequenceforeachcontigs.

    4) Orderandorientcontigsifpossibleusingadditionalinformation,e.g.,

    pairedendreads.

    4 7

    42

  • DeriveConsensusSequence

    Derivemultiplealignmentfrompairwisereadalignments

    TAGATTACACAGATTACTGA TTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAAACTATAG TTACACAGATTATTGACTTCATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGGGTAA CTA

    TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

    Derive each consensus base by weighted voting

  • Errorcorrectionbyweightedvoting

    • Correcterrorsusingmultiplealignment

    TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGA

    C: 20C: 35T: 30C: 35C: 40

    C: 20C: 35C: 0C: 35C: 40

    • Scorealignments• Acceptalignmentswithgoodscores

    A: 15A: 25A: 40A: 25-

    A: 15A: 25A: 40A: 25A: 0

  • Consensus(cont’d)• Aconsensussequenceisderivedfromaprofileofthe

    assembledfragments

    • Asufficientnumberofreadsisrequiredtoensureastatisticallysignificantconsensus

    • Readingerrorsarecorrected

  • SequenceassemblywithDeBruijngraphs

  • Sequenceassemblycanbeabstractedtotheshortestsuperstringproblem 


    (NicolasdeBruijn1946)

    Problem:findtheshortest(circular)superstringthatcontainsallpossiblesubstringsoflengthKoveragivenalphabet.

    forK=4andatwoletteralphabetA={0,1}wehave16differentwords:

    0000,0001,0010,0100,1000,0011,0110,1100,1001,1010,0101,0111,1011,1101,1110,1111

  • Sequenceassemblycanbeabstractedtotheshortestsuperstringproblem 


    (NicolasdeBruijn1946)

    Problem:findtheshortest(circular)superstringthatcontainsallpossiblesubstringsoflengthKoveragivenalphabet.

    forK=4andatwoletteralphabetA={0,1}wehave16differentwords:

    0000,0001,0010,0100,1000,0011,0110,1100,1001,1010,0101,0111,1011,1101,1110,1111

    Tosolvethisproblem,deBruijnborrowedfromEulerwhosolved1735the‘Königsberg’problem,i.e.thequestionwhetheritispossibletovisiteachislandbycrossingeachbridgeexactlyonce(Euleriancycle)

  • EulerianCycleProblem

    • Findacyclethatvisitseveryedgeexactlyonce(Lineartime)

    • AnEulerianCycleexistsifthenumberof‘outgoing’edgesforanodeequalsthenumberof‘incoming’edges*.

    • Thegraphmayhave2nodeswithanoddnumberofedgesconnectedtoit.InthiscaseanEulerianpathratherthananEuleriancyclecanbefound.

    *ProofbyC.Hierholzer

  • deBruijnsolvedtheproblembyrepresentingK-1mersasnodesandKmersasedgesinadirectedgraph.

    Bydoingso,herelatedtheproblemoffindingashortestcommonsuperstringtothealreadysolvedproblemoffindinganEuleriancyclein

    agraph.

    001 011

    100 110

    000 010 101 111

    0011

    0010

    0101

    10111010

    1101

    0111

    1111 1110

    1100

    10010001

    10000000

    0110

    0100

    I

    II

    III

    IV

    V

    VI

    VII

    VIII

    IXX

    XI

    XIIXIII

    XIV

    XV

    XVI

  • 001 011

    100 110

    000 010 101 111

    0011

    0010

    0101

    10111010

    1101

    0111

    1111 1110

    1100

    10010001

    10000000

    0110

    0100

    I

    II

    III

    IV

    V

    VI

    VII

    VIII

    IXX

    XI

    XIIXIII

    XIV

    XV

    XVI

    Passingthroughtheedgesbyfollowingtheromannumbersreconstructsthesuperstringusingeachwordexactlyonce!

    I:0000,II:0001,III:0011;IV:0110;V:1100;VI:1001;VII:0010;VIII:0101;IX:1011;X:0111;XI:1111;XII:1110;XIII:1101;XIV:1010;XV:0100;XVI:1000

    0000110010111101

  • AdvancingtoDNAsequenceassemblyisstraightforward

    modfiedfromCompeauetal.(2011)NatureBiotechnology29(11)

    Classicaloverlapassembly(readidentityismaintained)

    Kmerapproaches(readidentityislost)

  • DeBruijnGraphExample
Shredreadsintok-mers(k=3)

    G G A C T A A

    G G A

    G A C

    A C T

    C T A

    T A A

    G A C C A A A

    G A C

    A C C

    C C A

    C A A

    A A A

    53

    Read1 Read2

    GG(1x)

    GA(1x)

    AC(1x)

    CT(1x)

    TA(1x)

    AA(1x)

    GGA GAC ACT CTA TAA

    GA(1x)

    AC(1x)

    CC(1x)

    CA(1x)

    AA(1x)

    AA(1x)

    GAC ACC CCA CAA AAA

  • DeBruijnGraphExample
Mergeverticeslabeledbyidentical(k-1)-mers

    Read1:

    Read2:

    54

    ResultingGraph:GG(1x

    GA(2x)

    AC(2x)

    CT(1x)

    TA(1x)

    AA(2x)

    CC(1x)

    CA(1x)

    AA(1x)

    GG(1x)

    GA(1x)

    AC(1x)

    CT(1x)

    TA(1x)

    AA(1x)

    GA(1x)

    AC(1x)

    CC(1x)

    CA(1x)

    AA(1x

    AA(1x)

  • AnotherExample 
Constructthegraph(k=5)

    AGAT(8x)

    ATCC(7x)

    TCCG(7x)

    CCGA(7x)

    CGAT(6x)

    GATG(5x)

    ATGA(8x)

    TGAG(9x)

    GATC(8x)

    AAGT(3x)

    AGTC(7x)

    GTCG(9x)

    TCGA(10x)

    GGCT(11x)

    TAGA(16x)

    AGAG(9x)

    GAGA(12x)

    GACA(8x)

    ACAA(5x)

    GCTT(8x)

    GCTC(2x)

    CTTT(8x)

    CTCT(1x)

    TTTA(8x)

    TCTA(2x)

    TTAG(12x)

    CTAG(2x)

    AGAC(9x)

    CGAG(8x)

    CGAC(1x)

    GAGG(16x)

    GACG(1x)

    AGGC(16x)

    ACGC(1x)

    55

    Abranchingvertexiscausedbyeitherarepeatintheoriginalsequenceorasequencingerror

    Sequencingerrorsaretypicallydetectedbyacoveragecutoffthreshold

  • Condenseunbranchedrunsinthegraph

    AAGTCGA

    TAGAGCTTTAG

    GCTCTAG

    GAGACAA

    CGAG

    CGACGC

    GAGGCT

    AGATCCGATGAG

    56

    AGAG

  • Correctsequencingerrorsusingacoveragethreshold

    AAGTCGA

    TAGAGCTTTAG

    GAGACAA

    CGAG

    GAGGCT

    AGATCCGATGAG

    57

    AGAG

  • Afterrecondensation

    AAGTCGAG GAGACAAGAGGCTTTAGA

    AGATCCGATGAG

    58

    AGAG

    Source:SerafimBatzoglou

    Anynon-branchingpathinthisgraphcorrespondstoacontigintheoriginalsequence.

    Takingtheriskoffollowingarbitrarybranchingpathsmaycreatechimericspecies

  • BasicconceptsofdeBruijngraphbasedassemblers

    • ThesequenceistreatedasaconsecutivestringofwordsoflengthK

    • Sequencereadsarenolongerconsideredtorepresentaconsecutivestringofnucleotides.Thusreadlengthaswellasreadoverlapbecome,inprinciple,irrelevant.

    • SequencereadsareonlyusedtoidentifywordsoflengthKoccurringinthesequence.

    • Givenperfectdata–error-freeK-mersprovidingfullcoverageandspanningeveryrepeat–theK-mergraphwouldbeadeBruijngraphanditwouldcontainanEulerianpath,thatis,apaththattraverseseachedgeexactlyonce.

  • Themagic‘Kmer’givesmostusersofgraphbasedassemblyalgorithmsaveryhardtimeastheyhavetodecideonthesizeofK.

    TogiveaninformedstatementweneedtomakesuretounderstandwhatKshouldrepresentandwhatthealgorithmicrequirementsofdeBruijngraphassemblersare

  • • Kmustrepresentawordthatoccursonlyonceinthesequencethatshouldbeassembled.Thus,Kmustbesufficientlylarge.

    • deBruijngraphbasedassemblersassumethateachwordoflengthKoccurringinthegenomeisalsorepresentedinthegraph.AsKmersarecollectedfromafinitesetofsequencereads,Kmustnotbetoolarge.

  • • consideraDNAwordofK=2,howoftendoesitonaverageoccurinastringof16bp?

    • HowaboutawordofK=25*

    DefaultforTrinity(Grabherretal.2011)

    Takehomemessage:IfKisonlysufficientlylargethechanceforanyKmertooccurmorethanonceina

    (repeat-free)genomeapproaches0.

    WhynotusingsimplythereadlengthasK?

  • WhyKmustnotbetoolarge

    AGACTAGAGAATTGCGATAG

    Asequenceoflength20contains11differentwordsoflength10!

    Now,considerthesequenceisspannedby2readsoflength13:

    AGACTAGAGAATTGCGATAGAGACTAGAGAATT

    AGAATTGCGATAG

    T:R1:R2:

    Itiseasytoseethatnotall11wordsoflength10canbereconstructedwiththetworeads.ThisviolatesthekeyassumptionofthedeBruijngraphsItisalsoeasytoseethatreducingKamelioratestheproblemandeventuallygetsridofit(justconsiderK=1…)

  • k = 151 N50: 57 kbp

    Reference: 40%

    Assembly parameter optimization using maximization of average contig length (N50) as objective can

    preclude an entire genome from being assembled

    k = 51 N50: 22 kbp

    Reference: 99%

    Fungal kmers

    Algal kmers & seq errors

    Fungal kmersAlgal kmers

    seq errors