39
Assembly in Practice: Part 2: DBG Ben Langmead Department of Computer Science Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briey how you are using the slides. For original Keynote les, email me ([email protected]).

Assembly in Practice: Part 2: DBG

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Assembly in Practice: Part 2: DBG

Assembly in Practice: Part 2: DBGBen Langmead

Department of Computer Science

Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me ([email protected]).

Page 2: Assembly in Practice: Part 2: DBG

Assembly alternatives

Alternative 1: Overlap-Layout-Consensus (OLC) assembly

Alternative 2: De Bruijn graph (DBG) assembly

Overlap

Layout

Consensus

Error correction

De Bruijn graph

Scaffolding

Refine

Page 3: Assembly in Practice: Part 2: DBG

CGAAAAA

GAAAAAC

AAAAAAC

AAAAACC

AAAAACTAAAAACA

GAAAAAGAAAAAGG

TAAAAAT

AAAAATT

GGAAACA

GAAACAC

AAAACCA

AAAAAAA

AAAAAAG

TGAAAACGAAAACT

AAAAAGA

AAAAGAG

AAAAAGC

AAAAGCC

AAAAGGC

GAAAATGAAAATGT

TAAAATT

AAAATTT

GTATCTATATCTAC

AAACACA

CGCCATT

GCCATTA

AAAACAT

AAACATG

AAACCAT

TGACGGTGACGGTA

AAAACCGAAACCGC

AGCCCGC

GCCCGCA

GAAACGC

AAACGCA

AGAACGTGAACGTT

AAAACTG

AAACTGC

AAAACTTAAACTTT

AAGGTAA

AGGTAAC

AAAGAGT

GAAAGCA

AAAGCAA

AAAGCAG

AAAGCCC

AGAAGCG

GAAGCGC

AAAGGCG

TGAAGTT

GAAGTTC

GAAGTTA

CAAATAA

AAATAAA

TAAATAC

AAATACT

CAGCGAT AGCGATG

AAAATCA

AAATCAC

AAATCAT

CTTAGGT

TTAGGTC

GGAAAGC

CAAATGC

AAATGCA

AAATGTC

AAATGTA

AAAATTA

AAATTAC

GGAATTT

GAATTTG

AAACAAAAACAAAG

TAACAAC

AACAACC

AACAACG

AAGCGCG

AGCGCGT

AACACAG

AGAAAAA

GAAAAAA

AGACAGA

GACAGAT

CTAAATA

TGACAGTGACAGTG

CTGGTAA

TGGTAACCAACATC

AACATCC

AACATGT

TGACATTGACATTG

CGACCAA

GACCAAA

TGACCAC

GACCACA

GACCACC

CGACCAG

GACCAGG

CAACCAT

AACCATC

AACCATG

ATGGTGC

TGGTGCT

GTGGGAC

TGGGACG

CTGACGG

TGACGGG

TGACCTGGACCTGG

CCCCGCC

CCCGCCA

AAAAAATAAAAATG

CTGCTGG

TGCTGGC

TGCTGGT

TGACGAG

GACGAGA

CAACGATAACGATG

GGACGCA GACGCAA

TGACGCG

GACGCGT

GGACGCTGACGCTA

CAACGGG

AACGGGC

TAACGGT

AACGGTG

GAACGTA

AACGTAT

AAAACAGAAACAGC

ACGGTTC

CGGTTCC

CGACTAC

GACTACT

CGCAATGGCAATGG

CAGGTTG

AGGTTGC

GGACTCGGACTCGC

TAACTGCAACTGCC

GAACTGG

AACTGGT

TGACTTA

GACTTAG

GAACTTC

AACTTCT

AACTTTC

CGAGAAAGAGAAAA

GAGAAAT

GCGTAAACGTAAAT

TGCAACG

GCAACGG

AACCGGTACCGGTC

ACCACCA

CCACCAT

CCACCAC

CACCGTAACCGTAT

AAGAGTG

TAAATTA

AAATTAA

AAGCAAT

TAGTGCG

AGTGCGG

AAGCAGA

CTGCGATTGCGATG

AAGCCCG

CAAGCCG

AAGCCGC

ACACAGA

CACAGAA

GTAACGG

CGAGCTG

GAGCTGG

AGAAAAT

GTTCTGATTCTGAT

CGAACGTGAACGTC

AGAGGCAGAGGCAG

AAGGCGA

AAAGGTA

GAAGGTTAAGGTTT

TGAGTAA

GAGTAAA

AGAGTAC

GAGTACA

TGAACTG

ACCGCTG

CCGCTGA

CCTATATCTATATA

CGAGTGTGAGTGTT

CGAGTTAGAGTTAG

AAGTTCG

AATAAAA

CGAGGTA

GAGGTAA

AATACTT

ACATCAG

CATCAGT

TGATAGC

GATAGCA

GTCCTCT

TCCTCTC

CAATATAAATATAGCAATATC

AATATCA

CAATATGAATATGT

CGATATT

GATATTC

AATCACC

AAACAGTAACAGTT

GAATCCA

AATCCAC

CAATCCGAATCCGC

TTCCTGCTCCTGCG

CGATCGC

GATCGCC

ACGTATTCGTATTT

GAATCTA

AATCTAC

CAAAGCC

TAATGAA

AATGAAA

TGCGGGT

GCGGGTT

GCGGGTG

TAATGAGAATGAGA

CGATGAT

GATGATT

AATGCAG

CGATGCC

GATGCCG

CACACCAACACCAT

CTTTTTT

TTTTTTT

TGATGGC

GATGGCT

ACGGGGCCGGGGCT

AATGTCG

AATTAAA

AATTACA

GGCGCTGGCGCTGT

GAATTCCAATTCCG

AAATTCGAATTCGG

CAAAGGCAAAGGCA

TGATTGA

GATTGAA

TGATTTAGATTTAG

GGATTTC

GATTTCC

AATTTGC

AAATTTT

AATTTTA

GACAAAAACAAAAG

GGCAAAT

GCAAATG

TACAACA

ACAACAG

ACAACCA

GGCAAGC

GCAAGCC

ATGTCCT

TGTCCTG

GGCAATA

GCAATAT

AGCAATG

GCAATGC

CACAATT

ACAATTG

TACACAA ACACAAC

TCGTCGA

CGTCGAC

AACACATACACATT

AGCACCA

GCACCAC

GCACCAG

TAGCGAGAGCGAGA

CGCACCGGCACCGA

CGCACCT

GCACCTG

GACACGCACACGCC

CACCGAAACCGAAA

GTCGACCTCGACCA

ACAGAAA

CACAGAC

ACAGACA

TACAGAG

ACAGAGT

ACAGATA

AGCAGCGGCAGCGT

CGCAGCTGCAGCTG

TACAGGA

ACAGGAA

GGCAGGG

GCAGGGG

GCAGGGC

CACAGGT

ACAGGTA

TGCAGTAGCAGTAC

ACAGTGC

GACATACACATACA

GGCATAGGCATAGC

TACATCA

ACATCCA

CCGTTAT

CGTTATC

CACATGAACATGAT

GCTATATCTATATG

TGCATGG

GCATGGC

ACATGTC

CGCATTA

GCATTAG

CGCATTC

GCATTCC

GCATTCG

GGCATTGGCATTGC

TGAATCC

CGCCAAA

GCCAAAA

CACCAAC

ACCAACC

AACCAAT

ACCAATA

TACCACA

ACCACAG

CACCACC

TAAAAACCACCACT

ACCACTA

TACCAGCACCAGCA

ACCAGGA

CGAATCT

ACCATCT

ACCATGC

AACCATT

ACCATTA

TACCCAA

ACCCAAT

TACGGCCACGGCCG

CGCCCAG

GCCCAGC

TGCCCCC

GCCCCCA

GCCCCCG

GAAATGC

CACCCGCACCCGCC

ACCCGCA

TGCCCGG

GCCCGGA

AGCGATTGCGATTA

CGTCGAAGTCGAAT

CGCCCTGGCCCTGC

CAGTGCC

AGTGCCC

TGCCGAA

GCCGAAC

TACCGAGACCGAGG

GGAATGTGAATGTG

AGCCGCA

GCCGCAT

CGCCGCC

GCCGCCC

GCCGCCG

ACGGGCA

CGATCCG

GATCCGG

GTCAGCG

TCAGCGA

GGCCGGC

GCCGGCG

AGCCGGG

GCCGGGA

TGCCGGT

GCCGGTA

GCCGGTC

CGCCGTA

GCCGTAT

TACCGTCACCGTCA

TGCCGTG

GCCGTGA

GCCGTGG

TACCGTT

ACCGTTA

ACCGTTG

TTAATTGTAATTGA

TACCTCG

ACCTCGA

CAGTGGC

AGTGGCA

CACCTGA

ACCTGAC

ATCGCCA

TCGCCAT

TGCCTGGGCCTGGT

GGCGCGTGCGCGTT

CAGCGTGAGCGTGT

CACCTTTACCTTTA

GCGGGTCCGGGTCG

TCACAAT

TCGTGGT

CGTGGTC

CGTGGTT

CCCGCCG

CCGCCGT

TAAAAAA

CACAACA

GGCGAGA

GCGAGAA

GGCGAGC

GCGAGCT

AACGAGG

ACGAGGT

TGCGAGT GCGAGTG

AGCGATAGCGATAA

TAGCGCG

GCGATGA

GCGATGC

ACGGTGC

CGGTGCG

ACGCAAC

AGCGCAC

GCGCACA

AACGCAT

ACGCATT

CAAAAGCAAAAGCA

CCGTGTGCGTGTGG

CGCGCCG

GCGCCGA

TACGCGC

ACGCGCC

TTACACA

AACGCGTACGCGTG

AACGCTG

ACGCTGC

TACGGAGACGGAGT

ACGCGTA

TGCGGCA

GCGGCAA

GCGGCAG

GGCGGCCGCGGCCT

GGCGGCGGCGGCGT

GGCGGGAGCGGGAC

TGCGGGC

GCGGGCT

TACACAGACACAGT

TTTTTTC

TTTTTCG

GGCGGTA

GCGGTAC

AGCGGTCGCGGTCA

AACGGTT

ACGGTTA

GCACAGA

TGCGCTG

GCGCTGA

CATTACC

ATTACCT

ATTACCA

GGCGTGT

GCGTGTT

AACGTTA

ACGTTAC

GTACAGG

AACGTTTACGTTTT

CACTAAA

ACTAAAT

AAGTTTCAGTTTCG

GAGTTCG

AGTTCGG

TGAAAAA

ACTACTC

CTCTGCC

TCTGCCC

AGCTATGGCTATGA

TTACATCTACATCG

CCGTTGA

CGTTGAT

TACTCCAACTCCAC

TACTGAAACTGAAA

TACTCCG

ACTCCGC

AACTCCTACTCCTG

ACTCGCC

TGCAGAA

GCAGAAC

CAACATTAACATTC

AGCTGGT

GCTGGTG

TGCTGAA

GCTGAAT

GGCTGAC

GCTGACG

GCTGACC

GACTGAGACTGAGG

CGCTGAT

GCTGATT

GCTGATC

GACTGCA

ACTGCAA

CACTGCC

ACTGCCG

ACTGCCT

CGCTGCG

GCTGCGC

GCTGCGG

TTACCAC

TACCACC

CGCTGGC

GCTGGCA

ACTGGTT

TCCGCCCCCGCCCC

AACACCAACACCAC

AGCTGTTGCTGTTG

GTGCTGG

GACGGGA

ACGGGAC

ACTTAGG

CACCATC

CACCATT

CAGTTTG

AGTTTGT AGTTTGC

ACTTCTG

TCGCCGT

ACAGCGA

CAGAGTG AGAGTGT

TGCTTTA

GCTTTAC

ACTTTCG

TAAAACA

GGCTTTT

GCTTTTT

ATAAAAC

ATAAAAA

GGGAAATGGAAATA

CCACCCG

CAGAACG

GGGAAGCGGAAGCA

AGGAAGGGGAAGGG

CAAAATC

AGGAATT

CAGACAG

AGTAAAT

GTAAATT

ACCAAAG

TGCGGTG

GCGGTGC

GGGACGC

GGACGCG

GGGACTAGGACTAG

GGGACTC

GGAACCAGAACCAC

TTACCGT

CGCGTACGCGTACA

AGTAACTGTAACTT

CAGAGTA

ATGCCAATGCCAAG

TTACCTC

CAGATAA

AGATAAA

AGATAAT

CGGATAG

GGATAGC

TTACCTG

TACCTGC

CTGACGC

TGGATTA GGATTAA

GGGATTCGGATTCC

TGCCCGC

GCCCGCG

GGGATTT

TTAAAATTAAAATA

CGGCAAG

GGGCAAT

TGGCACAGGCACAA

TAGCACC

TGGCACGGGCACGA

TAGCAGC

AGCAGCT

AGGCAGG

TGGCAGTGGCAGTG

AGGCATA

GGCATAC

TAGCATC AGCATCA

CGGCCAA

GGCCAAA

TGGCCAC

GGCCACC

GTGGTCATGGTCAC

GAGCCGAAGCCGAT

TGGCCGG

CTGAGTGTGAGTGC

CGCGTGG

GCGTGGT

CGGCCTGGGCCTGT

AGGCGAG

TAGCGATTAGCGCA

CAGCGCG

CACCGCCACCGCCA

TGGCGGC

TTATCACTATCACC

CGGCGGT

CGTACAG

CGGCGTG

TGCTGAT

GCTGATG

GCCCGAACCCGAAT

TTCGCTGTCGCTGG

GGGCTCTGGCTCTT

GGGCTGA

ACCTGGTCCTGGTA

TTTACCC

TTACCCA

CAGGAAA

AGGAAAC

ACCCCGACCCCGAA

CAACGGT

TCTACCG

CTACCGT

CGGGACT

CAGGATAAGGATAG

CAGGATG

AGGATGC

CGGGATT

CGGGCAA

GGGGCAG

GGGCAGG

GGGCAGT

TAGGCAT

CGGGCCTGGGCCTT

CGGGCTG

CGGGCTT

GGGCTTT

CACCGCT

AGGGGCA

GGGGGCCGGGGCCT

CAGGGGC

CCTATTTCTATTTT

CTACTCC

CAGGTAA

GAGGTACAGGTACA

GGGGTAGGGGTAGC

TAGGTCA

AGGTCAC

TGGGTCA

GGGTCAT

GGGTCAG

GGGGTGAGGGTGAC

CAGGTGG

AGGTGGC

TATACTT

ATACTTT

TGGGTTAGGGTTAT

CAGGTTCAGGTTCT

CGGGTTG

GGGTTGC

GGTAACG

GGTAACA

GGTAACC

GAGTAAGAGTAAGT

CGGTAAT

GGTAATG

AGTACAC

AGTACAT

TTGAGGCTGAGGCA

TGGTAGC

GGTAGCG

GGTCACT

GGTCACC

CGGTCAG

GGTCAGG

GGTCATT

TGACTGC

CGGTCGA

GGTCGAA

TGGTCGTGGTCGTT

ATCGATC

TCGATCC

CCAGGAA

GCAGCTT

GTAGCGA

GGTGCGG

GGTGCTG

TAACCAA

GGTGGCC

CAGTGGG

AGTGGGT

TGGTGGT

GGTGGTT

GAGTGTC

AGTGTCT

CAGTGTGAGTGTGG

AGTGTTG

TGGTTAC

GGTTACC

TCACCACCACCACA

GGTTCCG

TGGTTCT

GGTTCTG

GGTTGCC

GGGTTGTGGTTGTC

TGGTTTC

GGTTTCA

CCAACCA

AGCAGGAGCAGGAA

GATAAAA

CATAAACATAAACG

ATTGAAA

CTCCGCT

TCCGCTG

GATAACAATAACAA

AATAACCATAACCA

GTAACGA

ATAGAAC

TAGAACA

GTAATGA

AATACAAATACAAC

GTACACA

CCATTAC

GGTACAT

GTACATC

GCCTGTGCCTGTGT

AATAGAA

ATAGCAT

ATAGCAG

CATAGCG

ATAGCGC

TATAGGCATAGGCA

GATAGTAATAGTAT

CGCAACG

CTGCCGG

AGTATACGTATACA

ATATAGG

ACGTCACCGTCACT

ATATCAG

CTCTCTG

TCTCTGC

TATATCGATATCGG

GTAACAA

ATATGTC

ATATTCT

GATATTG

ATATTGC

GTATTTT

ACTGCACCTGCACC

CATCAAC

ATCAACG

CAAAGGT

GGTCACA GTCACAA

ATCACCA

TTCTGAC

TCTGACG

TCTGACT

GTCAGGA

ATCAGTG

ATCAGTT

TGCCGAGGCCGAGA

GTCATTA

ATCCACC

ATCCACG

CCACCGT

CACCGTC

CATCCAT

ATCCATG

AGCTTTA

GATCCCGATCCCGA

ACCGTCC

ATCCGGT

CGTTCCGGTTCCGG

CGTCCTC

GTCCTGC

CGCTGCCGCTGCCC

GTCGAAA

TCACCAT

TATCGAT

CCAGCACCAGCACC

CCGTCCT

TGTCGGGGTCGGGG

AACTGCT

ACTGCTG

GATCGTAATCGTAT

TTCACCA

TCACCAA

CGTATTCGTATTCC

ATCTACC

CGTATTG

GTATTGC

GCTGGCG

TGTCTGA

GTCTGAT

ATCAGGTTCAGGTA

CATCTGG

ATCTGGT

CATCTGTATCTGTG

TGTCTTAGTCTTAG

TTTTCTG

TTTCTGC

CTTTCGT

ATGAAAA

ATGAAAG

GTTATGGTTATGGA

GTGCTGATGCTGAC

CGTGAGT

GTGAGTA

GGTGATGGTGATGG

ATGATTG

ATGCAGA

TGTCAATGTCAATC

AATGCCA

ATGCCAG

GTGCCCG

ATGCCGA

CATGCGA ATGCGAG

GTGCGGG

CATCACC

AATGGAAATGGAAA

TGTGGAT GTGGATT

GTGGCAA

GTGGCCA

CGTGGCG

GTGGCGA

ATGGCTG

GTGGGTCCATGGTG

GTGGTTC

GTCACTA

ACGGGTT

CATGTCC

ATGTCGA

TATGTCT

ATGTCTC

CACTCCG

GGGATTAGGATTAC

TGTGTGG

GTGTGGA

CGTGTTA

GTGTTAG

GTGTTGA

GATTAAAATTAAAA

ATTACAG

CGTTACC

GTTACCG

TATTACGATTACGG

CATTACTATTACTA

TGTTAGA

GTTAGAA

CATTAGC

ATTAGCA

ATTAGCG

AGCTTCT

GCTTCTG

AGTTATAGTTATAG

GTTATCG

CATTATG

ATTATGG

CGGTACA

CATTCCG

ATTCCGG

CCAGGCA

CAGGCAG

GTTCGGC

GGTTCTCGTTCTCG

TATTCTG

ATTCTGG

GTTTCGCTTTCGCT

TGTTGAA

GTTGAAG

TATTGACATTGACT

GTTGATA

GTTGCCG

TATTGCG

ATTGCGG

TATTGCT

ATTGCTG

TACTTTAACTTTAA

TTGATAT

TCCGGTC

CCGGTCG

TTACCCGTACCCGA

AGGCTGA

TGTTTACGTTTACG

GTTTCAC

ATTTCCG

AGTTTGAGTTTGAT

GATTTGCATTTGCC

GTTTGTT

ATTTTAT

CGTTTTC

GTTTTCT

TTTGCCG

TTGCCGA

TTGCCGT

TATTTTT

ATTTTTG

TTAAAAA

CCAAAAT

ACAAAGC

CCAAAGG

CCAAATA

TGGAAAG

CAACAGG

ACAACAT

TTAACCA

TAACGAG

TAACGAC

TCAACGC

CAACGCT

CAACGCC

ACAACGTCAACGTT

AATCCGG

CGCCGAT

GCCGATT

TTCCGCT

CGAGTAA

GACTTCT

CCGAACT

CGAACTT

GCGCGTG

TATCAGC

ATCAGCG

GCAATACCAATACG

TAACCGTAACCGTG

CCAATAT

AGCTTTT

GCTTTTC

CAATGCC

TTAATTATAATTAA

CCGAAGGCGAAGGT

CAATTGA

TCACAACCACAACG

GCACAAT

TTACAGA

TACAGAT

CAGAATAAGAATAC

CCACAGG

TCAGTCTCAGTCTG

GTACATATACATAA

CCACATG

CACATGG

TCGAATC

ACACCAGCACCAGC

TGTCGAT

GTCGATC

CCTGCCG

CTGCCGT

GCAGTGC

TTACCGCTACCGCT

GCAGTGG

GGATGCT

GATGCTT

TAGCGGT

AATTGAA

TTACGCG

TATTGATATTGATG

TTACGGTTACGGTG

TTACGTCTACGTCG

TCACTAA

CACTAAC

TGAAACG

GCACTATCACTATT

TACTCCC

GTTCCGA

TCACTGC

CACTGCA

ATACTGGTACTGGA

CAGAAAA

TTAGAAG

TAGAAGC

AGTCGGTGTCGGTC

CGATGGCGATGGCC

ATAGCAATAGCAAC

TTAGCAC

CCAGCCACAGCCAA

CCAGCCG

CAGCCGG

GCGACGCCGACGCC

TTAGCGG

TCGAAAA

GTGACGG

CAGCTTC

CAGGAAT

TCAGGAT

CAGGATC

CACATCA

CATTCTG

ATTCTGA

ACGACTA

ATATTACTATTACA

GCAGGTG

GCAGGTT

GCAGTGACAGTGAG

CAGTGCG

TCAGTGG

ACAGTTGCAGTTGA

TCAGTTT

ATATAAATATAAAA

CGATTCTGATTCTA

TGGCTGG

GGCTGGT

TGTCTCT

GTCTCTG

GCATCAA

TTCTGCG

TTATCGA

TGATATT

GTTCTGG

TTCTGGG

CCATCTG

TCGAGCCCGAGCCG

CCATGAA

CATGAAA

CATGAAC

ACATGACCATGACA

ACATGCACATGCAG

CCATGCG

CCATCAC

CATGGCA

CATGGCT

ACATGGT

GCATCAG

ATTCCGCTTCCGCC

CCCAACGCCAACGT

TCATTAC

CCATTAG

CCATTAT

CATTATC

GGTGCCGGTGCCGA

TCATTCT

TTATTGACGTCTTT

GTCTTTT

ATGAAAC

CCCAAAT

TCTGAAC

CTGAACT

GCGAGTACCCAATACCAATAG

GTCAATTTCAATTA

ACCACAT

TCACCAG

TCCACCC

CCACCCT

GCCACCG

GCCTTTTCCTTTTT

ACCACGA

CCACGAA

CCACGAC

CCATGGT

TTCACTG

TCTGACCCTGACCG

ACCAGAGCCAGAGG

CCCAGCC

GACATCA

GCCAGGC

TTGACTT

TTGACTC

TTGACTG

GTCATAATCATAAC

ACCATCA

CCATCAG

TCCATGA

CTCATGGTCATGGT

TTCATTC

CTCGCCG

TCGCCGC

GCCCAAA

AATGAGTATGAGTG

GCCCACGCCCACGC

CTTCTGA

CTTCTGG

CAACGCAAACGCAG

ATGGCAT

CCCCCAC

CCCCCGC

TGGCAAA

TATCCATATCCATC

ATGATGG

AATGCCC

TTCGGCG

TCGGCGG

TTCCGAC

TCCGACT

TCCGACA

CCCGCAC

CCCGCGC

GGCATCA

CCCGGAT

CCCGGAC

TTCCGGC

TCCGGCT

TTCCGTCTCCGTCG

TTATGGC

TATGGCC

TCCTGCA

AGTCACT

TTTGCCC

TTGCCCACCGAACG

TTCGAAGTCGAAGC

CTCGAAT

TTCGACC

CCGACTA

GCCGATA

CCGATAT

TCGATCG

GTTGCGGTTGCGGA

CCGATTG

CCGCACC

CCGCATT

CCGCCAA

CCGCCCA

CCGCCGC

CTGCGCT

AGGCACA

GCCGCGGCCGCGGT

CGCATGGGCATGGT

CCCCCGACCCCGAC

TTGCGGC

CCGCTGG

CCGCTGC

CCGGATA

CCGGCGT

CCGGCTG

CCGGGAT

CCGGTAA

CCGTATT

TTGCTGA

TTCGTCG

GGACACAGACACAC

CCGTGAG

CCGTGGC

GCCTGTT

GGCAGGT

TGCCGAT

ATGCTTT

TGGCAAGGGCAAGA

ACCTGCC

TCTGCGG

CTGCGGG

TGGCATC GGCATCC

GCCTCAACCTCAAC

GCTGACA

CTGACAG

TACAAAAACAAAAC

ACCTCCACCTCCAA

ATCTCCG

TCTCCGC

CCTCGAA

ACTGGCA

CTGGCAG

CCTCTCT

TCTCTGT

ATGGCCG

TTCTGAA

CCTGACA

TCTGATA

CCTGCAT

TTCTGGA

TCTGGAA

AAATTTAAATTTAT

TCTGGGA

TCTGGTA

CTCTGTG

TCTGTGT

CCTGTTT

CTAAGGTTAAGGTA

TGATTTG

TTGAAAA

TGCCCAA

GGATTTTGATTTTT

GTGACGA

GTGCCCC

GCGCAGTCGCAGTG

CTGCCCG

TGAACTT

TTGAAGATGAAGAA

GCTGGTA

TTGAAGT

CTGAATC

GCCAAATCCAAATT

TCGACCGCGACCGA

CTGACGATGACGAC

CTGGTTA

CTGACTG

AGCGCCAGCGCCAT

GGCAACG

CTGAGGC

TGAGGCG

AATGTAGATGTAGG

ATGATAATGATAAA

CTCAAATTCAAATA

CTGATAG

CTGATGG

CCGATGTCGATGTT

GAGCCGGAGCCGGT

CTGATTT

CGCACAG

TTGCGACTGCGACA

GTGTCTG

ACCAAACCCAAACG

TGGCGAG

CTGCATG

TGCCAGG

CGCCCAA

CTGCCCC

TGCCGGG

CCGCCTACGCCTAA

GCACAAA

CCGCGATCGCGATG

GCTGGTT

CTGCGGT

CCGCTCACGCTCAG

ATTGTTG

TTGTTGG

ACGCTGA

CTGGAAA

TTGCGGTTGCGGTA

ACGACCACGACCAC

CTGGCAC

CGGCTGA

CTGGCCCTGGCCCC

ACGGCGCCGGCGCG

GTTACCT

ACGGGAGCGGGAGT

ACGCCGCCGCCGCA

TTGGGGC

TGGGGCA

GCGGGTACGGGTAA

CTGGTAG

AATTAGC

ACCGATCCCGATCC

CGGTGCT

CTGGTGG

TGGTGGC

ACGGTTGCGGTTGA

CTGGTTT

GCGGCTG

GTTCGAATTCGAAA

ACGTAATCGTAATT

CTGGCGATGGCGAC

GGCAGAGGCAGAGG

GCCAAAG

CGCAGATGCAGATT

CCACTAG

CCACTAT

CAGCGGGAGCGGGC

TTTTATTTTTATTG

GTAACGTTAACGTA

TGTCTGC

GCGTCTTCGTCTTA

CAGCGTCAGCGTCC

CCGTGATCGTGATC

TGGAAGCGGAAGCG

TTGTGCCTGTGCCG

TTTCACT

AGTGAAAGTGAAAA

CTGTGTG

AATATTGATATTGG

CTGCAAC

TGCAACT

TGTTGAC

TGTTGGG

CTGTTTA

TTGTTTCTGTTTCG

CTGTTTGTGTTTGG

GTGGAAATGGAAAA

GGGTCGAGGTCGAC

TTTAACC

TACAGGT

TTGTTAG

TGCGCCG

TTTACGC

AGTGCTG

CCGGTTG

TGGCCGCGGCCGCC

TTTCGTC

TTAGCAA

TTAGCGT

TAACCACAACCACG

GTTAGTATTAGTAG

CTTATAGTTATAGC

TGTTGGCGTTGGCG

TTATCGG

GTTATGATTATGAA

TGCATCCGCATCCA

ACTCAACCTCAACA

TTTCACATTCACAG

GAGGCGT

CCTCAGACTCAGAT

TCGGATA

CTGACCA

TTTCATT

GTTCCAATTCCAAC

CGTAGCG

ACTCCCCCTCCCCG

GTCATGCTCATGCA

TTTCCGC

TTTCCTCTTCCTCG

TTTTGCC

TTTCGAC

GCCACCT

CTTCGGATTCGGAC

GTTCGGGTTCGGGC

CAACTTTAACTTTA

GTCAAAATCAAAAA

GTCATTC

AAAACAAAAACAAT

GTTGGGG

CCTGAAC

CTGATGT

AGTGCGC

CTTGCGATTGCGAG

GGCGAGT

CTGGCGG

ACTGGACCTGGACT

GCTGGATCTGGATT

ACTTTACCTTTACC

CTGGGAC

TTGGGGT

AACCAGCACCAGCT

AATTTCAATTTCAG

ATTGTGGTTGTGGC

TTTGTTG

TTGTTGA

CTTTAAATTTAAAC

CTTTAAC

ACTTAAGCTTAAGT

CCTAGGTCTAGGTC

CTTTTCTTTTTCTT

GGGACCCGGACCCG

AACAATTACAATTA

TTTTCAT

TTTTCGA

TTTTCGGTTTCGGC

AATTTGTATTTGTC

TTTTGCATTTGCAG

TGCCCACGCCCACA

GCTTGGCCTTGGCA

CCTTCTG

CCTTTAGCTTTAGC

TTTTATC

TTTTATG

CTTTTCA

TCAGTGC

TTTTTGC

TTTTTGG

CTTTTTC

Is it fast to build? Slow?Is it small? Big?

Page 4: Assembly in Practice: Part 2: DBG

De Bruijn graph

a_long_long_long_time Genome:

Reads:k-mers:

Pick k = 8

a_long_l _long_lo long_lon ong_long ng_long_ g_long_l _long_lo long_lon ong_long

g_long_t _long_ti long_tim ong_time

ng_long_ g_long_l

g_long_

_long_l _long_t

long_ti

ong_tim

long_lo

ong_lon

ng_long ng_time

a_long_

For each read:

For each k-mer:

Add k-mer’s left and right k-1-mers to graph if not there already. Draw an edge from left to right k-1-mer.

a_long_long_long, ng_long_l, g_long_time

Page 5: Assembly in Practice: Part 2: DBG

De Bruijn graph

To build graph: Pick k. Usually k is short relative to read length (k = 30 to 50 is common).

Sequencer outputs d reads of length n, total length N = dn.

d = 6 x 109 reads n = 100 nt

{

≈ 1 week-long run of

Illumina HiSeq 2000

For each read:

For each k-mer:

Add k-mer’s left and right k-1-mers to graph if not there already. Draw an edge from left to right k-1-mer.

Page 6: Assembly in Practice: Part 2: DBG

De Bruijn graph

To build graph: Pick k. Usually k is short relative to read length (k = 30 to 50 is common).

Sequencer outputs d reads of length n, total length N = dn.

d = 6 x 109 reads n = 100 nt

{

≈ 1 week-long run of

Illumina HiSeq 2000

# k-mers (edges): O(N)

# nodes is at most 2 ∙ (# edges); typically much smaller due to repeated k-1-mers

O(N)

Page 7: Assembly in Practice: Part 2: DBG

De Bruijn graph

How much work to build graph?

For each k-mer, add 1 edge and up to 2 nodes

Reasonable to say this is O(1) expected work

Say hash map holds nodes & edges

Say k-1-mers fit in O(1) machine words, and hashing O(1) words is O(1) work

Querying / adding a key is O(1) expected work

O(1) expected work for 1 k-mer, O(N) overall

g_long_

_long_l _long_t

long_ti

ong_tim

long_lo

ong_lon

ng_long ng_time

a_long_

Page 8: Assembly in Practice: Part 2: DBG

De Bruijn graph

Timed De Bruijn graph construction applied to progressively longer prefixes of lambda phage genome, k = 14

0 10000 20000 30000 40000 50000

0.00

0.05

0.10

0.15

0.20

Length of genome

Sec

onds

requ

ired

to b

uild

O(N) expectation works in practice

(in this case at least)

Page 9: Assembly in Practice: Part 2: DBG

De Bruijn graph

In typical assembly projects, average coverage is ~ 30 - 50

ng_l

g_lo a_lo

_lon

long

ong_

ng_t

g_ti

_tim

time

Page 10: Assembly in Practice: Part 2: DBG

De Bruijn graph

ng_l

g_lo a_lo

_lon

long

ong_

ng_t

g_ti

_tim

time

ng_l

g_lo

20

a_lo

_lon

10

long

ong_

30

20

ng_t

10

g_ti

_tim

10

10

20

30

time

10

Before: one edge per k-mer

After: one weighted edge per distinct k-mer

Page 11: Assembly in Practice: Part 2: DBG

De Bruijn graph

Say (a) reads are error-free, (b) we have one weighted edge for each distinct k-mer, and (c) length of genome is G

# of nodes and edges both O(N)

So # of nodes and edges are both O(G)

Can’t have more distinct k-mers than k-mers in the genome; likewise for k-1-mers

Combine with the O(N) bound and the # of nodes and edges are both O(min(N, G))

1 node per distinct k-1-mer, 1 edge per distinct k-mer

ng_l

g_lo

20

a_lo

_lon

10

long

ong_

30

20

ng_t

10

g_ti

_tim

10

10

20

30

time

10

Page 12: Assembly in Practice: Part 2: DBG

De Bruijn graph

At high coverage, O(min(N, G)) bound is advantageous

10 20 30 40 50

20000

40000

60000

80000

Average coverage

# de

Bru

ijn g

raph

nod

es +

edg

es

k = 30

Genome: lambda phage (~48,500 bp)

Draw random k-mers until target average coverage (x axis) is reached

Build graph, sum # nodes and # edges (y axis)

Page 13: Assembly in Practice: Part 2: DBG

De Bruijn graph

O(G) plateau

10 20 30 40 50

20000

40000

60000

80000

Average coverage

# de

Bru

ijn g

raph

nod

es +

edg

es

k = 30

Build graph, sum # nodes and # edges (y axis)

Genome: lambda phage (~48,500 bp)

Draw random k-mers until target average coverage (x axis) is reached

At high coverage, O(min(N, G)) bound is advantageous

Page 14: Assembly in Practice: Part 2: DBG

De Bruijn graph

Advantages

Can build in O(N) expected time, N = total length of reads

With error-free data, space is O(min(N, G)); G = genome length

Compares favorably with overlap graph

Fast construction (suffix tree) is O(N + a) time, where a is O(d2)

Overlap graph has node for every read, edge for every overlap

When average coverage is high, G ≪ N

Page 15: Assembly in Practice: Part 2: DBG

De Bruijn graph

Disadvantages

Reads are immediately split into shorter k-mers, losing the ability to resolve some repeats resolvable by overlap graph

We lose read coherence. Some paths through De Bruijn graph are inconsistent with respect to input reads.

Only relatively short, exact overlaps are considered, which makes handling of sequencing errors more complicated

Page 16: Assembly in Practice: Part 2: DBG

Assembly alternatives

De Bruijn Overlap

Time to build O(N)Suffix tree: O(N + a)

Dyn Prog: O(N2)

SpaceO(N)

Error-free: O(min(N, G))O(N + a)

n = # readsd = read lengthN = dn = # bases

When average coverage is high, G ≪ N and the G is the more relevant bound for De Bruijn graph size

a = # overlaps; a ∈ O(n2)G = source genome length

Page 17: Assembly in Practice: Part 2: DBG

Error correction

When data is error-free, # nodes, edges in De Bruijn graph is O(min(G, N))

10 20 30 40 50

0e+002e+044e+046e+048e+041e+05

Average coverage

# de

Bru

ijn g

raph

nod

es +

edg

es

What about data with sequencing errors?O(G) plateau

k = 30

Average Lambda phage coverage

# D

e Br

uijn

gra

ph e

dges

Page 18: Assembly in Practice: Part 2: DBG

Error correction

For large k, set of k-mers in genome is tiny subset of all 4k k-mers

Errors tend to yield new k-mers that don’t appear elsewhere

Given k-mer from genome, we expect most of its neighbors (e.g. by Hamming distance) are not in the genome

How many possible DNA strings of length k? 4k

How many possible DNA strings of length 20? 420 = 240 ≈ 1 trillionHow many strings of length 20 in human genome? ~3 billion

Analogy: correctly / incorrectly spelled words in collection of documents

Page 19: Assembly in Practice: Part 2: DBG

Error correction

Correcting errors up-front prevents De Bruijn graph from growing far beyond O(G) plateau

How to correct?

Analogy: how to spell check a language you’ve never seen before?

Errors tend to turn frequent words (k-mers) to infrequent ones. Corrections should do the reverse.

Page 20: Assembly in Practice: Part 2: DBG

Error correctionng_l

g_lo

20

a_lo

_lon

10

long

ong_

30

20

ng_t

10

g_ti

_tim

10

10

20

30

time

10

Left: Take example, mutate a k-mer character randomly with probability 1%

ng_l

g_lo

20

a_lo

_lon

9

lolg

olg_

1

long

ong_

29

onga

ngal

1

atlo

tlon

1

19

ng_t

10

g_ti

_tim

10

_l_n

l_ng

1

1

10

_loo

loog

1

20

27

time

10

ng_l

g_lo

20

a_lo

_lon

9

lolg

olg_

1

long

ong_

29

onga

ngal

1

atlo

tlon

1

19

ng_t

10

g_ti

_tim

10

_l_n

l_ng

1

1

10

_loo

loog

1

20

27

time

10

Right: 6 errors yield 10 new nodes, 6 new weighted edges, all with weight 1

Page 21: Assembly in Practice: Part 2: DBG

Error correction

Same experiment as before, with 5% error added

10 20 30 40 50

050000

150000

250000

Average coverage

# de

Bru

ijn g

raph

nod

es +

edg

es

0% error

5% error

O(G) plateau

Errors "push through" G bound

As more k-mers overlap errors, # nodes & edges approach N

k = 30

Average Lambda phage coverageAverage Lambda phage coverage

# D

e Br

uijn

gra

ph e

dges

Page 22: Assembly in Practice: Part 2: DBG

0%

G

10 20 30 40 50

050000

150000

250000

Average coverage

# de

Bru

ijn g

raph

nod

es +

edg

es

1%

Error correction

Same experiment as before, with 5% error added5%

Errors "push through" G bound

As more k-mers overlap errors, # nodes & edges approach N

k = 30

Average Lambda phage coverage

(Now with 1% error added)

# D

e Br

uijn

gra

ph e

dges

Average Lambda phage coverage

Page 23: Assembly in Practice: Part 2: DBG

Error correction

k-mer count histogram:

x axis is an integer k-mer count, y axis is # distinct k-mers with that count

Right: such a histogram for 3-mers of CATCATCATCATCAT:

CAT occurs 5 times

ATC and TCA occur 4 times

Page 24: Assembly in Practice: Part 2: DBG

Error correction

~ 6,100 distinct k-mers occurred exactly 10 times in the input

5 10 15 20 25

02000

4000

6000

8000

10000

k-mer count

# di

stin

ct k

-mer

s w

ith th

at c

ount

Error-free

Draw 20-mers from genome randomly until each 20-mer has been drawn 10 times on average

How would the picture change for data with 1% error rate?

Page 25: Assembly in Practice: Part 2: DBG

Error correction

~ 9.6K k-mers occur once

32 k-mers occur once

k-mers with errors usually occur fewer times than error-free k-mers

5 10 15 20 25

02000

4000

6000

8000

10000

k-mer count

# di

stin

ct k

-mer

s w

ith th

at c

ount

Error-free0.1% error

Page 26: Assembly in Practice: Part 2: DBG

Error correction

Idea: errors tend to turn frequent k-mers to infrequent k-mers, so corrections should do the reverse

Say each 8-mer occurs an average of ~10 times:

GCGTATTACGCGTCTGGCCT

GCGTATTA: 8 CGTATTAC: 8 GTATTACG: 9 TATTACGC: 9 ATTACGCG: 10 TTACGCGT: 10 TACGCGTC: 11 ACGCGTCT: 11 CGCGTCTG: 10 GCGTCTGG: 10 CGTCTGGC: 11 GTCTGGCC: 9 TCTGGCCT: 8

Read:

8-mers:

# times each 8-mer occurs in the reads. “k-mer count profile”

All 8-mer counts are near average, suggesting read is error-free

(20 nt)

Page 27: Assembly in Practice: Part 2: DBG

Error correction

Suppose there’s an error

GCGTACTACGCGTCTGGCCT

GCGTACTA: 1 CGTACTAC: 2 GTACTACG: 1 TACTACGC: 1 ACTACGCG: 2 CTACGCGT: 1 TACGCGTC: 9 ACGCGTCT: 8 CGCGTCTG: 10 GCGTCTGG: 10 CGTCTGGC: 11 GTCTGGCC: 9 TCTGGCCT: 8

Read:

Around average

Below averagek-mer count profile has corresponding stretch of below-average counts

Page 28: Assembly in Practice: Part 2: DBG

Error correction

k-mer counts when errors are in different parts of the read:

GCGTACTACGCGTCTGGCCTGCGTACTA: 1 CGTACTAC: 3 GTACTACG: 1 TACTACGC: 1 ACTACGCG: 2 CTACGCGT: 1 TACGCGTC: 9 ACGCGTCT: 8 CGCGTCTG: 10 GCGTCTGG: 10 CGTCTGGC: 11 GTCTGGCC: 9 TCTGGCCT: 8

GCGTATTACACGTCTGGCCTGCGTATTA: 8 CGTATTAC: 8 GTATTACA: 1 TATTACAC: 1 ATTACACG: 1 TTACACGT: 1 TACACGTC: 1 ACACGTCT: 2 CACGTCTG: 1 ACGTCTGG: 1 CGTCTGGC: 11 GTCTGGCC: 9 TCTGGCCT: 8

GCGTATTACGCGTCTGGTCTGCGTATTA: 8 CGTATTAC: 8 GTATTACG: 9 TATTACGC: 9 ATTACGCG: 9 TTACGCGT: 12 TACGCGTC: 9 ACGCGTCT: 8 CGCGTCTG: 10 GCGTCTGG: 10 CGTCTGGT: 1 GTCTGGTC: 2 TCTGGTCT: 1

Index

y

Index

y

Index

y

Page 29: Assembly in Practice: Part 2: DBG

Error correction

Count profile indicates where errors are

2 4 6 8 10 12

24

68

10

k-mer position

count

These probably overlap an error

Page 30: Assembly in Practice: Part 2: DBG

Error correction

Simple algorithm, given a count threshold t:

For each read:

For each k-mer:

If k-mer count < t:

Examine k-mer’s neighbors within some Hamming/edit distance. If neighbor has count ≥ t, replace old k-mer with neighbor.

5 10 15 20 25

02000

4000

6000

8000

10000

k-mer count

# di

stin

ct k

-mer

s w

ith th

at c

ount

Error-free0.1% error

Pick t corresponding to dip between the peaks

Page 31: Assembly in Practice: Part 2: DBG

Error correction: implementation excerpt

def correct1mm(read, k, kmerhist, alpha, thresh): ''' Return an error-corrected version of read. k = k-mer length. kmerhist is kmer count map. alpha is alphabet. thresh is count threshold above which k-mer is considered correct. ''' # Iterate over k-mers in read for i in range(0, len(read)-(k-1)): kmer = read[i:i+k] # If k-mer is infrequent... if kmerhist.get(kmer, 0) <= thresh: # Look for a frequent neighbor for newkmer in neighbors1mm(kmer, alpha): if kmerhist.get(newkmer, 0) > thresh: # replace with neighbor read = read[:i] + newkmer + read[i+k:] break return read

Full Python example: http://bit.ly/CG_ErrorCorrect

Page 32: Assembly in Practice: Part 2: DBG

Error correction: results

Corrects 99.2% of errors in an example with 0.1% error added

Before After

0 50 100 150

01000

2000

3000

4000

k-mer count

# di

stin

ct k

-mer

s w

ith th

at c

ount

Error-free0.1% error

0 50 100 150

01000

2000

3000

4000

k-mer count

# di

stin

ct k

-mer

s w

ith th

at c

ount

Error-free0.1% error, corrected

From 194K k-mers occurring exactly once to just 355

Page 33: Assembly in Practice: Part 2: DBG

Error correction: results

5 10 15

60000

100000

140000

Average coverage

# de

Bru

ijn g

raph

nod

es +

edg

es

Error-free1% error, corrected1% error, uncorrectedG bound

Uncorrected, graph size is off the chart

Corrected, graph size is near G bound

Also works for 1% error...#

De

Brui

jn g

raph

edg

es

Average Lambda phage coverage

...provided enough coverage to distinguish frequent/infrequent

Page 34: Assembly in Practice: Part 2: DBG

Error correction

Average coverage & k must be such that we can distinguish frequent from infrequent k-mers

To work well:

k-mer neighborhood explored must be broad enough to find frequent neighbors. Depends on error rate and k.

Data structure for storing k-mer counts should be smaller than the De Bruijn graph

Otherwise, what's the point? 🙂

Alternately, we might give up on correcting and simply remove bad k-mers

Page 35: Assembly in Practice: Part 2: DBG

Data structures for error correction

Bloom filters

Counting quotient filters

Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biology. 2014;15(11):509.

Pandey P, Bender MA, Johnson R, Patro R. Squeakr: an exact and approximate k-mer counting system. Bioinformatics. 2017; btx636.

Don’t need 100% accurate k-mer counts; just have to distinguish frequent and infrequent

CountMin sketchesCrusoe MR, Alameldin HF, Awad S, Boucher E, ..., Brown CT. The khmer software package: enabling efficient nucleotide sequence analysis. F1000 Research. 2015 Sep 25;4:900.

Page 36: Assembly in Practice: Part 2: DBG

Assembly alternatives

Overlap

Layout

Consensus

Error correction

De Bruijn graph

Scaffolding

Refine

Page 37: Assembly in Practice: Part 2: DBG

CGAAAAA

GAAAAAC

AAAAAAC

AAAAACC

AAAAACTAAAAACA

GAAAAAGAAAAAGG

TAAAAAT

AAAAATT

GGAAACA

GAAACAC

AAAACCA

AAAAAAA

AAAAAAG

TGAAAACGAAAACT

AAAAAGA

AAAAGAG

AAAAAGC

AAAAGCC

AAAAGGC

GAAAATGAAAATGT

TAAAATT

AAAATTT

GTATCTATATCTAC

AAACACA

CGCCATT

GCCATTA

AAAACAT

AAACATG

AAACCAT

TGACGGTGACGGTA

AAAACCGAAACCGC

AGCCCGC

GCCCGCA

GAAACGC

AAACGCA

AGAACGTGAACGTT

AAAACTG

AAACTGC

AAAACTTAAACTTT

AAGGTAA

AGGTAAC

AAAGAGT

GAAAGCA

AAAGCAA

AAAGCAG

AAAGCCC

AGAAGCG

GAAGCGC

AAAGGCG

TGAAGTT

GAAGTTC

GAAGTTA

CAAATAA

AAATAAA

TAAATAC

AAATACT

CAGCGAT AGCGATG

AAAATCA

AAATCAC

AAATCAT

CTTAGGT

TTAGGTC

GGAAAGC

CAAATGC

AAATGCA

AAATGTC

AAATGTA

AAAATTA

AAATTAC

GGAATTT

GAATTTG

AAACAAAAACAAAG

TAACAAC

AACAACC

AACAACG

AAGCGCG

AGCGCGT

AACACAG

AGAAAAA

GAAAAAA

AGACAGA

GACAGAT

CTAAATA

TGACAGTGACAGTG

CTGGTAA

TGGTAACCAACATC

AACATCC

AACATGT

TGACATTGACATTG

CGACCAA

GACCAAA

TGACCAC

GACCACA

GACCACC

CGACCAG

GACCAGG

CAACCAT

AACCATC

AACCATG

ATGGTGC

TGGTGCT

GTGGGAC

TGGGACG

CTGACGG

TGACGGG

TGACCTGGACCTGG

CCCCGCC

CCCGCCA

AAAAAATAAAAATG

CTGCTGG

TGCTGGC

TGCTGGT

TGACGAG

GACGAGA

CAACGATAACGATG

GGACGCA GACGCAA

TGACGCG

GACGCGT

GGACGCTGACGCTA

CAACGGG

AACGGGC

TAACGGT

AACGGTG

GAACGTA

AACGTAT

AAAACAGAAACAGC

ACGGTTC

CGGTTCC

CGACTAC

GACTACT

CGCAATGGCAATGG

CAGGTTG

AGGTTGC

GGACTCGGACTCGC

TAACTGCAACTGCC

GAACTGG

AACTGGT

TGACTTA

GACTTAG

GAACTTC

AACTTCT

AACTTTC

CGAGAAAGAGAAAA

GAGAAAT

GCGTAAACGTAAAT

TGCAACG

GCAACGG

AACCGGTACCGGTC

ACCACCA

CCACCAT

CCACCAC

CACCGTAACCGTAT

AAGAGTG

TAAATTA

AAATTAA

AAGCAAT

TAGTGCG

AGTGCGG

AAGCAGA

CTGCGATTGCGATG

AAGCCCG

CAAGCCG

AAGCCGC

ACACAGA

CACAGAA

GTAACGG

CGAGCTG

GAGCTGG

AGAAAAT

GTTCTGATTCTGAT

CGAACGTGAACGTC

AGAGGCAGAGGCAG

AAGGCGA

AAAGGTA

GAAGGTTAAGGTTT

TGAGTAA

GAGTAAA

AGAGTAC

GAGTACA

TGAACTG

ACCGCTG

CCGCTGA

CCTATATCTATATA

CGAGTGTGAGTGTT

CGAGTTAGAGTTAG

AAGTTCG

AATAAAA

CGAGGTA

GAGGTAA

AATACTT

ACATCAG

CATCAGT

TGATAGC

GATAGCA

GTCCTCT

TCCTCTC

CAATATAAATATAGCAATATC

AATATCA

CAATATGAATATGT

CGATATT

GATATTC

AATCACC

AAACAGTAACAGTT

GAATCCA

AATCCAC

CAATCCGAATCCGC

TTCCTGCTCCTGCG

CGATCGC

GATCGCC

ACGTATTCGTATTT

GAATCTA

AATCTAC

CAAAGCC

TAATGAA

AATGAAA

TGCGGGT

GCGGGTT

GCGGGTG

TAATGAGAATGAGA

CGATGAT

GATGATT

AATGCAG

CGATGCC

GATGCCG

CACACCAACACCAT

CTTTTTT

TTTTTTT

TGATGGC

GATGGCT

ACGGGGCCGGGGCT

AATGTCG

AATTAAA

AATTACA

GGCGCTGGCGCTGT

GAATTCCAATTCCG

AAATTCGAATTCGG

CAAAGGCAAAGGCA

TGATTGA

GATTGAA

TGATTTAGATTTAG

GGATTTC

GATTTCC

AATTTGC

AAATTTT

AATTTTA

GACAAAAACAAAAG

GGCAAAT

GCAAATG

TACAACA

ACAACAG

ACAACCA

GGCAAGC

GCAAGCC

ATGTCCT

TGTCCTG

GGCAATA

GCAATAT

AGCAATG

GCAATGC

CACAATT

ACAATTG

TACACAA ACACAAC

TCGTCGA

CGTCGAC

AACACATACACATT

AGCACCA

GCACCAC

GCACCAG

TAGCGAGAGCGAGA

CGCACCGGCACCGA

CGCACCT

GCACCTG

GACACGCACACGCC

CACCGAAACCGAAA

GTCGACCTCGACCA

ACAGAAA

CACAGAC

ACAGACA

TACAGAG

ACAGAGT

ACAGATA

AGCAGCGGCAGCGT

CGCAGCTGCAGCTG

TACAGGA

ACAGGAA

GGCAGGG

GCAGGGG

GCAGGGC

CACAGGT

ACAGGTA

TGCAGTAGCAGTAC

ACAGTGC

GACATACACATACA

GGCATAGGCATAGC

TACATCA

ACATCCA

CCGTTAT

CGTTATC

CACATGAACATGAT

GCTATATCTATATG

TGCATGG

GCATGGC

ACATGTC

CGCATTA

GCATTAG

CGCATTC

GCATTCC

GCATTCG

GGCATTGGCATTGC

TGAATCC

CGCCAAA

GCCAAAA

CACCAAC

ACCAACC

AACCAAT

ACCAATA

TACCACA

ACCACAG

CACCACC

TAAAAACCACCACT

ACCACTA

TACCAGCACCAGCA

ACCAGGA

CGAATCT

ACCATCT

ACCATGC

AACCATT

ACCATTA

TACCCAA

ACCCAAT

TACGGCCACGGCCG

CGCCCAG

GCCCAGC

TGCCCCC

GCCCCCA

GCCCCCG

GAAATGC

CACCCGCACCCGCC

ACCCGCA

TGCCCGG

GCCCGGA

AGCGATTGCGATTA

CGTCGAAGTCGAAT

CGCCCTGGCCCTGC

CAGTGCC

AGTGCCC

TGCCGAA

GCCGAAC

TACCGAGACCGAGG

GGAATGTGAATGTG

AGCCGCA

GCCGCAT

CGCCGCC

GCCGCCC

GCCGCCG

ACGGGCA

CGATCCG

GATCCGG

GTCAGCG

TCAGCGA

GGCCGGC

GCCGGCG

AGCCGGG

GCCGGGA

TGCCGGT

GCCGGTA

GCCGGTC

CGCCGTA

GCCGTAT

TACCGTCACCGTCA

TGCCGTG

GCCGTGA

GCCGTGG

TACCGTT

ACCGTTA

ACCGTTG

TTAATTGTAATTGA

TACCTCG

ACCTCGA

CAGTGGC

AGTGGCA

CACCTGA

ACCTGAC

ATCGCCA

TCGCCAT

TGCCTGGGCCTGGT

GGCGCGTGCGCGTT

CAGCGTGAGCGTGT

CACCTTTACCTTTA

GCGGGTCCGGGTCG

TCACAAT

TCGTGGT

CGTGGTC

CGTGGTT

CCCGCCG

CCGCCGT

TAAAAAA

CACAACA

GGCGAGA

GCGAGAA

GGCGAGC

GCGAGCT

AACGAGG

ACGAGGT

TGCGAGT GCGAGTG

AGCGATAGCGATAA

TAGCGCG

GCGATGA

GCGATGC

ACGGTGC

CGGTGCG

ACGCAAC

AGCGCAC

GCGCACA

AACGCAT

ACGCATT

CAAAAGCAAAAGCA

CCGTGTGCGTGTGG

CGCGCCG

GCGCCGA

TACGCGC

ACGCGCC

TTACACA

AACGCGTACGCGTG

AACGCTG

ACGCTGC

TACGGAGACGGAGT

ACGCGTA

TGCGGCA

GCGGCAA

GCGGCAG

GGCGGCCGCGGCCT

GGCGGCGGCGGCGT

GGCGGGAGCGGGAC

TGCGGGC

GCGGGCT

TACACAGACACAGT

TTTTTTC

TTTTTCG

GGCGGTA

GCGGTAC

AGCGGTCGCGGTCA

AACGGTT

ACGGTTA

GCACAGA

TGCGCTG

GCGCTGA

CATTACC

ATTACCT

ATTACCA

GGCGTGT

GCGTGTT

AACGTTA

ACGTTAC

GTACAGG

AACGTTTACGTTTT

CACTAAA

ACTAAAT

AAGTTTCAGTTTCG

GAGTTCG

AGTTCGG

TGAAAAA

ACTACTC

CTCTGCC

TCTGCCC

AGCTATGGCTATGA

TTACATCTACATCG

CCGTTGA

CGTTGAT

TACTCCAACTCCAC

TACTGAAACTGAAA

TACTCCG

ACTCCGC

AACTCCTACTCCTG

ACTCGCC

TGCAGAA

GCAGAAC

CAACATTAACATTC

AGCTGGT

GCTGGTG

TGCTGAA

GCTGAAT

GGCTGAC

GCTGACG

GCTGACC

GACTGAGACTGAGG

CGCTGAT

GCTGATT

GCTGATC

GACTGCA

ACTGCAA

CACTGCC

ACTGCCG

ACTGCCT

CGCTGCG

GCTGCGC

GCTGCGG

TTACCAC

TACCACC

CGCTGGC

GCTGGCA

ACTGGTT

TCCGCCCCCGCCCC

AACACCAACACCAC

AGCTGTTGCTGTTG

GTGCTGG

GACGGGA

ACGGGAC

ACTTAGG

CACCATC

CACCATT

CAGTTTG

AGTTTGT AGTTTGC

ACTTCTG

TCGCCGT

ACAGCGA

CAGAGTG AGAGTGT

TGCTTTA

GCTTTAC

ACTTTCG

TAAAACA

GGCTTTT

GCTTTTT

ATAAAAC

ATAAAAA

GGGAAATGGAAATA

CCACCCG

CAGAACG

GGGAAGCGGAAGCA

AGGAAGGGGAAGGG

CAAAATC

AGGAATT

CAGACAG

AGTAAAT

GTAAATT

ACCAAAG

TGCGGTG

GCGGTGC

GGGACGC

GGACGCG

GGGACTAGGACTAG

GGGACTC

GGAACCAGAACCAC

TTACCGT

CGCGTACGCGTACA

AGTAACTGTAACTT

CAGAGTA

ATGCCAATGCCAAG

TTACCTC

CAGATAA

AGATAAA

AGATAAT

CGGATAG

GGATAGC

TTACCTG

TACCTGC

CTGACGC

TGGATTA GGATTAA

GGGATTCGGATTCC

TGCCCGC

GCCCGCG

GGGATTT

TTAAAATTAAAATA

CGGCAAG

GGGCAAT

TGGCACAGGCACAA

TAGCACC

TGGCACGGGCACGA

TAGCAGC

AGCAGCT

AGGCAGG

TGGCAGTGGCAGTG

AGGCATA

GGCATAC

TAGCATC AGCATCA

CGGCCAA

GGCCAAA

TGGCCAC

GGCCACC

GTGGTCATGGTCAC

GAGCCGAAGCCGAT

TGGCCGG

CTGAGTGTGAGTGC

CGCGTGG

GCGTGGT

CGGCCTGGGCCTGT

AGGCGAG

TAGCGATTAGCGCA

CAGCGCG

CACCGCCACCGCCA

TGGCGGC

TTATCACTATCACC

CGGCGGT

CGTACAG

CGGCGTG

TGCTGAT

GCTGATG

GCCCGAACCCGAAT

TTCGCTGTCGCTGG

GGGCTCTGGCTCTT

GGGCTGA

ACCTGGTCCTGGTA

TTTACCC

TTACCCA

CAGGAAA

AGGAAAC

ACCCCGACCCCGAA

CAACGGT

TCTACCG

CTACCGT

CGGGACT

CAGGATAAGGATAG

CAGGATG

AGGATGC

CGGGATT

CGGGCAA

GGGGCAG

GGGCAGG

GGGCAGT

TAGGCAT

CGGGCCTGGGCCTT

CGGGCTG

CGGGCTT

GGGCTTT

CACCGCT

AGGGGCA

GGGGGCCGGGGCCT

CAGGGGC

CCTATTTCTATTTT

CTACTCC

CAGGTAA

GAGGTACAGGTACA

GGGGTAGGGGTAGC

TAGGTCA

AGGTCAC

TGGGTCA

GGGTCAT

GGGTCAG

GGGGTGAGGGTGAC

CAGGTGG

AGGTGGC

TATACTT

ATACTTT

TGGGTTAGGGTTAT

CAGGTTCAGGTTCT

CGGGTTG

GGGTTGC

GGTAACG

GGTAACA

GGTAACC

GAGTAAGAGTAAGT

CGGTAAT

GGTAATG

AGTACAC

AGTACAT

TTGAGGCTGAGGCA

TGGTAGC

GGTAGCG

GGTCACT

GGTCACC

CGGTCAG

GGTCAGG

GGTCATT

TGACTGC

CGGTCGA

GGTCGAA

TGGTCGTGGTCGTT

ATCGATC

TCGATCC

CCAGGAA

GCAGCTT

GTAGCGA

GGTGCGG

GGTGCTG

TAACCAA

GGTGGCC

CAGTGGG

AGTGGGT

TGGTGGT

GGTGGTT

GAGTGTC

AGTGTCT

CAGTGTGAGTGTGG

AGTGTTG

TGGTTAC

GGTTACC

TCACCACCACCACA

GGTTCCG

TGGTTCT

GGTTCTG

GGTTGCC

GGGTTGTGGTTGTC

TGGTTTC

GGTTTCA

CCAACCA

AGCAGGAGCAGGAA

GATAAAA

CATAAACATAAACG

ATTGAAA

CTCCGCT

TCCGCTG

GATAACAATAACAA

AATAACCATAACCA

GTAACGA

ATAGAAC

TAGAACA

GTAATGA

AATACAAATACAAC

GTACACA

CCATTAC

GGTACAT

GTACATC

GCCTGTGCCTGTGT

AATAGAA

ATAGCAT

ATAGCAG

CATAGCG

ATAGCGC

TATAGGCATAGGCA

GATAGTAATAGTAT

CGCAACG

CTGCCGG

AGTATACGTATACA

ATATAGG

ACGTCACCGTCACT

ATATCAG

CTCTCTG

TCTCTGC

TATATCGATATCGG

GTAACAA

ATATGTC

ATATTCT

GATATTG

ATATTGC

GTATTTT

ACTGCACCTGCACC

CATCAAC

ATCAACG

CAAAGGT

GGTCACA GTCACAA

ATCACCA

TTCTGAC

TCTGACG

TCTGACT

GTCAGGA

ATCAGTG

ATCAGTT

TGCCGAGGCCGAGA

GTCATTA

ATCCACC

ATCCACG

CCACCGT

CACCGTC

CATCCAT

ATCCATG

AGCTTTA

GATCCCGATCCCGA

ACCGTCC

ATCCGGT

CGTTCCGGTTCCGG

CGTCCTC

GTCCTGC

CGCTGCCGCTGCCC

GTCGAAA

TCACCAT

TATCGAT

CCAGCACCAGCACC

CCGTCCT

TGTCGGGGTCGGGG

AACTGCT

ACTGCTG

GATCGTAATCGTAT

TTCACCA

TCACCAA

CGTATTCGTATTCC

ATCTACC

CGTATTG

GTATTGC

GCTGGCG

TGTCTGA

GTCTGAT

ATCAGGTTCAGGTA

CATCTGG

ATCTGGT

CATCTGTATCTGTG

TGTCTTAGTCTTAG

TTTTCTG

TTTCTGC

CTTTCGT

ATGAAAA

ATGAAAG

GTTATGGTTATGGA

GTGCTGATGCTGAC

CGTGAGT

GTGAGTA

GGTGATGGTGATGG

ATGATTG

ATGCAGA

TGTCAATGTCAATC

AATGCCA

ATGCCAG

GTGCCCG

ATGCCGA

CATGCGA ATGCGAG

GTGCGGG

CATCACC

AATGGAAATGGAAA

TGTGGAT GTGGATT

GTGGCAA

GTGGCCA

CGTGGCG

GTGGCGA

ATGGCTG

GTGGGTCCATGGTG

GTGGTTC

GTCACTA

ACGGGTT

CATGTCC

ATGTCGA

TATGTCT

ATGTCTC

CACTCCG

GGGATTAGGATTAC

TGTGTGG

GTGTGGA

CGTGTTA

GTGTTAG

GTGTTGA

GATTAAAATTAAAA

ATTACAG

CGTTACC

GTTACCG

TATTACGATTACGG

CATTACTATTACTA

TGTTAGA

GTTAGAA

CATTAGC

ATTAGCA

ATTAGCG

AGCTTCT

GCTTCTG

AGTTATAGTTATAG

GTTATCG

CATTATG

ATTATGG

CGGTACA

CATTCCG

ATTCCGG

CCAGGCA

CAGGCAG

GTTCGGC

GGTTCTCGTTCTCG

TATTCTG

ATTCTGG

GTTTCGCTTTCGCT

TGTTGAA

GTTGAAG

TATTGACATTGACT

GTTGATA

GTTGCCG

TATTGCG

ATTGCGG

TATTGCT

ATTGCTG

TACTTTAACTTTAA

TTGATAT

TCCGGTC

CCGGTCG

TTACCCGTACCCGA

AGGCTGA

TGTTTACGTTTACG

GTTTCAC

ATTTCCG

AGTTTGAGTTTGAT

GATTTGCATTTGCC

GTTTGTT

ATTTTAT

CGTTTTC

GTTTTCT

TTTGCCG

TTGCCGA

TTGCCGT

TATTTTT

ATTTTTG

TTAAAAA

CCAAAAT

ACAAAGC

CCAAAGG

CCAAATA

TGGAAAG

CAACAGG

ACAACAT

TTAACCA

TAACGAG

TAACGAC

TCAACGC

CAACGCT

CAACGCC

ACAACGTCAACGTT

AATCCGG

CGCCGAT

GCCGATT

TTCCGCT

CGAGTAA

GACTTCT

CCGAACT

CGAACTT

GCGCGTG

TATCAGC

ATCAGCG

GCAATACCAATACG

TAACCGTAACCGTG

CCAATAT

AGCTTTT

GCTTTTC

CAATGCC

TTAATTATAATTAA

CCGAAGGCGAAGGT

CAATTGA

TCACAACCACAACG

GCACAAT

TTACAGA

TACAGAT

CAGAATAAGAATAC

CCACAGG

TCAGTCTCAGTCTG

GTACATATACATAA

CCACATG

CACATGG

TCGAATC

ACACCAGCACCAGC

TGTCGAT

GTCGATC

CCTGCCG

CTGCCGT

GCAGTGC

TTACCGCTACCGCT

GCAGTGG

GGATGCT

GATGCTT

TAGCGGT

AATTGAA

TTACGCG

TATTGATATTGATG

TTACGGTTACGGTG

TTACGTCTACGTCG

TCACTAA

CACTAAC

TGAAACG

GCACTATCACTATT

TACTCCC

GTTCCGA

TCACTGC

CACTGCA

ATACTGGTACTGGA

CAGAAAA

TTAGAAG

TAGAAGC

AGTCGGTGTCGGTC

CGATGGCGATGGCC

ATAGCAATAGCAAC

TTAGCAC

CCAGCCACAGCCAA

CCAGCCG

CAGCCGG

GCGACGCCGACGCC

TTAGCGG

TCGAAAA

GTGACGG

CAGCTTC

CAGGAAT

TCAGGAT

CAGGATC

CACATCA

CATTCTG

ATTCTGA

ACGACTA

ATATTACTATTACA

GCAGGTG

GCAGGTT

GCAGTGACAGTGAG

CAGTGCG

TCAGTGG

ACAGTTGCAGTTGA

TCAGTTT

ATATAAATATAAAA

CGATTCTGATTCTA

TGGCTGG

GGCTGGT

TGTCTCT

GTCTCTG

GCATCAA

TTCTGCG

TTATCGA

TGATATT

GTTCTGG

TTCTGGG

CCATCTG

TCGAGCCCGAGCCG

CCATGAA

CATGAAA

CATGAAC

ACATGACCATGACA

ACATGCACATGCAG

CCATGCG

CCATCAC

CATGGCA

CATGGCT

ACATGGT

GCATCAG

ATTCCGCTTCCGCC

CCCAACGCCAACGT

TCATTAC

CCATTAG

CCATTAT

CATTATC

GGTGCCGGTGCCGA

TCATTCT

TTATTGACGTCTTT

GTCTTTT

ATGAAAC

CCCAAAT

TCTGAAC

CTGAACT

GCGAGTACCCAATACCAATAG

GTCAATTTCAATTA

ACCACAT

TCACCAG

TCCACCC

CCACCCT

GCCACCG

GCCTTTTCCTTTTT

ACCACGA

CCACGAA

CCACGAC

CCATGGT

TTCACTG

TCTGACCCTGACCG

ACCAGAGCCAGAGG

CCCAGCC

GACATCA

GCCAGGC

TTGACTT

TTGACTC

TTGACTG

GTCATAATCATAAC

ACCATCA

CCATCAG

TCCATGA

CTCATGGTCATGGT

TTCATTC

CTCGCCG

TCGCCGC

GCCCAAA

AATGAGTATGAGTG

GCCCACGCCCACGC

CTTCTGA

CTTCTGG

CAACGCAAACGCAG

ATGGCAT

CCCCCAC

CCCCCGC

TGGCAAA

TATCCATATCCATC

ATGATGG

AATGCCC

TTCGGCG

TCGGCGG

TTCCGAC

TCCGACT

TCCGACA

CCCGCAC

CCCGCGC

GGCATCA

CCCGGAT

CCCGGAC

TTCCGGC

TCCGGCT

TTCCGTCTCCGTCG

TTATGGC

TATGGCC

TCCTGCA

AGTCACT

TTTGCCC

TTGCCCACCGAACG

TTCGAAGTCGAAGC

CTCGAAT

TTCGACC

CCGACTA

GCCGATA

CCGATAT

TCGATCG

GTTGCGGTTGCGGA

CCGATTG

CCGCACC

CCGCATT

CCGCCAA

CCGCCCA

CCGCCGC

CTGCGCT

AGGCACA

GCCGCGGCCGCGGT

CGCATGGGCATGGT

CCCCCGACCCCGAC

TTGCGGC

CCGCTGG

CCGCTGC

CCGGATA

CCGGCGT

CCGGCTG

CCGGGAT

CCGGTAA

CCGTATT

TTGCTGA

TTCGTCG

GGACACAGACACAC

CCGTGAG

CCGTGGC

GCCTGTT

GGCAGGT

TGCCGAT

ATGCTTT

TGGCAAGGGCAAGA

ACCTGCC

TCTGCGG

CTGCGGG

TGGCATC GGCATCC

GCCTCAACCTCAAC

GCTGACA

CTGACAG

TACAAAAACAAAAC

ACCTCCACCTCCAA

ATCTCCG

TCTCCGC

CCTCGAA

ACTGGCA

CTGGCAG

CCTCTCT

TCTCTGT

ATGGCCG

TTCTGAA

CCTGACA

TCTGATA

CCTGCAT

TTCTGGA

TCTGGAA

AAATTTAAATTTAT

TCTGGGA

TCTGGTA

CTCTGTG

TCTGTGT

CCTGTTT

CTAAGGTTAAGGTA

TGATTTG

TTGAAAA

TGCCCAA

GGATTTTGATTTTT

GTGACGA

GTGCCCC

GCGCAGTCGCAGTG

CTGCCCG

TGAACTT

TTGAAGATGAAGAA

GCTGGTA

TTGAAGT

CTGAATC

GCCAAATCCAAATT

TCGACCGCGACCGA

CTGACGATGACGAC

CTGGTTA

CTGACTG

AGCGCCAGCGCCAT

GGCAACG

CTGAGGC

TGAGGCG

AATGTAGATGTAGG

ATGATAATGATAAA

CTCAAATTCAAATA

CTGATAG

CTGATGG

CCGATGTCGATGTT

GAGCCGGAGCCGGT

CTGATTT

CGCACAG

TTGCGACTGCGACA

GTGTCTG

ACCAAACCCAAACG

TGGCGAG

CTGCATG

TGCCAGG

CGCCCAA

CTGCCCC

TGCCGGG

CCGCCTACGCCTAA

GCACAAA

CCGCGATCGCGATG

GCTGGTT

CTGCGGT

CCGCTCACGCTCAG

ATTGTTG

TTGTTGG

ACGCTGA

CTGGAAA

TTGCGGTTGCGGTA

ACGACCACGACCAC

CTGGCAC

CGGCTGA

CTGGCCCTGGCCCC

ACGGCGCCGGCGCG

GTTACCT

ACGGGAGCGGGAGT

ACGCCGCCGCCGCA

TTGGGGC

TGGGGCA

GCGGGTACGGGTAA

CTGGTAG

AATTAGC

ACCGATCCCGATCC

CGGTGCT

CTGGTGG

TGGTGGC

ACGGTTGCGGTTGA

CTGGTTT

GCGGCTG

GTTCGAATTCGAAA

ACGTAATCGTAATT

CTGGCGATGGCGAC

GGCAGAGGCAGAGG

GCCAAAG

CGCAGATGCAGATT

CCACTAG

CCACTAT

CAGCGGGAGCGGGC

TTTTATTTTTATTG

GTAACGTTAACGTA

TGTCTGC

GCGTCTTCGTCTTA

CAGCGTCAGCGTCC

CCGTGATCGTGATC

TGGAAGCGGAAGCG

TTGTGCCTGTGCCG

TTTCACT

AGTGAAAGTGAAAA

CTGTGTG

AATATTGATATTGG

CTGCAAC

TGCAACT

TGTTGAC

TGTTGGG

CTGTTTA

TTGTTTCTGTTTCG

CTGTTTGTGTTTGG

GTGGAAATGGAAAA

GGGTCGAGGTCGAC

TTTAACC

TACAGGT

TTGTTAG

TGCGCCG

TTTACGC

AGTGCTG

CCGGTTG

TGGCCGCGGCCGCC

TTTCGTC

TTAGCAA

TTAGCGT

TAACCACAACCACG

GTTAGTATTAGTAG

CTTATAGTTATAGC

TGTTGGCGTTGGCG

TTATCGG

GTTATGATTATGAA

TGCATCCGCATCCA

ACTCAACCTCAACA

TTTCACATTCACAG

GAGGCGT

CCTCAGACTCAGAT

TCGGATA

CTGACCA

TTTCATT

GTTCCAATTCCAAC

CGTAGCG

ACTCCCCCTCCCCG

GTCATGCTCATGCA

TTTCCGC

TTTCCTCTTCCTCG

TTTTGCC

TTTCGAC

GCCACCT

CTTCGGATTCGGAC

GTTCGGGTTCGGGC

CAACTTTAACTTTA

GTCAAAATCAAAAA

GTCATTC

AAAACAAAAACAAT

GTTGGGG

CCTGAAC

CTGATGT

AGTGCGC

CTTGCGATTGCGAG

GGCGAGT

CTGGCGG

ACTGGACCTGGACT

GCTGGATCTGGATT

ACTTTACCTTTACC

CTGGGAC

TTGGGGT

AACCAGCACCAGCT

AATTTCAATTTCAG

ATTGTGGTTGTGGC

TTTGTTG

TTGTTGA

CTTTAAATTTAAAC

CTTTAAC

ACTTAAGCTTAAGT

CCTAGGTCTAGGTC

CTTTTCTTTTTCTT

GGGACCCGGACCCG

AACAATTACAATTA

TTTTCAT

TTTTCGA

TTTTCGGTTTCGGC

AATTTGTATTTGTC

TTTTGCATTTGCAG

TGCCCACGCCCACA

GCTTGGCCTTGGCA

CCTTCTG

CCTTTAGCTTTAGC

TTTTATC

TTTTATG

CTTTTCA

TCAGTGC

TTTTTGC

TTTTTGG

CTTTTTC

"Island"

"Tip"

Error correction should remove most tips & islands; rest can be removed here, leveraging graph structure

Page 38: Assembly in Practice: Part 2: DBG

Maternal

Paternal

GCATA

CATAT

TCGGT

CGGTA

CGGCA

GGCAT

GCGCC

CGCCG

ATGCG

TGCGC

ATATGGTATA

TATAT

GTAGT

TAGTC

GGTAT

AGTCT

TATGC

TCGGC

TCTCG

CTCGG

GTCTCGTAGTCTCGGCATATGCGCCG GTAGTCTCGGTATATGCGCCG

Maternal

Paternal

k=6

"Bubble"

Page 39: Assembly in Practice: Part 2: DBG

Assembly alternatives

Overlap

Layout

Consensus

Error correction

De Bruijn graph

Scaffolding

Refine

Remove remaining "islands" “tips” and “bubbles” so that contigs are more obvious