Upload
others
View
11
Download
0
Embed Size (px)
Citation preview
Assembly in Practice: Part 2: DBGBen Langmead
Department of Computer Science
Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me ([email protected]).
Assembly alternatives
Alternative 1: Overlap-Layout-Consensus (OLC) assembly
Alternative 2: De Bruijn graph (DBG) assembly
Overlap
Layout
Consensus
Error correction
De Bruijn graph
Scaffolding
Refine
CGAAAAA
GAAAAAC
AAAAAAC
AAAAACC
AAAAACTAAAAACA
GAAAAAGAAAAAGG
TAAAAAT
AAAAATT
GGAAACA
GAAACAC
AAAACCA
AAAAAAA
AAAAAAG
TGAAAACGAAAACT
AAAAAGA
AAAAGAG
AAAAAGC
AAAAGCC
AAAAGGC
GAAAATGAAAATGT
TAAAATT
AAAATTT
GTATCTATATCTAC
AAACACA
CGCCATT
GCCATTA
AAAACAT
AAACATG
AAACCAT
TGACGGTGACGGTA
AAAACCGAAACCGC
AGCCCGC
GCCCGCA
GAAACGC
AAACGCA
AGAACGTGAACGTT
AAAACTG
AAACTGC
AAAACTTAAACTTT
AAGGTAA
AGGTAAC
AAAGAGT
GAAAGCA
AAAGCAA
AAAGCAG
AAAGCCC
AGAAGCG
GAAGCGC
AAAGGCG
TGAAGTT
GAAGTTC
GAAGTTA
CAAATAA
AAATAAA
TAAATAC
AAATACT
CAGCGAT AGCGATG
AAAATCA
AAATCAC
AAATCAT
CTTAGGT
TTAGGTC
GGAAAGC
CAAATGC
AAATGCA
AAATGTC
AAATGTA
AAAATTA
AAATTAC
GGAATTT
GAATTTG
AAACAAAAACAAAG
TAACAAC
AACAACC
AACAACG
AAGCGCG
AGCGCGT
AACACAG
AGAAAAA
GAAAAAA
AGACAGA
GACAGAT
CTAAATA
TGACAGTGACAGTG
CTGGTAA
TGGTAACCAACATC
AACATCC
AACATGT
TGACATTGACATTG
CGACCAA
GACCAAA
TGACCAC
GACCACA
GACCACC
CGACCAG
GACCAGG
CAACCAT
AACCATC
AACCATG
ATGGTGC
TGGTGCT
GTGGGAC
TGGGACG
CTGACGG
TGACGGG
TGACCTGGACCTGG
CCCCGCC
CCCGCCA
AAAAAATAAAAATG
CTGCTGG
TGCTGGC
TGCTGGT
TGACGAG
GACGAGA
CAACGATAACGATG
GGACGCA GACGCAA
TGACGCG
GACGCGT
GGACGCTGACGCTA
CAACGGG
AACGGGC
TAACGGT
AACGGTG
GAACGTA
AACGTAT
AAAACAGAAACAGC
ACGGTTC
CGGTTCC
CGACTAC
GACTACT
CGCAATGGCAATGG
CAGGTTG
AGGTTGC
GGACTCGGACTCGC
TAACTGCAACTGCC
GAACTGG
AACTGGT
TGACTTA
GACTTAG
GAACTTC
AACTTCT
AACTTTC
CGAGAAAGAGAAAA
GAGAAAT
GCGTAAACGTAAAT
TGCAACG
GCAACGG
AACCGGTACCGGTC
ACCACCA
CCACCAT
CCACCAC
CACCGTAACCGTAT
AAGAGTG
TAAATTA
AAATTAA
AAGCAAT
TAGTGCG
AGTGCGG
AAGCAGA
CTGCGATTGCGATG
AAGCCCG
CAAGCCG
AAGCCGC
ACACAGA
CACAGAA
GTAACGG
CGAGCTG
GAGCTGG
AGAAAAT
GTTCTGATTCTGAT
CGAACGTGAACGTC
AGAGGCAGAGGCAG
AAGGCGA
AAAGGTA
GAAGGTTAAGGTTT
TGAGTAA
GAGTAAA
AGAGTAC
GAGTACA
TGAACTG
ACCGCTG
CCGCTGA
CCTATATCTATATA
CGAGTGTGAGTGTT
CGAGTTAGAGTTAG
AAGTTCG
AATAAAA
CGAGGTA
GAGGTAA
AATACTT
ACATCAG
CATCAGT
TGATAGC
GATAGCA
GTCCTCT
TCCTCTC
CAATATAAATATAGCAATATC
AATATCA
CAATATGAATATGT
CGATATT
GATATTC
AATCACC
AAACAGTAACAGTT
GAATCCA
AATCCAC
CAATCCGAATCCGC
TTCCTGCTCCTGCG
CGATCGC
GATCGCC
ACGTATTCGTATTT
GAATCTA
AATCTAC
CAAAGCC
TAATGAA
AATGAAA
TGCGGGT
GCGGGTT
GCGGGTG
TAATGAGAATGAGA
CGATGAT
GATGATT
AATGCAG
CGATGCC
GATGCCG
CACACCAACACCAT
CTTTTTT
TTTTTTT
TGATGGC
GATGGCT
ACGGGGCCGGGGCT
AATGTCG
AATTAAA
AATTACA
GGCGCTGGCGCTGT
GAATTCCAATTCCG
AAATTCGAATTCGG
CAAAGGCAAAGGCA
TGATTGA
GATTGAA
TGATTTAGATTTAG
GGATTTC
GATTTCC
AATTTGC
AAATTTT
AATTTTA
GACAAAAACAAAAG
GGCAAAT
GCAAATG
TACAACA
ACAACAG
ACAACCA
GGCAAGC
GCAAGCC
ATGTCCT
TGTCCTG
GGCAATA
GCAATAT
AGCAATG
GCAATGC
CACAATT
ACAATTG
TACACAA ACACAAC
TCGTCGA
CGTCGAC
AACACATACACATT
AGCACCA
GCACCAC
GCACCAG
TAGCGAGAGCGAGA
CGCACCGGCACCGA
CGCACCT
GCACCTG
GACACGCACACGCC
CACCGAAACCGAAA
GTCGACCTCGACCA
ACAGAAA
CACAGAC
ACAGACA
TACAGAG
ACAGAGT
ACAGATA
AGCAGCGGCAGCGT
CGCAGCTGCAGCTG
TACAGGA
ACAGGAA
GGCAGGG
GCAGGGG
GCAGGGC
CACAGGT
ACAGGTA
TGCAGTAGCAGTAC
ACAGTGC
GACATACACATACA
GGCATAGGCATAGC
TACATCA
ACATCCA
CCGTTAT
CGTTATC
CACATGAACATGAT
GCTATATCTATATG
TGCATGG
GCATGGC
ACATGTC
CGCATTA
GCATTAG
CGCATTC
GCATTCC
GCATTCG
GGCATTGGCATTGC
TGAATCC
CGCCAAA
GCCAAAA
CACCAAC
ACCAACC
AACCAAT
ACCAATA
TACCACA
ACCACAG
CACCACC
TAAAAACCACCACT
ACCACTA
TACCAGCACCAGCA
ACCAGGA
CGAATCT
ACCATCT
ACCATGC
AACCATT
ACCATTA
TACCCAA
ACCCAAT
TACGGCCACGGCCG
CGCCCAG
GCCCAGC
TGCCCCC
GCCCCCA
GCCCCCG
GAAATGC
CACCCGCACCCGCC
ACCCGCA
TGCCCGG
GCCCGGA
AGCGATTGCGATTA
CGTCGAAGTCGAAT
CGCCCTGGCCCTGC
CAGTGCC
AGTGCCC
TGCCGAA
GCCGAAC
TACCGAGACCGAGG
GGAATGTGAATGTG
AGCCGCA
GCCGCAT
CGCCGCC
GCCGCCC
GCCGCCG
ACGGGCA
CGATCCG
GATCCGG
GTCAGCG
TCAGCGA
GGCCGGC
GCCGGCG
AGCCGGG
GCCGGGA
TGCCGGT
GCCGGTA
GCCGGTC
CGCCGTA
GCCGTAT
TACCGTCACCGTCA
TGCCGTG
GCCGTGA
GCCGTGG
TACCGTT
ACCGTTA
ACCGTTG
TTAATTGTAATTGA
TACCTCG
ACCTCGA
CAGTGGC
AGTGGCA
CACCTGA
ACCTGAC
ATCGCCA
TCGCCAT
TGCCTGGGCCTGGT
GGCGCGTGCGCGTT
CAGCGTGAGCGTGT
CACCTTTACCTTTA
GCGGGTCCGGGTCG
TCACAAT
TCGTGGT
CGTGGTC
CGTGGTT
CCCGCCG
CCGCCGT
TAAAAAA
CACAACA
GGCGAGA
GCGAGAA
GGCGAGC
GCGAGCT
AACGAGG
ACGAGGT
TGCGAGT GCGAGTG
AGCGATAGCGATAA
TAGCGCG
GCGATGA
GCGATGC
ACGGTGC
CGGTGCG
ACGCAAC
AGCGCAC
GCGCACA
AACGCAT
ACGCATT
CAAAAGCAAAAGCA
CCGTGTGCGTGTGG
CGCGCCG
GCGCCGA
TACGCGC
ACGCGCC
TTACACA
AACGCGTACGCGTG
AACGCTG
ACGCTGC
TACGGAGACGGAGT
ACGCGTA
TGCGGCA
GCGGCAA
GCGGCAG
GGCGGCCGCGGCCT
GGCGGCGGCGGCGT
GGCGGGAGCGGGAC
TGCGGGC
GCGGGCT
TACACAGACACAGT
TTTTTTC
TTTTTCG
GGCGGTA
GCGGTAC
AGCGGTCGCGGTCA
AACGGTT
ACGGTTA
GCACAGA
TGCGCTG
GCGCTGA
CATTACC
ATTACCT
ATTACCA
GGCGTGT
GCGTGTT
AACGTTA
ACGTTAC
GTACAGG
AACGTTTACGTTTT
CACTAAA
ACTAAAT
AAGTTTCAGTTTCG
GAGTTCG
AGTTCGG
TGAAAAA
ACTACTC
CTCTGCC
TCTGCCC
AGCTATGGCTATGA
TTACATCTACATCG
CCGTTGA
CGTTGAT
TACTCCAACTCCAC
TACTGAAACTGAAA
TACTCCG
ACTCCGC
AACTCCTACTCCTG
ACTCGCC
TGCAGAA
GCAGAAC
CAACATTAACATTC
AGCTGGT
GCTGGTG
TGCTGAA
GCTGAAT
GGCTGAC
GCTGACG
GCTGACC
GACTGAGACTGAGG
CGCTGAT
GCTGATT
GCTGATC
GACTGCA
ACTGCAA
CACTGCC
ACTGCCG
ACTGCCT
CGCTGCG
GCTGCGC
GCTGCGG
TTACCAC
TACCACC
CGCTGGC
GCTGGCA
ACTGGTT
TCCGCCCCCGCCCC
AACACCAACACCAC
AGCTGTTGCTGTTG
GTGCTGG
GACGGGA
ACGGGAC
ACTTAGG
CACCATC
CACCATT
CAGTTTG
AGTTTGT AGTTTGC
ACTTCTG
TCGCCGT
ACAGCGA
CAGAGTG AGAGTGT
TGCTTTA
GCTTTAC
ACTTTCG
TAAAACA
GGCTTTT
GCTTTTT
ATAAAAC
ATAAAAA
GGGAAATGGAAATA
CCACCCG
CAGAACG
GGGAAGCGGAAGCA
AGGAAGGGGAAGGG
CAAAATC
AGGAATT
CAGACAG
AGTAAAT
GTAAATT
ACCAAAG
TGCGGTG
GCGGTGC
GGGACGC
GGACGCG
GGGACTAGGACTAG
GGGACTC
GGAACCAGAACCAC
TTACCGT
CGCGTACGCGTACA
AGTAACTGTAACTT
CAGAGTA
ATGCCAATGCCAAG
TTACCTC
CAGATAA
AGATAAA
AGATAAT
CGGATAG
GGATAGC
TTACCTG
TACCTGC
CTGACGC
TGGATTA GGATTAA
GGGATTCGGATTCC
TGCCCGC
GCCCGCG
GGGATTT
TTAAAATTAAAATA
CGGCAAG
GGGCAAT
TGGCACAGGCACAA
TAGCACC
TGGCACGGGCACGA
TAGCAGC
AGCAGCT
AGGCAGG
TGGCAGTGGCAGTG
AGGCATA
GGCATAC
TAGCATC AGCATCA
CGGCCAA
GGCCAAA
TGGCCAC
GGCCACC
GTGGTCATGGTCAC
GAGCCGAAGCCGAT
TGGCCGG
CTGAGTGTGAGTGC
CGCGTGG
GCGTGGT
CGGCCTGGGCCTGT
AGGCGAG
TAGCGATTAGCGCA
CAGCGCG
CACCGCCACCGCCA
TGGCGGC
TTATCACTATCACC
CGGCGGT
CGTACAG
CGGCGTG
TGCTGAT
GCTGATG
GCCCGAACCCGAAT
TTCGCTGTCGCTGG
GGGCTCTGGCTCTT
GGGCTGA
ACCTGGTCCTGGTA
TTTACCC
TTACCCA
CAGGAAA
AGGAAAC
ACCCCGACCCCGAA
CAACGGT
TCTACCG
CTACCGT
CGGGACT
CAGGATAAGGATAG
CAGGATG
AGGATGC
CGGGATT
CGGGCAA
GGGGCAG
GGGCAGG
GGGCAGT
TAGGCAT
CGGGCCTGGGCCTT
CGGGCTG
CGGGCTT
GGGCTTT
CACCGCT
AGGGGCA
GGGGGCCGGGGCCT
CAGGGGC
CCTATTTCTATTTT
CTACTCC
CAGGTAA
GAGGTACAGGTACA
GGGGTAGGGGTAGC
TAGGTCA
AGGTCAC
TGGGTCA
GGGTCAT
GGGTCAG
GGGGTGAGGGTGAC
CAGGTGG
AGGTGGC
TATACTT
ATACTTT
TGGGTTAGGGTTAT
CAGGTTCAGGTTCT
CGGGTTG
GGGTTGC
GGTAACG
GGTAACA
GGTAACC
GAGTAAGAGTAAGT
CGGTAAT
GGTAATG
AGTACAC
AGTACAT
TTGAGGCTGAGGCA
TGGTAGC
GGTAGCG
GGTCACT
GGTCACC
CGGTCAG
GGTCAGG
GGTCATT
TGACTGC
CGGTCGA
GGTCGAA
TGGTCGTGGTCGTT
ATCGATC
TCGATCC
CCAGGAA
GCAGCTT
GTAGCGA
GGTGCGG
GGTGCTG
TAACCAA
GGTGGCC
CAGTGGG
AGTGGGT
TGGTGGT
GGTGGTT
GAGTGTC
AGTGTCT
CAGTGTGAGTGTGG
AGTGTTG
TGGTTAC
GGTTACC
TCACCACCACCACA
GGTTCCG
TGGTTCT
GGTTCTG
GGTTGCC
GGGTTGTGGTTGTC
TGGTTTC
GGTTTCA
CCAACCA
AGCAGGAGCAGGAA
GATAAAA
CATAAACATAAACG
ATTGAAA
CTCCGCT
TCCGCTG
GATAACAATAACAA
AATAACCATAACCA
GTAACGA
ATAGAAC
TAGAACA
GTAATGA
AATACAAATACAAC
GTACACA
CCATTAC
GGTACAT
GTACATC
GCCTGTGCCTGTGT
AATAGAA
ATAGCAT
ATAGCAG
CATAGCG
ATAGCGC
TATAGGCATAGGCA
GATAGTAATAGTAT
CGCAACG
CTGCCGG
AGTATACGTATACA
ATATAGG
ACGTCACCGTCACT
ATATCAG
CTCTCTG
TCTCTGC
TATATCGATATCGG
GTAACAA
ATATGTC
ATATTCT
GATATTG
ATATTGC
GTATTTT
ACTGCACCTGCACC
CATCAAC
ATCAACG
CAAAGGT
GGTCACA GTCACAA
ATCACCA
TTCTGAC
TCTGACG
TCTGACT
GTCAGGA
ATCAGTG
ATCAGTT
TGCCGAGGCCGAGA
GTCATTA
ATCCACC
ATCCACG
CCACCGT
CACCGTC
CATCCAT
ATCCATG
AGCTTTA
GATCCCGATCCCGA
ACCGTCC
ATCCGGT
CGTTCCGGTTCCGG
CGTCCTC
GTCCTGC
CGCTGCCGCTGCCC
GTCGAAA
TCACCAT
TATCGAT
CCAGCACCAGCACC
CCGTCCT
TGTCGGGGTCGGGG
AACTGCT
ACTGCTG
GATCGTAATCGTAT
TTCACCA
TCACCAA
CGTATTCGTATTCC
ATCTACC
CGTATTG
GTATTGC
GCTGGCG
TGTCTGA
GTCTGAT
ATCAGGTTCAGGTA
CATCTGG
ATCTGGT
CATCTGTATCTGTG
TGTCTTAGTCTTAG
TTTTCTG
TTTCTGC
CTTTCGT
ATGAAAA
ATGAAAG
GTTATGGTTATGGA
GTGCTGATGCTGAC
CGTGAGT
GTGAGTA
GGTGATGGTGATGG
ATGATTG
ATGCAGA
TGTCAATGTCAATC
AATGCCA
ATGCCAG
GTGCCCG
ATGCCGA
CATGCGA ATGCGAG
GTGCGGG
CATCACC
AATGGAAATGGAAA
TGTGGAT GTGGATT
GTGGCAA
GTGGCCA
CGTGGCG
GTGGCGA
ATGGCTG
GTGGGTCCATGGTG
GTGGTTC
GTCACTA
ACGGGTT
CATGTCC
ATGTCGA
TATGTCT
ATGTCTC
CACTCCG
GGGATTAGGATTAC
TGTGTGG
GTGTGGA
CGTGTTA
GTGTTAG
GTGTTGA
GATTAAAATTAAAA
ATTACAG
CGTTACC
GTTACCG
TATTACGATTACGG
CATTACTATTACTA
TGTTAGA
GTTAGAA
CATTAGC
ATTAGCA
ATTAGCG
AGCTTCT
GCTTCTG
AGTTATAGTTATAG
GTTATCG
CATTATG
ATTATGG
CGGTACA
CATTCCG
ATTCCGG
CCAGGCA
CAGGCAG
GTTCGGC
GGTTCTCGTTCTCG
TATTCTG
ATTCTGG
GTTTCGCTTTCGCT
TGTTGAA
GTTGAAG
TATTGACATTGACT
GTTGATA
GTTGCCG
TATTGCG
ATTGCGG
TATTGCT
ATTGCTG
TACTTTAACTTTAA
TTGATAT
TCCGGTC
CCGGTCG
TTACCCGTACCCGA
AGGCTGA
TGTTTACGTTTACG
GTTTCAC
ATTTCCG
AGTTTGAGTTTGAT
GATTTGCATTTGCC
GTTTGTT
ATTTTAT
CGTTTTC
GTTTTCT
TTTGCCG
TTGCCGA
TTGCCGT
TATTTTT
ATTTTTG
TTAAAAA
CCAAAAT
ACAAAGC
CCAAAGG
CCAAATA
TGGAAAG
CAACAGG
ACAACAT
TTAACCA
TAACGAG
TAACGAC
TCAACGC
CAACGCT
CAACGCC
ACAACGTCAACGTT
AATCCGG
CGCCGAT
GCCGATT
TTCCGCT
CGAGTAA
GACTTCT
CCGAACT
CGAACTT
GCGCGTG
TATCAGC
ATCAGCG
GCAATACCAATACG
TAACCGTAACCGTG
CCAATAT
AGCTTTT
GCTTTTC
CAATGCC
TTAATTATAATTAA
CCGAAGGCGAAGGT
CAATTGA
TCACAACCACAACG
GCACAAT
TTACAGA
TACAGAT
CAGAATAAGAATAC
CCACAGG
TCAGTCTCAGTCTG
GTACATATACATAA
CCACATG
CACATGG
TCGAATC
ACACCAGCACCAGC
TGTCGAT
GTCGATC
CCTGCCG
CTGCCGT
GCAGTGC
TTACCGCTACCGCT
GCAGTGG
GGATGCT
GATGCTT
TAGCGGT
AATTGAA
TTACGCG
TATTGATATTGATG
TTACGGTTACGGTG
TTACGTCTACGTCG
TCACTAA
CACTAAC
TGAAACG
GCACTATCACTATT
TACTCCC
GTTCCGA
TCACTGC
CACTGCA
ATACTGGTACTGGA
CAGAAAA
TTAGAAG
TAGAAGC
AGTCGGTGTCGGTC
CGATGGCGATGGCC
ATAGCAATAGCAAC
TTAGCAC
CCAGCCACAGCCAA
CCAGCCG
CAGCCGG
GCGACGCCGACGCC
TTAGCGG
TCGAAAA
GTGACGG
CAGCTTC
CAGGAAT
TCAGGAT
CAGGATC
CACATCA
CATTCTG
ATTCTGA
ACGACTA
ATATTACTATTACA
GCAGGTG
GCAGGTT
GCAGTGACAGTGAG
CAGTGCG
TCAGTGG
ACAGTTGCAGTTGA
TCAGTTT
ATATAAATATAAAA
CGATTCTGATTCTA
TGGCTGG
GGCTGGT
TGTCTCT
GTCTCTG
GCATCAA
TTCTGCG
TTATCGA
TGATATT
GTTCTGG
TTCTGGG
CCATCTG
TCGAGCCCGAGCCG
CCATGAA
CATGAAA
CATGAAC
ACATGACCATGACA
ACATGCACATGCAG
CCATGCG
CCATCAC
CATGGCA
CATGGCT
ACATGGT
GCATCAG
ATTCCGCTTCCGCC
CCCAACGCCAACGT
TCATTAC
CCATTAG
CCATTAT
CATTATC
GGTGCCGGTGCCGA
TCATTCT
TTATTGACGTCTTT
GTCTTTT
ATGAAAC
CCCAAAT
TCTGAAC
CTGAACT
GCGAGTACCCAATACCAATAG
GTCAATTTCAATTA
ACCACAT
TCACCAG
TCCACCC
CCACCCT
GCCACCG
GCCTTTTCCTTTTT
ACCACGA
CCACGAA
CCACGAC
CCATGGT
TTCACTG
TCTGACCCTGACCG
ACCAGAGCCAGAGG
CCCAGCC
GACATCA
GCCAGGC
TTGACTT
TTGACTC
TTGACTG
GTCATAATCATAAC
ACCATCA
CCATCAG
TCCATGA
CTCATGGTCATGGT
TTCATTC
CTCGCCG
TCGCCGC
GCCCAAA
AATGAGTATGAGTG
GCCCACGCCCACGC
CTTCTGA
CTTCTGG
CAACGCAAACGCAG
ATGGCAT
CCCCCAC
CCCCCGC
TGGCAAA
TATCCATATCCATC
ATGATGG
AATGCCC
TTCGGCG
TCGGCGG
TTCCGAC
TCCGACT
TCCGACA
CCCGCAC
CCCGCGC
GGCATCA
CCCGGAT
CCCGGAC
TTCCGGC
TCCGGCT
TTCCGTCTCCGTCG
TTATGGC
TATGGCC
TCCTGCA
AGTCACT
TTTGCCC
TTGCCCACCGAACG
TTCGAAGTCGAAGC
CTCGAAT
TTCGACC
CCGACTA
GCCGATA
CCGATAT
TCGATCG
GTTGCGGTTGCGGA
CCGATTG
CCGCACC
CCGCATT
CCGCCAA
CCGCCCA
CCGCCGC
CTGCGCT
AGGCACA
GCCGCGGCCGCGGT
CGCATGGGCATGGT
CCCCCGACCCCGAC
TTGCGGC
CCGCTGG
CCGCTGC
CCGGATA
CCGGCGT
CCGGCTG
CCGGGAT
CCGGTAA
CCGTATT
TTGCTGA
TTCGTCG
GGACACAGACACAC
CCGTGAG
CCGTGGC
GCCTGTT
GGCAGGT
TGCCGAT
ATGCTTT
TGGCAAGGGCAAGA
ACCTGCC
TCTGCGG
CTGCGGG
TGGCATC GGCATCC
GCCTCAACCTCAAC
GCTGACA
CTGACAG
TACAAAAACAAAAC
ACCTCCACCTCCAA
ATCTCCG
TCTCCGC
CCTCGAA
ACTGGCA
CTGGCAG
CCTCTCT
TCTCTGT
ATGGCCG
TTCTGAA
CCTGACA
TCTGATA
CCTGCAT
TTCTGGA
TCTGGAA
AAATTTAAATTTAT
TCTGGGA
TCTGGTA
CTCTGTG
TCTGTGT
CCTGTTT
CTAAGGTTAAGGTA
TGATTTG
TTGAAAA
TGCCCAA
GGATTTTGATTTTT
GTGACGA
GTGCCCC
GCGCAGTCGCAGTG
CTGCCCG
TGAACTT
TTGAAGATGAAGAA
GCTGGTA
TTGAAGT
CTGAATC
GCCAAATCCAAATT
TCGACCGCGACCGA
CTGACGATGACGAC
CTGGTTA
CTGACTG
AGCGCCAGCGCCAT
GGCAACG
CTGAGGC
TGAGGCG
AATGTAGATGTAGG
ATGATAATGATAAA
CTCAAATTCAAATA
CTGATAG
CTGATGG
CCGATGTCGATGTT
GAGCCGGAGCCGGT
CTGATTT
CGCACAG
TTGCGACTGCGACA
GTGTCTG
ACCAAACCCAAACG
TGGCGAG
CTGCATG
TGCCAGG
CGCCCAA
CTGCCCC
TGCCGGG
CCGCCTACGCCTAA
GCACAAA
CCGCGATCGCGATG
GCTGGTT
CTGCGGT
CCGCTCACGCTCAG
ATTGTTG
TTGTTGG
ACGCTGA
CTGGAAA
TTGCGGTTGCGGTA
ACGACCACGACCAC
CTGGCAC
CGGCTGA
CTGGCCCTGGCCCC
ACGGCGCCGGCGCG
GTTACCT
ACGGGAGCGGGAGT
ACGCCGCCGCCGCA
TTGGGGC
TGGGGCA
GCGGGTACGGGTAA
CTGGTAG
AATTAGC
ACCGATCCCGATCC
CGGTGCT
CTGGTGG
TGGTGGC
ACGGTTGCGGTTGA
CTGGTTT
GCGGCTG
GTTCGAATTCGAAA
ACGTAATCGTAATT
CTGGCGATGGCGAC
GGCAGAGGCAGAGG
GCCAAAG
CGCAGATGCAGATT
CCACTAG
CCACTAT
CAGCGGGAGCGGGC
TTTTATTTTTATTG
GTAACGTTAACGTA
TGTCTGC
GCGTCTTCGTCTTA
CAGCGTCAGCGTCC
CCGTGATCGTGATC
TGGAAGCGGAAGCG
TTGTGCCTGTGCCG
TTTCACT
AGTGAAAGTGAAAA
CTGTGTG
AATATTGATATTGG
CTGCAAC
TGCAACT
TGTTGAC
TGTTGGG
CTGTTTA
TTGTTTCTGTTTCG
CTGTTTGTGTTTGG
GTGGAAATGGAAAA
GGGTCGAGGTCGAC
TTTAACC
TACAGGT
TTGTTAG
TGCGCCG
TTTACGC
AGTGCTG
CCGGTTG
TGGCCGCGGCCGCC
TTTCGTC
TTAGCAA
TTAGCGT
TAACCACAACCACG
GTTAGTATTAGTAG
CTTATAGTTATAGC
TGTTGGCGTTGGCG
TTATCGG
GTTATGATTATGAA
TGCATCCGCATCCA
ACTCAACCTCAACA
TTTCACATTCACAG
GAGGCGT
CCTCAGACTCAGAT
TCGGATA
CTGACCA
TTTCATT
GTTCCAATTCCAAC
CGTAGCG
ACTCCCCCTCCCCG
GTCATGCTCATGCA
TTTCCGC
TTTCCTCTTCCTCG
TTTTGCC
TTTCGAC
GCCACCT
CTTCGGATTCGGAC
GTTCGGGTTCGGGC
CAACTTTAACTTTA
GTCAAAATCAAAAA
GTCATTC
AAAACAAAAACAAT
GTTGGGG
CCTGAAC
CTGATGT
AGTGCGC
CTTGCGATTGCGAG
GGCGAGT
CTGGCGG
ACTGGACCTGGACT
GCTGGATCTGGATT
ACTTTACCTTTACC
CTGGGAC
TTGGGGT
AACCAGCACCAGCT
AATTTCAATTTCAG
ATTGTGGTTGTGGC
TTTGTTG
TTGTTGA
CTTTAAATTTAAAC
CTTTAAC
ACTTAAGCTTAAGT
CCTAGGTCTAGGTC
CTTTTCTTTTTCTT
GGGACCCGGACCCG
AACAATTACAATTA
TTTTCAT
TTTTCGA
TTTTCGGTTTCGGC
AATTTGTATTTGTC
TTTTGCATTTGCAG
TGCCCACGCCCACA
GCTTGGCCTTGGCA
CCTTCTG
CCTTTAGCTTTAGC
TTTTATC
TTTTATG
CTTTTCA
TCAGTGC
TTTTTGC
TTTTTGG
CTTTTTC
Is it fast to build? Slow?Is it small? Big?
De Bruijn graph
a_long_long_long_time Genome:
Reads:k-mers:
Pick k = 8
a_long_l _long_lo long_lon ong_long ng_long_ g_long_l _long_lo long_lon ong_long
g_long_t _long_ti long_tim ong_time
ng_long_ g_long_l
g_long_
_long_l _long_t
long_ti
ong_tim
long_lo
ong_lon
ng_long ng_time
a_long_
For each read:
For each k-mer:
Add k-mer’s left and right k-1-mers to graph if not there already. Draw an edge from left to right k-1-mer.
a_long_long_long, ng_long_l, g_long_time
De Bruijn graph
To build graph: Pick k. Usually k is short relative to read length (k = 30 to 50 is common).
Sequencer outputs d reads of length n, total length N = dn.
d = 6 x 109 reads n = 100 nt
{
≈ 1 week-long run of
Illumina HiSeq 2000
For each read:
For each k-mer:
Add k-mer’s left and right k-1-mers to graph if not there already. Draw an edge from left to right k-1-mer.
De Bruijn graph
To build graph: Pick k. Usually k is short relative to read length (k = 30 to 50 is common).
Sequencer outputs d reads of length n, total length N = dn.
d = 6 x 109 reads n = 100 nt
{
≈ 1 week-long run of
Illumina HiSeq 2000
# k-mers (edges): O(N)
# nodes is at most 2 ∙ (# edges); typically much smaller due to repeated k-1-mers
O(N)
De Bruijn graph
How much work to build graph?
For each k-mer, add 1 edge and up to 2 nodes
Reasonable to say this is O(1) expected work
Say hash map holds nodes & edges
Say k-1-mers fit in O(1) machine words, and hashing O(1) words is O(1) work
Querying / adding a key is O(1) expected work
O(1) expected work for 1 k-mer, O(N) overall
g_long_
_long_l _long_t
long_ti
ong_tim
long_lo
ong_lon
ng_long ng_time
a_long_
De Bruijn graph
Timed De Bruijn graph construction applied to progressively longer prefixes of lambda phage genome, k = 14
0 10000 20000 30000 40000 50000
0.00
0.05
0.10
0.15
0.20
Length of genome
Sec
onds
requ
ired
to b
uild
O(N) expectation works in practice
(in this case at least)
De Bruijn graph
In typical assembly projects, average coverage is ~ 30 - 50
ng_l
g_lo a_lo
_lon
long
ong_
ng_t
g_ti
_tim
time
De Bruijn graph
ng_l
g_lo a_lo
_lon
long
ong_
ng_t
g_ti
_tim
time
ng_l
g_lo
20
a_lo
_lon
10
long
ong_
30
20
ng_t
10
g_ti
_tim
10
10
20
30
time
10
Before: one edge per k-mer
After: one weighted edge per distinct k-mer
De Bruijn graph
Say (a) reads are error-free, (b) we have one weighted edge for each distinct k-mer, and (c) length of genome is G
# of nodes and edges both O(N)
So # of nodes and edges are both O(G)
Can’t have more distinct k-mers than k-mers in the genome; likewise for k-1-mers
Combine with the O(N) bound and the # of nodes and edges are both O(min(N, G))
1 node per distinct k-1-mer, 1 edge per distinct k-mer
ng_l
g_lo
20
a_lo
_lon
10
long
ong_
30
20
ng_t
10
g_ti
_tim
10
10
20
30
time
10
De Bruijn graph
At high coverage, O(min(N, G)) bound is advantageous
10 20 30 40 50
20000
40000
60000
80000
Average coverage
# de
Bru
ijn g
raph
nod
es +
edg
es
k = 30
Genome: lambda phage (~48,500 bp)
Draw random k-mers until target average coverage (x axis) is reached
Build graph, sum # nodes and # edges (y axis)
De Bruijn graph
O(G) plateau
10 20 30 40 50
20000
40000
60000
80000
Average coverage
# de
Bru
ijn g
raph
nod
es +
edg
es
k = 30
Build graph, sum # nodes and # edges (y axis)
Genome: lambda phage (~48,500 bp)
Draw random k-mers until target average coverage (x axis) is reached
At high coverage, O(min(N, G)) bound is advantageous
De Bruijn graph
Advantages
Can build in O(N) expected time, N = total length of reads
With error-free data, space is O(min(N, G)); G = genome length
Compares favorably with overlap graph
Fast construction (suffix tree) is O(N + a) time, where a is O(d2)
Overlap graph has node for every read, edge for every overlap
When average coverage is high, G ≪ N
De Bruijn graph
Disadvantages
Reads are immediately split into shorter k-mers, losing the ability to resolve some repeats resolvable by overlap graph
We lose read coherence. Some paths through De Bruijn graph are inconsistent with respect to input reads.
Only relatively short, exact overlaps are considered, which makes handling of sequencing errors more complicated
Assembly alternatives
De Bruijn Overlap
Time to build O(N)Suffix tree: O(N + a)
Dyn Prog: O(N2)
SpaceO(N)
Error-free: O(min(N, G))O(N + a)
n = # readsd = read lengthN = dn = # bases
When average coverage is high, G ≪ N and the G is the more relevant bound for De Bruijn graph size
a = # overlaps; a ∈ O(n2)G = source genome length
Error correction
When data is error-free, # nodes, edges in De Bruijn graph is O(min(G, N))
10 20 30 40 50
0e+002e+044e+046e+048e+041e+05
Average coverage
# de
Bru
ijn g
raph
nod
es +
edg
es
What about data with sequencing errors?O(G) plateau
k = 30
Average Lambda phage coverage
# D
e Br
uijn
gra
ph e
dges
Error correction
For large k, set of k-mers in genome is tiny subset of all 4k k-mers
Errors tend to yield new k-mers that don’t appear elsewhere
Given k-mer from genome, we expect most of its neighbors (e.g. by Hamming distance) are not in the genome
How many possible DNA strings of length k? 4k
How many possible DNA strings of length 20? 420 = 240 ≈ 1 trillionHow many strings of length 20 in human genome? ~3 billion
Analogy: correctly / incorrectly spelled words in collection of documents
Error correction
Correcting errors up-front prevents De Bruijn graph from growing far beyond O(G) plateau
How to correct?
Analogy: how to spell check a language you’ve never seen before?
Errors tend to turn frequent words (k-mers) to infrequent ones. Corrections should do the reverse.
Error correctionng_l
g_lo
20
a_lo
_lon
10
long
ong_
30
20
ng_t
10
g_ti
_tim
10
10
20
30
time
10
Left: Take example, mutate a k-mer character randomly with probability 1%
ng_l
g_lo
20
a_lo
_lon
9
lolg
olg_
1
long
ong_
29
onga
ngal
1
atlo
tlon
1
19
ng_t
10
g_ti
_tim
10
_l_n
l_ng
1
1
10
_loo
loog
1
20
27
time
10
ng_l
g_lo
20
a_lo
_lon
9
lolg
olg_
1
long
ong_
29
onga
ngal
1
atlo
tlon
1
19
ng_t
10
g_ti
_tim
10
_l_n
l_ng
1
1
10
_loo
loog
1
20
27
time
10
Right: 6 errors yield 10 new nodes, 6 new weighted edges, all with weight 1
Error correction
Same experiment as before, with 5% error added
10 20 30 40 50
050000
150000
250000
Average coverage
# de
Bru
ijn g
raph
nod
es +
edg
es
0% error
5% error
O(G) plateau
Errors "push through" G bound
As more k-mers overlap errors, # nodes & edges approach N
k = 30
Average Lambda phage coverageAverage Lambda phage coverage
# D
e Br
uijn
gra
ph e
dges
0%
G
10 20 30 40 50
050000
150000
250000
Average coverage
# de
Bru
ijn g
raph
nod
es +
edg
es
1%
Error correction
Same experiment as before, with 5% error added5%
Errors "push through" G bound
As more k-mers overlap errors, # nodes & edges approach N
k = 30
Average Lambda phage coverage
(Now with 1% error added)
# D
e Br
uijn
gra
ph e
dges
Average Lambda phage coverage
Error correction
k-mer count histogram:
x axis is an integer k-mer count, y axis is # distinct k-mers with that count
Right: such a histogram for 3-mers of CATCATCATCATCAT:
CAT occurs 5 times
ATC and TCA occur 4 times
Error correction
~ 6,100 distinct k-mers occurred exactly 10 times in the input
5 10 15 20 25
02000
4000
6000
8000
10000
k-mer count
# di
stin
ct k
-mer
s w
ith th
at c
ount
Error-free
Draw 20-mers from genome randomly until each 20-mer has been drawn 10 times on average
How would the picture change for data with 1% error rate?
Error correction
~ 9.6K k-mers occur once
32 k-mers occur once
k-mers with errors usually occur fewer times than error-free k-mers
5 10 15 20 25
02000
4000
6000
8000
10000
k-mer count
# di
stin
ct k
-mer
s w
ith th
at c
ount
Error-free0.1% error
Error correction
Idea: errors tend to turn frequent k-mers to infrequent k-mers, so corrections should do the reverse
Say each 8-mer occurs an average of ~10 times:
GCGTATTACGCGTCTGGCCT
GCGTATTA: 8 CGTATTAC: 8 GTATTACG: 9 TATTACGC: 9 ATTACGCG: 10 TTACGCGT: 10 TACGCGTC: 11 ACGCGTCT: 11 CGCGTCTG: 10 GCGTCTGG: 10 CGTCTGGC: 11 GTCTGGCC: 9 TCTGGCCT: 8
Read:
8-mers:
# times each 8-mer occurs in the reads. “k-mer count profile”
All 8-mer counts are near average, suggesting read is error-free
(20 nt)
Error correction
Suppose there’s an error
GCGTACTACGCGTCTGGCCT
GCGTACTA: 1 CGTACTAC: 2 GTACTACG: 1 TACTACGC: 1 ACTACGCG: 2 CTACGCGT: 1 TACGCGTC: 9 ACGCGTCT: 8 CGCGTCTG: 10 GCGTCTGG: 10 CGTCTGGC: 11 GTCTGGCC: 9 TCTGGCCT: 8
Read:
Around average
Below averagek-mer count profile has corresponding stretch of below-average counts
Error correction
k-mer counts when errors are in different parts of the read:
GCGTACTACGCGTCTGGCCTGCGTACTA: 1 CGTACTAC: 3 GTACTACG: 1 TACTACGC: 1 ACTACGCG: 2 CTACGCGT: 1 TACGCGTC: 9 ACGCGTCT: 8 CGCGTCTG: 10 GCGTCTGG: 10 CGTCTGGC: 11 GTCTGGCC: 9 TCTGGCCT: 8
GCGTATTACACGTCTGGCCTGCGTATTA: 8 CGTATTAC: 8 GTATTACA: 1 TATTACAC: 1 ATTACACG: 1 TTACACGT: 1 TACACGTC: 1 ACACGTCT: 2 CACGTCTG: 1 ACGTCTGG: 1 CGTCTGGC: 11 GTCTGGCC: 9 TCTGGCCT: 8
GCGTATTACGCGTCTGGTCTGCGTATTA: 8 CGTATTAC: 8 GTATTACG: 9 TATTACGC: 9 ATTACGCG: 9 TTACGCGT: 12 TACGCGTC: 9 ACGCGTCT: 8 CGCGTCTG: 10 GCGTCTGG: 10 CGTCTGGT: 1 GTCTGGTC: 2 TCTGGTCT: 1
Index
y
Index
y
Index
y
Error correction
Count profile indicates where errors are
2 4 6 8 10 12
24
68
10
k-mer position
count
These probably overlap an error
Error correction
Simple algorithm, given a count threshold t:
For each read:
For each k-mer:
If k-mer count < t:
Examine k-mer’s neighbors within some Hamming/edit distance. If neighbor has count ≥ t, replace old k-mer with neighbor.
5 10 15 20 25
02000
4000
6000
8000
10000
k-mer count
# di
stin
ct k
-mer
s w
ith th
at c
ount
Error-free0.1% error
Pick t corresponding to dip between the peaks
Error correction: implementation excerpt
def correct1mm(read, k, kmerhist, alpha, thresh): ''' Return an error-corrected version of read. k = k-mer length. kmerhist is kmer count map. alpha is alphabet. thresh is count threshold above which k-mer is considered correct. ''' # Iterate over k-mers in read for i in range(0, len(read)-(k-1)): kmer = read[i:i+k] # If k-mer is infrequent... if kmerhist.get(kmer, 0) <= thresh: # Look for a frequent neighbor for newkmer in neighbors1mm(kmer, alpha): if kmerhist.get(newkmer, 0) > thresh: # replace with neighbor read = read[:i] + newkmer + read[i+k:] break return read
Full Python example: http://bit.ly/CG_ErrorCorrect
Error correction: results
Corrects 99.2% of errors in an example with 0.1% error added
Before After
0 50 100 150
01000
2000
3000
4000
k-mer count
# di
stin
ct k
-mer
s w
ith th
at c
ount
Error-free0.1% error
0 50 100 150
01000
2000
3000
4000
k-mer count
# di
stin
ct k
-mer
s w
ith th
at c
ount
Error-free0.1% error, corrected
From 194K k-mers occurring exactly once to just 355
Error correction: results
5 10 15
60000
100000
140000
Average coverage
# de
Bru
ijn g
raph
nod
es +
edg
es
Error-free1% error, corrected1% error, uncorrectedG bound
Uncorrected, graph size is off the chart
Corrected, graph size is near G bound
Also works for 1% error...#
De
Brui
jn g
raph
edg
es
Average Lambda phage coverage
...provided enough coverage to distinguish frequent/infrequent
Error correction
Average coverage & k must be such that we can distinguish frequent from infrequent k-mers
To work well:
k-mer neighborhood explored must be broad enough to find frequent neighbors. Depends on error rate and k.
Data structure for storing k-mer counts should be smaller than the De Bruijn graph
Otherwise, what's the point? 🙂
Alternately, we might give up on correcting and simply remove bad k-mers
Data structures for error correction
Bloom filters
Counting quotient filters
Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biology. 2014;15(11):509.
Pandey P, Bender MA, Johnson R, Patro R. Squeakr: an exact and approximate k-mer counting system. Bioinformatics. 2017; btx636.
Don’t need 100% accurate k-mer counts; just have to distinguish frequent and infrequent
CountMin sketchesCrusoe MR, Alameldin HF, Awad S, Boucher E, ..., Brown CT. The khmer software package: enabling efficient nucleotide sequence analysis. F1000 Research. 2015 Sep 25;4:900.
Assembly alternatives
Overlap
Layout
Consensus
Error correction
De Bruijn graph
Scaffolding
Refine
CGAAAAA
GAAAAAC
AAAAAAC
AAAAACC
AAAAACTAAAAACA
GAAAAAGAAAAAGG
TAAAAAT
AAAAATT
GGAAACA
GAAACAC
AAAACCA
AAAAAAA
AAAAAAG
TGAAAACGAAAACT
AAAAAGA
AAAAGAG
AAAAAGC
AAAAGCC
AAAAGGC
GAAAATGAAAATGT
TAAAATT
AAAATTT
GTATCTATATCTAC
AAACACA
CGCCATT
GCCATTA
AAAACAT
AAACATG
AAACCAT
TGACGGTGACGGTA
AAAACCGAAACCGC
AGCCCGC
GCCCGCA
GAAACGC
AAACGCA
AGAACGTGAACGTT
AAAACTG
AAACTGC
AAAACTTAAACTTT
AAGGTAA
AGGTAAC
AAAGAGT
GAAAGCA
AAAGCAA
AAAGCAG
AAAGCCC
AGAAGCG
GAAGCGC
AAAGGCG
TGAAGTT
GAAGTTC
GAAGTTA
CAAATAA
AAATAAA
TAAATAC
AAATACT
CAGCGAT AGCGATG
AAAATCA
AAATCAC
AAATCAT
CTTAGGT
TTAGGTC
GGAAAGC
CAAATGC
AAATGCA
AAATGTC
AAATGTA
AAAATTA
AAATTAC
GGAATTT
GAATTTG
AAACAAAAACAAAG
TAACAAC
AACAACC
AACAACG
AAGCGCG
AGCGCGT
AACACAG
AGAAAAA
GAAAAAA
AGACAGA
GACAGAT
CTAAATA
TGACAGTGACAGTG
CTGGTAA
TGGTAACCAACATC
AACATCC
AACATGT
TGACATTGACATTG
CGACCAA
GACCAAA
TGACCAC
GACCACA
GACCACC
CGACCAG
GACCAGG
CAACCAT
AACCATC
AACCATG
ATGGTGC
TGGTGCT
GTGGGAC
TGGGACG
CTGACGG
TGACGGG
TGACCTGGACCTGG
CCCCGCC
CCCGCCA
AAAAAATAAAAATG
CTGCTGG
TGCTGGC
TGCTGGT
TGACGAG
GACGAGA
CAACGATAACGATG
GGACGCA GACGCAA
TGACGCG
GACGCGT
GGACGCTGACGCTA
CAACGGG
AACGGGC
TAACGGT
AACGGTG
GAACGTA
AACGTAT
AAAACAGAAACAGC
ACGGTTC
CGGTTCC
CGACTAC
GACTACT
CGCAATGGCAATGG
CAGGTTG
AGGTTGC
GGACTCGGACTCGC
TAACTGCAACTGCC
GAACTGG
AACTGGT
TGACTTA
GACTTAG
GAACTTC
AACTTCT
AACTTTC
CGAGAAAGAGAAAA
GAGAAAT
GCGTAAACGTAAAT
TGCAACG
GCAACGG
AACCGGTACCGGTC
ACCACCA
CCACCAT
CCACCAC
CACCGTAACCGTAT
AAGAGTG
TAAATTA
AAATTAA
AAGCAAT
TAGTGCG
AGTGCGG
AAGCAGA
CTGCGATTGCGATG
AAGCCCG
CAAGCCG
AAGCCGC
ACACAGA
CACAGAA
GTAACGG
CGAGCTG
GAGCTGG
AGAAAAT
GTTCTGATTCTGAT
CGAACGTGAACGTC
AGAGGCAGAGGCAG
AAGGCGA
AAAGGTA
GAAGGTTAAGGTTT
TGAGTAA
GAGTAAA
AGAGTAC
GAGTACA
TGAACTG
ACCGCTG
CCGCTGA
CCTATATCTATATA
CGAGTGTGAGTGTT
CGAGTTAGAGTTAG
AAGTTCG
AATAAAA
CGAGGTA
GAGGTAA
AATACTT
ACATCAG
CATCAGT
TGATAGC
GATAGCA
GTCCTCT
TCCTCTC
CAATATAAATATAGCAATATC
AATATCA
CAATATGAATATGT
CGATATT
GATATTC
AATCACC
AAACAGTAACAGTT
GAATCCA
AATCCAC
CAATCCGAATCCGC
TTCCTGCTCCTGCG
CGATCGC
GATCGCC
ACGTATTCGTATTT
GAATCTA
AATCTAC
CAAAGCC
TAATGAA
AATGAAA
TGCGGGT
GCGGGTT
GCGGGTG
TAATGAGAATGAGA
CGATGAT
GATGATT
AATGCAG
CGATGCC
GATGCCG
CACACCAACACCAT
CTTTTTT
TTTTTTT
TGATGGC
GATGGCT
ACGGGGCCGGGGCT
AATGTCG
AATTAAA
AATTACA
GGCGCTGGCGCTGT
GAATTCCAATTCCG
AAATTCGAATTCGG
CAAAGGCAAAGGCA
TGATTGA
GATTGAA
TGATTTAGATTTAG
GGATTTC
GATTTCC
AATTTGC
AAATTTT
AATTTTA
GACAAAAACAAAAG
GGCAAAT
GCAAATG
TACAACA
ACAACAG
ACAACCA
GGCAAGC
GCAAGCC
ATGTCCT
TGTCCTG
GGCAATA
GCAATAT
AGCAATG
GCAATGC
CACAATT
ACAATTG
TACACAA ACACAAC
TCGTCGA
CGTCGAC
AACACATACACATT
AGCACCA
GCACCAC
GCACCAG
TAGCGAGAGCGAGA
CGCACCGGCACCGA
CGCACCT
GCACCTG
GACACGCACACGCC
CACCGAAACCGAAA
GTCGACCTCGACCA
ACAGAAA
CACAGAC
ACAGACA
TACAGAG
ACAGAGT
ACAGATA
AGCAGCGGCAGCGT
CGCAGCTGCAGCTG
TACAGGA
ACAGGAA
GGCAGGG
GCAGGGG
GCAGGGC
CACAGGT
ACAGGTA
TGCAGTAGCAGTAC
ACAGTGC
GACATACACATACA
GGCATAGGCATAGC
TACATCA
ACATCCA
CCGTTAT
CGTTATC
CACATGAACATGAT
GCTATATCTATATG
TGCATGG
GCATGGC
ACATGTC
CGCATTA
GCATTAG
CGCATTC
GCATTCC
GCATTCG
GGCATTGGCATTGC
TGAATCC
CGCCAAA
GCCAAAA
CACCAAC
ACCAACC
AACCAAT
ACCAATA
TACCACA
ACCACAG
CACCACC
TAAAAACCACCACT
ACCACTA
TACCAGCACCAGCA
ACCAGGA
CGAATCT
ACCATCT
ACCATGC
AACCATT
ACCATTA
TACCCAA
ACCCAAT
TACGGCCACGGCCG
CGCCCAG
GCCCAGC
TGCCCCC
GCCCCCA
GCCCCCG
GAAATGC
CACCCGCACCCGCC
ACCCGCA
TGCCCGG
GCCCGGA
AGCGATTGCGATTA
CGTCGAAGTCGAAT
CGCCCTGGCCCTGC
CAGTGCC
AGTGCCC
TGCCGAA
GCCGAAC
TACCGAGACCGAGG
GGAATGTGAATGTG
AGCCGCA
GCCGCAT
CGCCGCC
GCCGCCC
GCCGCCG
ACGGGCA
CGATCCG
GATCCGG
GTCAGCG
TCAGCGA
GGCCGGC
GCCGGCG
AGCCGGG
GCCGGGA
TGCCGGT
GCCGGTA
GCCGGTC
CGCCGTA
GCCGTAT
TACCGTCACCGTCA
TGCCGTG
GCCGTGA
GCCGTGG
TACCGTT
ACCGTTA
ACCGTTG
TTAATTGTAATTGA
TACCTCG
ACCTCGA
CAGTGGC
AGTGGCA
CACCTGA
ACCTGAC
ATCGCCA
TCGCCAT
TGCCTGGGCCTGGT
GGCGCGTGCGCGTT
CAGCGTGAGCGTGT
CACCTTTACCTTTA
GCGGGTCCGGGTCG
TCACAAT
TCGTGGT
CGTGGTC
CGTGGTT
CCCGCCG
CCGCCGT
TAAAAAA
CACAACA
GGCGAGA
GCGAGAA
GGCGAGC
GCGAGCT
AACGAGG
ACGAGGT
TGCGAGT GCGAGTG
AGCGATAGCGATAA
TAGCGCG
GCGATGA
GCGATGC
ACGGTGC
CGGTGCG
ACGCAAC
AGCGCAC
GCGCACA
AACGCAT
ACGCATT
CAAAAGCAAAAGCA
CCGTGTGCGTGTGG
CGCGCCG
GCGCCGA
TACGCGC
ACGCGCC
TTACACA
AACGCGTACGCGTG
AACGCTG
ACGCTGC
TACGGAGACGGAGT
ACGCGTA
TGCGGCA
GCGGCAA
GCGGCAG
GGCGGCCGCGGCCT
GGCGGCGGCGGCGT
GGCGGGAGCGGGAC
TGCGGGC
GCGGGCT
TACACAGACACAGT
TTTTTTC
TTTTTCG
GGCGGTA
GCGGTAC
AGCGGTCGCGGTCA
AACGGTT
ACGGTTA
GCACAGA
TGCGCTG
GCGCTGA
CATTACC
ATTACCT
ATTACCA
GGCGTGT
GCGTGTT
AACGTTA
ACGTTAC
GTACAGG
AACGTTTACGTTTT
CACTAAA
ACTAAAT
AAGTTTCAGTTTCG
GAGTTCG
AGTTCGG
TGAAAAA
ACTACTC
CTCTGCC
TCTGCCC
AGCTATGGCTATGA
TTACATCTACATCG
CCGTTGA
CGTTGAT
TACTCCAACTCCAC
TACTGAAACTGAAA
TACTCCG
ACTCCGC
AACTCCTACTCCTG
ACTCGCC
TGCAGAA
GCAGAAC
CAACATTAACATTC
AGCTGGT
GCTGGTG
TGCTGAA
GCTGAAT
GGCTGAC
GCTGACG
GCTGACC
GACTGAGACTGAGG
CGCTGAT
GCTGATT
GCTGATC
GACTGCA
ACTGCAA
CACTGCC
ACTGCCG
ACTGCCT
CGCTGCG
GCTGCGC
GCTGCGG
TTACCAC
TACCACC
CGCTGGC
GCTGGCA
ACTGGTT
TCCGCCCCCGCCCC
AACACCAACACCAC
AGCTGTTGCTGTTG
GTGCTGG
GACGGGA
ACGGGAC
ACTTAGG
CACCATC
CACCATT
CAGTTTG
AGTTTGT AGTTTGC
ACTTCTG
TCGCCGT
ACAGCGA
CAGAGTG AGAGTGT
TGCTTTA
GCTTTAC
ACTTTCG
TAAAACA
GGCTTTT
GCTTTTT
ATAAAAC
ATAAAAA
GGGAAATGGAAATA
CCACCCG
CAGAACG
GGGAAGCGGAAGCA
AGGAAGGGGAAGGG
CAAAATC
AGGAATT
CAGACAG
AGTAAAT
GTAAATT
ACCAAAG
TGCGGTG
GCGGTGC
GGGACGC
GGACGCG
GGGACTAGGACTAG
GGGACTC
GGAACCAGAACCAC
TTACCGT
CGCGTACGCGTACA
AGTAACTGTAACTT
CAGAGTA
ATGCCAATGCCAAG
TTACCTC
CAGATAA
AGATAAA
AGATAAT
CGGATAG
GGATAGC
TTACCTG
TACCTGC
CTGACGC
TGGATTA GGATTAA
GGGATTCGGATTCC
TGCCCGC
GCCCGCG
GGGATTT
TTAAAATTAAAATA
CGGCAAG
GGGCAAT
TGGCACAGGCACAA
TAGCACC
TGGCACGGGCACGA
TAGCAGC
AGCAGCT
AGGCAGG
TGGCAGTGGCAGTG
AGGCATA
GGCATAC
TAGCATC AGCATCA
CGGCCAA
GGCCAAA
TGGCCAC
GGCCACC
GTGGTCATGGTCAC
GAGCCGAAGCCGAT
TGGCCGG
CTGAGTGTGAGTGC
CGCGTGG
GCGTGGT
CGGCCTGGGCCTGT
AGGCGAG
TAGCGATTAGCGCA
CAGCGCG
CACCGCCACCGCCA
TGGCGGC
TTATCACTATCACC
CGGCGGT
CGTACAG
CGGCGTG
TGCTGAT
GCTGATG
GCCCGAACCCGAAT
TTCGCTGTCGCTGG
GGGCTCTGGCTCTT
GGGCTGA
ACCTGGTCCTGGTA
TTTACCC
TTACCCA
CAGGAAA
AGGAAAC
ACCCCGACCCCGAA
CAACGGT
TCTACCG
CTACCGT
CGGGACT
CAGGATAAGGATAG
CAGGATG
AGGATGC
CGGGATT
CGGGCAA
GGGGCAG
GGGCAGG
GGGCAGT
TAGGCAT
CGGGCCTGGGCCTT
CGGGCTG
CGGGCTT
GGGCTTT
CACCGCT
AGGGGCA
GGGGGCCGGGGCCT
CAGGGGC
CCTATTTCTATTTT
CTACTCC
CAGGTAA
GAGGTACAGGTACA
GGGGTAGGGGTAGC
TAGGTCA
AGGTCAC
TGGGTCA
GGGTCAT
GGGTCAG
GGGGTGAGGGTGAC
CAGGTGG
AGGTGGC
TATACTT
ATACTTT
TGGGTTAGGGTTAT
CAGGTTCAGGTTCT
CGGGTTG
GGGTTGC
GGTAACG
GGTAACA
GGTAACC
GAGTAAGAGTAAGT
CGGTAAT
GGTAATG
AGTACAC
AGTACAT
TTGAGGCTGAGGCA
TGGTAGC
GGTAGCG
GGTCACT
GGTCACC
CGGTCAG
GGTCAGG
GGTCATT
TGACTGC
CGGTCGA
GGTCGAA
TGGTCGTGGTCGTT
ATCGATC
TCGATCC
CCAGGAA
GCAGCTT
GTAGCGA
GGTGCGG
GGTGCTG
TAACCAA
GGTGGCC
CAGTGGG
AGTGGGT
TGGTGGT
GGTGGTT
GAGTGTC
AGTGTCT
CAGTGTGAGTGTGG
AGTGTTG
TGGTTAC
GGTTACC
TCACCACCACCACA
GGTTCCG
TGGTTCT
GGTTCTG
GGTTGCC
GGGTTGTGGTTGTC
TGGTTTC
GGTTTCA
CCAACCA
AGCAGGAGCAGGAA
GATAAAA
CATAAACATAAACG
ATTGAAA
CTCCGCT
TCCGCTG
GATAACAATAACAA
AATAACCATAACCA
GTAACGA
ATAGAAC
TAGAACA
GTAATGA
AATACAAATACAAC
GTACACA
CCATTAC
GGTACAT
GTACATC
GCCTGTGCCTGTGT
AATAGAA
ATAGCAT
ATAGCAG
CATAGCG
ATAGCGC
TATAGGCATAGGCA
GATAGTAATAGTAT
CGCAACG
CTGCCGG
AGTATACGTATACA
ATATAGG
ACGTCACCGTCACT
ATATCAG
CTCTCTG
TCTCTGC
TATATCGATATCGG
GTAACAA
ATATGTC
ATATTCT
GATATTG
ATATTGC
GTATTTT
ACTGCACCTGCACC
CATCAAC
ATCAACG
CAAAGGT
GGTCACA GTCACAA
ATCACCA
TTCTGAC
TCTGACG
TCTGACT
GTCAGGA
ATCAGTG
ATCAGTT
TGCCGAGGCCGAGA
GTCATTA
ATCCACC
ATCCACG
CCACCGT
CACCGTC
CATCCAT
ATCCATG
AGCTTTA
GATCCCGATCCCGA
ACCGTCC
ATCCGGT
CGTTCCGGTTCCGG
CGTCCTC
GTCCTGC
CGCTGCCGCTGCCC
GTCGAAA
TCACCAT
TATCGAT
CCAGCACCAGCACC
CCGTCCT
TGTCGGGGTCGGGG
AACTGCT
ACTGCTG
GATCGTAATCGTAT
TTCACCA
TCACCAA
CGTATTCGTATTCC
ATCTACC
CGTATTG
GTATTGC
GCTGGCG
TGTCTGA
GTCTGAT
ATCAGGTTCAGGTA
CATCTGG
ATCTGGT
CATCTGTATCTGTG
TGTCTTAGTCTTAG
TTTTCTG
TTTCTGC
CTTTCGT
ATGAAAA
ATGAAAG
GTTATGGTTATGGA
GTGCTGATGCTGAC
CGTGAGT
GTGAGTA
GGTGATGGTGATGG
ATGATTG
ATGCAGA
TGTCAATGTCAATC
AATGCCA
ATGCCAG
GTGCCCG
ATGCCGA
CATGCGA ATGCGAG
GTGCGGG
CATCACC
AATGGAAATGGAAA
TGTGGAT GTGGATT
GTGGCAA
GTGGCCA
CGTGGCG
GTGGCGA
ATGGCTG
GTGGGTCCATGGTG
GTGGTTC
GTCACTA
ACGGGTT
CATGTCC
ATGTCGA
TATGTCT
ATGTCTC
CACTCCG
GGGATTAGGATTAC
TGTGTGG
GTGTGGA
CGTGTTA
GTGTTAG
GTGTTGA
GATTAAAATTAAAA
ATTACAG
CGTTACC
GTTACCG
TATTACGATTACGG
CATTACTATTACTA
TGTTAGA
GTTAGAA
CATTAGC
ATTAGCA
ATTAGCG
AGCTTCT
GCTTCTG
AGTTATAGTTATAG
GTTATCG
CATTATG
ATTATGG
CGGTACA
CATTCCG
ATTCCGG
CCAGGCA
CAGGCAG
GTTCGGC
GGTTCTCGTTCTCG
TATTCTG
ATTCTGG
GTTTCGCTTTCGCT
TGTTGAA
GTTGAAG
TATTGACATTGACT
GTTGATA
GTTGCCG
TATTGCG
ATTGCGG
TATTGCT
ATTGCTG
TACTTTAACTTTAA
TTGATAT
TCCGGTC
CCGGTCG
TTACCCGTACCCGA
AGGCTGA
TGTTTACGTTTACG
GTTTCAC
ATTTCCG
AGTTTGAGTTTGAT
GATTTGCATTTGCC
GTTTGTT
ATTTTAT
CGTTTTC
GTTTTCT
TTTGCCG
TTGCCGA
TTGCCGT
TATTTTT
ATTTTTG
TTAAAAA
CCAAAAT
ACAAAGC
CCAAAGG
CCAAATA
TGGAAAG
CAACAGG
ACAACAT
TTAACCA
TAACGAG
TAACGAC
TCAACGC
CAACGCT
CAACGCC
ACAACGTCAACGTT
AATCCGG
CGCCGAT
GCCGATT
TTCCGCT
CGAGTAA
GACTTCT
CCGAACT
CGAACTT
GCGCGTG
TATCAGC
ATCAGCG
GCAATACCAATACG
TAACCGTAACCGTG
CCAATAT
AGCTTTT
GCTTTTC
CAATGCC
TTAATTATAATTAA
CCGAAGGCGAAGGT
CAATTGA
TCACAACCACAACG
GCACAAT
TTACAGA
TACAGAT
CAGAATAAGAATAC
CCACAGG
TCAGTCTCAGTCTG
GTACATATACATAA
CCACATG
CACATGG
TCGAATC
ACACCAGCACCAGC
TGTCGAT
GTCGATC
CCTGCCG
CTGCCGT
GCAGTGC
TTACCGCTACCGCT
GCAGTGG
GGATGCT
GATGCTT
TAGCGGT
AATTGAA
TTACGCG
TATTGATATTGATG
TTACGGTTACGGTG
TTACGTCTACGTCG
TCACTAA
CACTAAC
TGAAACG
GCACTATCACTATT
TACTCCC
GTTCCGA
TCACTGC
CACTGCA
ATACTGGTACTGGA
CAGAAAA
TTAGAAG
TAGAAGC
AGTCGGTGTCGGTC
CGATGGCGATGGCC
ATAGCAATAGCAAC
TTAGCAC
CCAGCCACAGCCAA
CCAGCCG
CAGCCGG
GCGACGCCGACGCC
TTAGCGG
TCGAAAA
GTGACGG
CAGCTTC
CAGGAAT
TCAGGAT
CAGGATC
CACATCA
CATTCTG
ATTCTGA
ACGACTA
ATATTACTATTACA
GCAGGTG
GCAGGTT
GCAGTGACAGTGAG
CAGTGCG
TCAGTGG
ACAGTTGCAGTTGA
TCAGTTT
ATATAAATATAAAA
CGATTCTGATTCTA
TGGCTGG
GGCTGGT
TGTCTCT
GTCTCTG
GCATCAA
TTCTGCG
TTATCGA
TGATATT
GTTCTGG
TTCTGGG
CCATCTG
TCGAGCCCGAGCCG
CCATGAA
CATGAAA
CATGAAC
ACATGACCATGACA
ACATGCACATGCAG
CCATGCG
CCATCAC
CATGGCA
CATGGCT
ACATGGT
GCATCAG
ATTCCGCTTCCGCC
CCCAACGCCAACGT
TCATTAC
CCATTAG
CCATTAT
CATTATC
GGTGCCGGTGCCGA
TCATTCT
TTATTGACGTCTTT
GTCTTTT
ATGAAAC
CCCAAAT
TCTGAAC
CTGAACT
GCGAGTACCCAATACCAATAG
GTCAATTTCAATTA
ACCACAT
TCACCAG
TCCACCC
CCACCCT
GCCACCG
GCCTTTTCCTTTTT
ACCACGA
CCACGAA
CCACGAC
CCATGGT
TTCACTG
TCTGACCCTGACCG
ACCAGAGCCAGAGG
CCCAGCC
GACATCA
GCCAGGC
TTGACTT
TTGACTC
TTGACTG
GTCATAATCATAAC
ACCATCA
CCATCAG
TCCATGA
CTCATGGTCATGGT
TTCATTC
CTCGCCG
TCGCCGC
GCCCAAA
AATGAGTATGAGTG
GCCCACGCCCACGC
CTTCTGA
CTTCTGG
CAACGCAAACGCAG
ATGGCAT
CCCCCAC
CCCCCGC
TGGCAAA
TATCCATATCCATC
ATGATGG
AATGCCC
TTCGGCG
TCGGCGG
TTCCGAC
TCCGACT
TCCGACA
CCCGCAC
CCCGCGC
GGCATCA
CCCGGAT
CCCGGAC
TTCCGGC
TCCGGCT
TTCCGTCTCCGTCG
TTATGGC
TATGGCC
TCCTGCA
AGTCACT
TTTGCCC
TTGCCCACCGAACG
TTCGAAGTCGAAGC
CTCGAAT
TTCGACC
CCGACTA
GCCGATA
CCGATAT
TCGATCG
GTTGCGGTTGCGGA
CCGATTG
CCGCACC
CCGCATT
CCGCCAA
CCGCCCA
CCGCCGC
CTGCGCT
AGGCACA
GCCGCGGCCGCGGT
CGCATGGGCATGGT
CCCCCGACCCCGAC
TTGCGGC
CCGCTGG
CCGCTGC
CCGGATA
CCGGCGT
CCGGCTG
CCGGGAT
CCGGTAA
CCGTATT
TTGCTGA
TTCGTCG
GGACACAGACACAC
CCGTGAG
CCGTGGC
GCCTGTT
GGCAGGT
TGCCGAT
ATGCTTT
TGGCAAGGGCAAGA
ACCTGCC
TCTGCGG
CTGCGGG
TGGCATC GGCATCC
GCCTCAACCTCAAC
GCTGACA
CTGACAG
TACAAAAACAAAAC
ACCTCCACCTCCAA
ATCTCCG
TCTCCGC
CCTCGAA
ACTGGCA
CTGGCAG
CCTCTCT
TCTCTGT
ATGGCCG
TTCTGAA
CCTGACA
TCTGATA
CCTGCAT
TTCTGGA
TCTGGAA
AAATTTAAATTTAT
TCTGGGA
TCTGGTA
CTCTGTG
TCTGTGT
CCTGTTT
CTAAGGTTAAGGTA
TGATTTG
TTGAAAA
TGCCCAA
GGATTTTGATTTTT
GTGACGA
GTGCCCC
GCGCAGTCGCAGTG
CTGCCCG
TGAACTT
TTGAAGATGAAGAA
GCTGGTA
TTGAAGT
CTGAATC
GCCAAATCCAAATT
TCGACCGCGACCGA
CTGACGATGACGAC
CTGGTTA
CTGACTG
AGCGCCAGCGCCAT
GGCAACG
CTGAGGC
TGAGGCG
AATGTAGATGTAGG
ATGATAATGATAAA
CTCAAATTCAAATA
CTGATAG
CTGATGG
CCGATGTCGATGTT
GAGCCGGAGCCGGT
CTGATTT
CGCACAG
TTGCGACTGCGACA
GTGTCTG
ACCAAACCCAAACG
TGGCGAG
CTGCATG
TGCCAGG
CGCCCAA
CTGCCCC
TGCCGGG
CCGCCTACGCCTAA
GCACAAA
CCGCGATCGCGATG
GCTGGTT
CTGCGGT
CCGCTCACGCTCAG
ATTGTTG
TTGTTGG
ACGCTGA
CTGGAAA
TTGCGGTTGCGGTA
ACGACCACGACCAC
CTGGCAC
CGGCTGA
CTGGCCCTGGCCCC
ACGGCGCCGGCGCG
GTTACCT
ACGGGAGCGGGAGT
ACGCCGCCGCCGCA
TTGGGGC
TGGGGCA
GCGGGTACGGGTAA
CTGGTAG
AATTAGC
ACCGATCCCGATCC
CGGTGCT
CTGGTGG
TGGTGGC
ACGGTTGCGGTTGA
CTGGTTT
GCGGCTG
GTTCGAATTCGAAA
ACGTAATCGTAATT
CTGGCGATGGCGAC
GGCAGAGGCAGAGG
GCCAAAG
CGCAGATGCAGATT
CCACTAG
CCACTAT
CAGCGGGAGCGGGC
TTTTATTTTTATTG
GTAACGTTAACGTA
TGTCTGC
GCGTCTTCGTCTTA
CAGCGTCAGCGTCC
CCGTGATCGTGATC
TGGAAGCGGAAGCG
TTGTGCCTGTGCCG
TTTCACT
AGTGAAAGTGAAAA
CTGTGTG
AATATTGATATTGG
CTGCAAC
TGCAACT
TGTTGAC
TGTTGGG
CTGTTTA
TTGTTTCTGTTTCG
CTGTTTGTGTTTGG
GTGGAAATGGAAAA
GGGTCGAGGTCGAC
TTTAACC
TACAGGT
TTGTTAG
TGCGCCG
TTTACGC
AGTGCTG
CCGGTTG
TGGCCGCGGCCGCC
TTTCGTC
TTAGCAA
TTAGCGT
TAACCACAACCACG
GTTAGTATTAGTAG
CTTATAGTTATAGC
TGTTGGCGTTGGCG
TTATCGG
GTTATGATTATGAA
TGCATCCGCATCCA
ACTCAACCTCAACA
TTTCACATTCACAG
GAGGCGT
CCTCAGACTCAGAT
TCGGATA
CTGACCA
TTTCATT
GTTCCAATTCCAAC
CGTAGCG
ACTCCCCCTCCCCG
GTCATGCTCATGCA
TTTCCGC
TTTCCTCTTCCTCG
TTTTGCC
TTTCGAC
GCCACCT
CTTCGGATTCGGAC
GTTCGGGTTCGGGC
CAACTTTAACTTTA
GTCAAAATCAAAAA
GTCATTC
AAAACAAAAACAAT
GTTGGGG
CCTGAAC
CTGATGT
AGTGCGC
CTTGCGATTGCGAG
GGCGAGT
CTGGCGG
ACTGGACCTGGACT
GCTGGATCTGGATT
ACTTTACCTTTACC
CTGGGAC
TTGGGGT
AACCAGCACCAGCT
AATTTCAATTTCAG
ATTGTGGTTGTGGC
TTTGTTG
TTGTTGA
CTTTAAATTTAAAC
CTTTAAC
ACTTAAGCTTAAGT
CCTAGGTCTAGGTC
CTTTTCTTTTTCTT
GGGACCCGGACCCG
AACAATTACAATTA
TTTTCAT
TTTTCGA
TTTTCGGTTTCGGC
AATTTGTATTTGTC
TTTTGCATTTGCAG
TGCCCACGCCCACA
GCTTGGCCTTGGCA
CCTTCTG
CCTTTAGCTTTAGC
TTTTATC
TTTTATG
CTTTTCA
TCAGTGC
TTTTTGC
TTTTTGG
CTTTTTC
"Island"
"Tip"
Error correction should remove most tips & islands; rest can be removed here, leveraging graph structure
Maternal
Paternal
GCATA
CATAT
TCGGT
CGGTA
CGGCA
GGCAT
GCGCC
CGCCG
ATGCG
TGCGC
ATATGGTATA
TATAT
GTAGT
TAGTC
GGTAT
AGTCT
TATGC
TCGGC
TCTCG
CTCGG
GTCTCGTAGTCTCGGCATATGCGCCG GTAGTCTCGGTATATGCGCCG
Maternal
Paternal
k=6
"Bubble"
Assembly alternatives
Overlap
Layout
Consensus
Error correction
De Bruijn graph
Scaffolding
Refine
Remove remaining "islands" “tips” and “bubbles” so that contigs are more obvious