24
Approaches for in silico finishing of microbial genome sequences Frederico Schmitt Kremer 1 , Alan John Alexander McBride 1 and Luciano da Silva Pinto 1 1 Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil. Abstract The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic informa- tion, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately, due to the limitations implied by the short-read sequencing platforms, most of these newly sequenced genomes re- mained as “drafts”, incomplete representations of the whole genetic content. The previous genome sequencing stud- ies indicated that finishing a genome sequenced by NGS, even bacteria, may require additional sequencing to fill the gaps, making the entire process very expensive. As such, several in silico approaches have been developed to opti- mize the genome assemblies and facilitate the finishing process. The present review aims to explore some free (open source, in many cases) tools that are available to facilitate genome finishing. Keywords: microbial genetics, molecular microbiology, genomics, microbiology, draft genomes. Received: September 25, 2016; Accepted: March 13, 2017. Introduction The advent of second generation of sequencing plat- forms, usually referred to as Next Generation Sequencing (NGS) technologies, promoted an expressive growth in the availability of genomic data in public databases, mainly due to the drastic reduction in the cost-per-base (Chain et al., 2009). Compared to the Sanger sequencing technique (Sanger et al., 1977), NGS platforms, like Illumina HiSeq, IonTorrent PGM, Roche 454 FLX, and ABI SOLiD, are able to generate a significantly higher throughput, resulting in a high sequencing coverage. However, they typically have a lower accuracy (in terms of average Phred-score in the raw data) and, in most cases, can only generate short- length reads (short-reads) (Liu et al., 2012). The term “short-read” is commonly used to refer to the data gener- ated by platforms such as Illumina, IonTorrent and SOLiD, as the length of their reads usually range from 30 bp (eg: SOLiD) to ~120 bp (e.g., Illumina HiSeq), which is smaller than the length usually obtained by Sanger sequencing (~1 kb), and by the PacBio and Oxford Nanopore platforms (commonly referred to as “long-reads”). The higher throughtput obtained by NGS has stimu- lated the development of new algorithms and tools that are capable of dealing with a larger volume of data and generat- ing genomic assemblies in a reasonable time. Traditional sequence assemblers, like Phrap (http://www.phrap.org/ phredphrap/) and CAP3 (Huang and Madan, 1999), were replaced with new ones, such as Velvet (Zerbino and Bir- ney, 2008), ABySS (Simpson et al., 2009), Ray (Boisvert et al., 2010), SPAdes (Bankevich et al., 2012) and SOAPdenovo (Luo et al., 2012), many of which were based on the De Bruijn graph algorithm (Pevzner et al., 2001; Compeau et al., 2011). A large number of these software offerings are capable of assembling data from different se- quencing platforms, called “hybrid assembly”; however, all of them exhibit some limitations and, even after several corrections and optimization steps, the generation of a fin- ished genome continues to represent a complex task (Alkan et al., 2011). Genome finishing is achieved by converting a set of contigs or scaffolds into complete sequences that represent the full genomic content of the organism without unknown regions (gaps) (Mardis et al., 2002), and aims a “closed” and full representation of the chromosome organization. On the contrary, a draft genome may be composed of a set of a few or several contigs or scaffolds (Land et al., 2015). Although useful for many studies, the unfinished and frag- mented nature of draft genomes may difficult analysis on comparative genomics and structural genomics (Ricker et al., 2012). In addition, some genes may be missed if located in a region without coverage (e.g., edges of contigs/scaf- folds), or due to assembly errors (e.g., repetitive regions that are “collapsed” into a single one) (Klassen and Currie, 2012). Genome finishing represents a relevant step to reduce data loss and lead to a more accurate representation of the genomics features of the organism of interest. The tradi- Genetics and Molecular Biology, 40, 3, 553-576 (2017) Copyright © 2017, Sociedade Brasileira de Genética. Printed in Brazil DOI: http://dx.doi.org/10.1590/1678-4685-GMB-2016-0230 Send correspondence to Frederico Schmitt Kremer. Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Campus Universitário, S/N, CEP 96160-000, Capão do Leão, RS, Brazil. E-mail: [email protected]. Review Article

Approaches for in silicofinishing of microbial genome ... · Approaches for in silicofinishing of microbial genome sequences Frederico Schmitt Kremer1, Alan John Alexander McBride1

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

  • Approaches for in silico finishing of microbial genome sequences

    Frederico Schmitt Kremer1, Alan John Alexander McBride1 and Luciano da Silva Pinto1

    1Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico,

    Universidade Federal de Pelotas, Pelotas, Brazil.

    Abstract

    The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic informa-tion, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately,due to the limitations implied by the short-read sequencing platforms, most of these newly sequenced genomes re-mained as “drafts”, incomplete representations of the whole genetic content. The previous genome sequencing stud-ies indicated that finishing a genome sequenced by NGS, even bacteria, may require additional sequencing to fill thegaps, making the entire process very expensive. As such, several in silico approaches have been developed to opti-mize the genome assemblies and facilitate the finishing process. The present review aims to explore some free(open source, in many cases) tools that are available to facilitate genome finishing.

    Keywords: microbial genetics, molecular microbiology, genomics, microbiology, draft genomes.

    Received: September 25, 2016; Accepted: March 13, 2017.

    Introduction

    The advent of second generation of sequencing plat-forms, usually referred to as Next Generation Sequencing(NGS) technologies, promoted an expressive growth in theavailability of genomic data in public databases, mainlydue to the drastic reduction in the cost-per-base (Chain etal., 2009). Compared to the Sanger sequencing technique(Sanger et al., 1977), NGS platforms, like Illumina HiSeq,IonTorrent PGM, Roche 454 FLX, and ABI SOLiD, areable to generate a significantly higher throughput, resultingin a high sequencing coverage. However, they typicallyhave a lower accuracy (in terms of average Phred-score inthe raw data) and, in most cases, can only generate short-length reads (short-reads) (Liu et al., 2012). The term“short-read” is commonly used to refer to the data gener-ated by platforms such as Illumina, IonTorrent and SOLiD,as the length of their reads usually range from 30 bp (eg:SOLiD) to ~120 bp (e.g., Illumina HiSeq), which is smallerthan the length usually obtained by Sanger sequencing (~1kb), and by the PacBio and Oxford Nanopore platforms(commonly referred to as “long-reads”).

    The higher throughtput obtained by NGS has stimu-lated the development of new algorithms and tools that arecapable of dealing with a larger volume of data and generat-ing genomic assemblies in a reasonable time. Traditionalsequence assemblers, like Phrap (http://www.phrap.org/

    phredphrap/) and CAP3 (Huang and Madan, 1999), werereplaced with new ones, such as Velvet (Zerbino and Bir-ney, 2008), ABySS (Simpson et al., 2009), Ray (Boisvert etal., 2010), SPAdes (Bankevich et al., 2012) andSOAPdenovo (Luo et al., 2012), many of which were basedon the De Bruijn graph algorithm (Pevzner et al., 2001;Compeau et al., 2011). A large number of these softwareofferings are capable of assembling data from different se-quencing platforms, called “hybrid assembly”; however, allof them exhibit some limitations and, even after severalcorrections and optimization steps, the generation of a fin-ished genome continues to represent a complex task (Alkanet al., 2011).

    Genome finishing is achieved by converting a set ofcontigs or scaffolds into complete sequences that representthe full genomic content of the organism without unknownregions (gaps) (Mardis et al., 2002), and aims a “closed”and full representation of the chromosome organization.On the contrary, a draft genome may be composed of a setof a few or several contigs or scaffolds (Land et al., 2015).Although useful for many studies, the unfinished and frag-mented nature of draft genomes may difficult analysis oncomparative genomics and structural genomics (Ricker etal., 2012). In addition, some genes may be missed if locatedin a region without coverage (e.g., edges of contigs/scaf-folds), or due to assembly errors (e.g., repetitive regionsthat are “collapsed” into a single one) (Klassen and Currie,2012).

    Genome finishing represents a relevant step to reducedata loss and lead to a more accurate representation of thegenomics features of the organism of interest. The tradi-

    Genetics and Molecular Biology, 40, 3, 553-576 (2017)Copyright © 2017, Sociedade Brasileira de Genética. Printed in BrazilDOI: http://dx.doi.org/10.1590/1678-4685-GMB-2016-0230

    Send correspondence to Frederico Schmitt Kremer. Centro deDesenvolvimento Tecnológico, Universidade Federal de Pelotas,Campus Universitário, S/N, CEP 96160-000, Capão do Leão, RS,Brazil. E-mail: [email protected].

    Review Article

  • tional way to close gapped genomes includes: (1) designprimers based on the edge of adjacent contigs, (2) PCR am-plification, (3) Sanger sequencing, and (4) local assembly,usually with (5) manual curation. As this process is verytime consuming, new in vitro approaches were developedto speed it up, including multiplex PCR (Tettelin et al.,1999), optical mapping (Latreille et al., 2007), and hybridsequencing and assembly (Ribeiro et al., 2012). This, how-ever, usually results in a drastic increase in the cost of thesequencing project. Therefore it is not surprising that fromthe 87,956 prokaryote genomes available in GenBank untilDecember 2016 (GenBank release 217), only a small frac-tion (~6,586) is finished (http://www.ncbi.nlm.nih.gov/genbank/).

    In the last years, the availability of third-generationsequencing technologies, such as PacBio SMRT and Ox-ford Nanopore, has also provided another way to achievefinished genomes (Ribeiro et al., 2012). As these platformsusually generate reads with length > 10 kb, the assembly al-gorithms have to deal with less ambiguities and problem-atic regions (Koren and Phillippy, 2015). This makes thereconstruction of the chromosome sequences easier, butjust like second generation technologies, both platformshave their own limitations. In the earlier versions of thePacBio SMRT platform, for example, it used to present ahigh error-rate, so it was recommended to correct part ofthese errors using short-reads data (e.g., Illumina) (Koren etal., 2012; Au et al., 2012; Salmela and Rivals, 2014) beforeusing its result for any analysis. Although this problem hasbeen minimized with the improvements in the sequencingchemistry and base-calling process over the last years, thetime required for the sequencing, the cost of each run, andthe price of the equipment itself are still drawbacks, andthese are some of reasons as to why it is not more frequentlyapplied. In the case of Oxford Nanopore, there is a limitednumber of tools that can be used to pre-process and analyzeits data, and some types of errors (e.g., under-represen-tation of specific k-mers) are still recurrent (Deschamps etal., 2016). Therefore, second-generation platforms are stillthe most popular ones for a wide variety of applications,and remain as the cheaper alternative to obtain genomicdata.

    Previous reviews that focused on genome assemblyand whole-genome sequencing analysis have already de-scribed some of the in silico tools that have the potential toimprove a genome assembly without the need for experi-mental data. Edwards and Holt (2013), Dark (2013) andVincent et al. (2016) have provided very comprehensivedescriptions of some of the main steps of data analysis inmicrobial genomes sequenced by next-generation technol-ogies, including de novo assembly, reference-based contigordering, genome annotation and comparative genomics,and these are useful starting points for those that are new tomicrobial genomics with NGS. Nagarajan et al. (2010)have also made some useful considerations regarding ge-

    nome assembly and presented some methodologies for ge-nome finishing, but limited to a small set of approaches thatthey have developed to improve assemblies. Ramos et al.(2013) exemplified some assembly and post-assembly me-thods for genomes sequenced with IonTorrent platforms.Finally, Hunt et al. (2014) have also made a complete andaccurate benchmark analysis of different tools for assemblyscaffolding based on paired-end/mate-pair data. However,although very instructive, these reviews provide descrip-tions of some specific procedures, but do not give an in-depth view of the different finishing strategies, nor of thealgorithms that are used in background by different toolsfrom each category. Therefore, a more complete descrip-tion of these tools may be useful, especially for those re-searchers that are starting to work with microbial genomeanalysis and want to achieve better assemblies without re-lying on re-sequencing or manual gap-closing.

    The present review aims at discussing some of thecategories of software that may be applied in the process ofassembly finishing, especially in the context of microbialgenome sequencing projects. A particular focus is placedon those that do not require additional experimental and/ornew sequencing data. The review intentionally focuses ontools that are freely available, at least for academic use, andomits proprietary software, like the CLC Genomics Work-bench (http://www.clcbio.com/), for example. This was notdue to anticipated differences in the performance or qualityof the results, but to adhere to the original intention: anoverview of tools that could be used by most researchers.As different approaches have been developed to improvegenomic assemblies, the description of these programs wasdivided into four main categories: scaffolding (placementof contigs into larger sequences by using experimental dataor information for closely-related genomes, and joiningthem by gap regions), assembly integration (generation of aconsensus assembly using multiple assemblies for the samegenome), gap closing (solving gaps by identifying theircorrect sequence), error correction (removal of misidenti-fied bases or misassembled regions) and assembly evalua-tion (quantification of the reliability of a genome assemblyand identification of its erroneous regions) (Table 1). Thesedifferent categories of programs can be combined, accord-ing to the type of data that is available (e.g., sequencingplatform and library that was used), and may help to reducethe fragmentation and improve the reliability of a genomeassembly. Based on the categories of software that are dis-cussed in the present review we have created the flowchartshowed in Figure 1, which may of help in choosing themost appropriate approach for genome finishing, depend-ing on the type of the data that is available. Examples of ge-nome projects that used some of the tools discussed in thepresent review are shown in Supplementary material (Ta-ble S1).

    554 Kremer et al.

  • Finishing microbial genomes 555

    Tab

    le1

    -O

    verv

    iew

    ofth

    eto

    ols

    desc

    ribe

    din

    the

    pres

    ent

    revi

    ew

    Cat

    egor

    yT

    ool

    Mai

    nfe

    atur

    esD

    epen

    denc

    es*

    Ref

    eren

    ceD

    ownl

    oad

    link

    /w

    ebse

    rver

    Sca

    ffol

    ding

    AB

    ySS

    -P

    aire

    d-en

    dsc

    affo

    ldin

    g.-

    Sca

    ffol

    ding

    feat

    ure

    alre

    ady

    inte

    grat

    edin

    the

    AB

    ySS

    de

    novo

    asse

    mbl

    ypi

    peli

    ne.

    -U

    ses

    the

    esti

    mat

    eddi

    stan

    ces

    gene

    rate

    dby

    the

    prog

    ram

    Dis

    tanc

    eEst

    (fro

    mth

    esa

    me

    pack

    age)

    asin

    put.

    -A

    llow

    sth

    esc

    affo

    ldin

    gus

    ing

    long

    -rea

    ds,s

    uch

    asth

    ose

    gen-

    erat

    edby

    Pac

    Bio

    and

    Oxf

    ord

    Nan

    opor

    epl

    atfo

    rms.

    boos

    tli

    brar

    ies:

    ww

    w.b

    oost

    .org

    /O

    pen

    MP

    I:ht

    tp:/

    /ww

    w.o

    pen-

    mpi

    .org

    spar

    se-h

    ash

    libr

    ary:

    http

    ://g

    oog-

    spar

    seha

    sh.s

    ourc

    efor

    ge.n

    et/

    (Sim

    pson

    etal.

    ,20

    09)

    http

    ://w

    ww

    .bcg

    sc.c

    a/pl

    at-

    form

    /bio

    info

    /sof

    twar

    e/ab

    yss

    Sca

    ffol

    ding

    Bam

    bus

    2-

    Pai

    red-

    end

    scaf

    fold

    ing.

    -C

    anbe

    easi

    lyin

    tegr

    ated

    wit

    has

    sem

    bly

    proj

    ects

    that

    are

    buil

    ton

    top

    ofth

    eA

    MO

    Spa

    ckag

    e.-

    Sup

    port

    sth

    esc

    affo

    ldin

    gof

    met

    agen

    omes

    .-

    Req

    uire

    sex

    peri

    ence

    wit

    hth

    eA

    MO

    Spa

    ckag

    ean

    dit

    sda

    tafo

    rmat

    s.

    AM

    OS

    pack

    age

    (Tre

    ange

    net

    al.

    ,20

    11):

    http

    ://a

    mos

    .sou

    rcef

    orge

    .net

    /(K

    oren

    etal.

    ,20

    11)

    http

    s://

    sour

    cefo

    rge.

    net/

    pro-

    ject

    s/am

    os/

    Sca

    ffol

    ding

    MIP

    -P

    aire

    d-en

    dsc

    affo

    ldin

    g.-

    Sup

    port

    sbo

    thpa

    ired

    -end

    and

    mat

    e-pa

    ir(l

    ong

    rang

    e)re

    ads.

    lpso

    lve

    libr

    ary:

    http

    ://s

    ourc

    efor

    ge.n

    et/p

    roje

    cts/

    lpso

    lve/

    lem

    onli

    -br

    ary:

    http

    ://l

    emon

    .cs.

    elte

    .hu/

    (Sal

    mel

    aet

    al.

    ,20

    11)

    http

    s://

    ww

    w.c

    s.he

    l-si

    nki.

    fi/u

    /lm

    salm

    el/m

    ip-s

    caff

    old

    er/

    Sca

    ffol

    ding

    OP

    ER

    A-

    Pai

    red-

    end

    scaf

    fold

    ing.

    -Id

    enti

    fies

    pote

    ntia

    lsp

    urio

    usco

    nnec

    tion

    sca

    used

    bych

    ime-

    ric

    read

    san

    dre

    peti

    tive

    geno

    mic

    sel

    emen

    tsth

    atm

    ayaf

    fect

    the

    reli

    abil

    ity

    ofth

    esc

    affo

    ldin

    g.-

    Con

    tigs

    iden

    tifi

    edas

    mis

    asse

    mbl

    edm

    aybe

    used

    inth

    eco

    n-st

    ruct

    ion

    ofm

    ore

    than

    one

    scaf

    fold

    ,but

    som

    etim

    esit

    may

    lead

    tone

    was

    sem

    bly

    erro

    rs.

    BW

    A(L

    ian

    dD

    urbi

    n20

    09):

    http

    ://b

    io-b

    wa.

    sour

    cefo

    rge.

    net/

    Bow

    tie

    (Lan

    gmea

    det

    al.

    ,20

    09):

    http

    ://b

    owti

    e-bi

    o.so

    urce

    forg

    e.ne

    t/S

    amto

    ols

    (Li

    et

    al.

    ,20

    09):

    http

    ://s

    amto

    ols.

    sour

    cefo

    rge.

    net/

    (Gao

    etal.

    ,20

    11)

    http

    s://

    sour

    cefo

    rge.

    net/

    pro-

    ject

    s/op

    eras

    f

    Sca

    ffol

    ding

    SC

    AR

    PA

    -P

    aire

    d-en

    dsc

    affo

    ldin

    g.-

    Onl

    yus

    esfo

    rsc

    affo

    ldin

    gth

    ose

    cont

    igs

    wit

    hle

    ngth

    grea

    ter

    than

    the

    N50

    ofth

    eas

    sem

    bly.

    -A

    llow

    sm

    ulti

    ple

    libr

    arie

    sto

    beus

    edin

    the

    sam

    esc

    affo

    ldin

    gpr

    ojec

    t.

    Non

    e(D

    onm

    ezan

    dB

    rudn

    o,20

    13)

    http

    ://c

    ompb

    io.c

    s.to

    -ro

    nto.

    edu/

    haps

    embl

    er/s

    carp

    a.ht

    ml

    Sca

    ffol

    ding

    SG

    A-

    Pai

    red-

    end

    scaf

    fold

    ing.

    -S

    caff

    oldi

    ngfe

    atur

    eal

    read

    yin

    tegr

    ated

    inth

    eS

    GA

    asse

    mbl

    ypi

    peli

    ne,w

    hich

    isop

    tim

    ized

    for

    Illu

    min

    ada

    taan

    dla

    rge

    geno

    mes

    .-

    Use

    sth

    ees

    tim

    ated

    dist

    ance

    sge

    nera

    ted

    byth

    epr

    ogra

    mD

    ista

    nceE

    st(f

    rom

    the

    pack

    age

    AB

    ySS

    )as

    inpu

    t,al

    ong

    wit

    hth

    ere

    adm

    appi

    ngfi

    lein

    .BA

    Mfo

    rmat

    .-

    All

    ows

    mul

    tipl

    eli

    brar

    ies

    tobe

    used

    inth

    esa

    me

    scaf

    fold

    ing

    proj

    ect.

    Bam

    tool

    s(B

    arne

    ttet

    al.

    ,20

    11):

    http

    s://

    gith

    ub.c

    om/p

    ezm

    aste

    r31/

    bam

    tool

    sB

    WA

    (Li

    and

    Dur

    bin,

    2009

    ):ht

    tp:/

    /bio

    -bw

    a.so

    urce

    forg

    e.ne

    t/S

    amto

    ols

    (Li

    et

    al.

    ,20

    09):

    http

    ://s

    amto

    ols.

    sour

    cefo

    rge.

    net/

    Spa

    rse-

    hash

    libr

    ary:

    http

    ://g

    oog-

    spar

    seha

    sh.s

    ourc

    efor

    ge.n

    et/

    (Sim

    pson

    and

    Dur

    bin,

    2012

    )ht

    tps:

    //gi

    thub

    .com

    /jts

    /sga

  • 556 Kremer et al.

    Cat

    egor

    yT

    ool

    Mai

    nfe

    atur

    esD

    epen

    denc

    es*

    Ref

    eren

    ceD

    ownl

    oad

    link

    /w

    ebse

    rver

    Sca

    ffol

    ding

    SO

    PR

    A-

    Pai

    red-

    end

    scaf

    fold

    ing.

    -D

    evel

    oped

    toim

    prov

    eth

    eas

    sem

    blie

    sge

    nera

    ted

    byV

    elve

    tan

    dS

    SA

    KE

    ,and

    requ

    ired

    the

    .AF

    Gfi

    les.

    -S

    uppo

    rts

    data

    from

    earl

    yIl

    lum

    ina

    and

    AB

    IS

    OL

    iDpl

    at-

    form

    s,in

    clud

    ing

    pair

    ed-e

    ndan

    dm

    ate-

    pair

    read

    s.-

    Isno

    tfu

    lly

    auto

    mat

    ed,s

    oit

    isne

    cess

    ary

    toru

    ndi

    ffer

    ent

    scri

    pts

    for

    each

    step

    ofth

    esc

    affo

    ldin

    g.

    Non

    e(D

    ayar

    ian

    etal.

    ,201

    0)ht

    tp:/

    /ww

    w.p

    hys-

    ics.

    rutg

    ers.

    edu/

    ~an

    irva

    ns/S

    OP

    RA

    /

    Sca

    ffol

    ding

    SS

    PA

    CE

    -P

    aire

    d-en

    dsc

    affo

    ldin

    g.-

    Tri

    ms

    the

    edge

    ofth

    eco

    ntig

    sas

    they

    are

    mor

    esu

    itab

    leto

    asse

    mbl

    yer

    rors

    .-

    Req

    uire

    sin

    form

    atio

    nab

    out

    the

    pair

    ed-e

    ndli

    brar

    y,in

    clud

    -in

    gm

    ean

    size

    ofth

    ein

    sert

    ,sta

    ndar

    dde

    viat

    ion

    and

    the

    rela

    -ti

    veor

    ient

    atio

    nof

    the

    mat

    es.

    Non

    e(B

    oetz

    eret

    al.

    ,20

    11)

    http

    ://w

    ww

    .bas

    ecle

    ar.c

    om/g

    eno

    mic

    s/bi

    oinf

    orm

    atic

    s/ba

    seto

    ols/

    Sca

    ffol

    ding

    SS

    PA

    CE

    -L

    ongR

    ead

    -P

    aire

    d-en

    dsc

    affo

    ldin

    g.-

    All

    ows

    the

    scaf

    fold

    ing

    usin

    glo

    ng-r

    eads

    ,suc

    has

    thos

    ege

    n-er

    ated

    byP

    acB

    ioan

    dO

    xfor

    dN

    anop

    ore

    plat

    form

    s.

    Non

    e(B

    oetz

    eran

    dP

    irov

    ano,

    2014

    )ht

    tp:/

    /ww

    w.b

    asec

    lear

    .com

    /gen

    om

    ics/

    bioi

    nfor

    mat

    ics/

    base

    tool

    s/

    Sca

    ffol

    ding

    MU

    Mm

    er-

    Sin

    gle

    refe

    renc

    e-ba

    sed

    scaf

    fold

    ing.

    -T

    here

    sult

    ofth

    eal

    ignm

    ent

    mus

    tbe

    post

    -pro

    cess

    edto

    ob-

    tain

    the

    scaf

    fold

    s.

    (Kur

    tzet

    al.

    ,20

    04)

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    Sca

    ffol

    ding

    AB

    AC

    AS

    -S

    ingl

    ere

    fere

    nce-

    base

    dsc

    affo

    ldin

    g.-

    Use

    ful

    whe

    nth

    ere

    fere

    nce

    and

    the

    targ

    etge

    nom

    ear

    ecl

    osel

    y-re

    late

    d,an

    dth

    ege

    nom

    eto

    besc

    affo

    lded

    isno

    tla

    rger

    than

    the

    refe

    renc

    ege

    nom

    e.-

    Not

    opti

    miz

    edfo

    rba

    cter

    iaw

    ith

    two

    orm

    ore

    repl

    icon

    s/ch

    rom

    osom

    es(e

    x:L

    epto

    spir

    age

    nus)

    .-

    All

    ows

    the

    desi

    gnof

    prim

    ers

    for

    gap-

    clos

    ing.

    MU

    Mm

    er(K

    urtz

    etal.

    ,20

    04):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    Pri

    mer

    3(K

    ores

    saar

    and

    Rem

    m,2

    007;

    Unt

    erga

    sser

    etal.

    ,20

    12):

    http

    ://p

    rim

    er3.

    ut.e

    e/

    (Ass

    efa

    etal.

    ,200

    9)ht

    tp:/

    /aba

    cas.

    sour

    cefo

    rge.

    net/

    Sca

    ffol

    ding

    CO

    NT

    IGua

    tor

    -Sin

    gle

    refe

    renc

    e-ba

    sed

    scaf

    fold

    ing.

    -Use

    ful

    whe

    nth

    eta

    rget

    geno

    me

    isco

    mpo

    sed

    bym

    ore

    than

    one

    chro

    mos

    ome

    /re

    plic

    on.

    -A

    llow

    sa

    mor

    ese

    nsit

    ive

    iden

    tifi

    cati

    onof

    synt

    enic

    re-

    gion

    s,if

    com

    pare

    dto

    AB

    AC

    AS

    ,as

    itap

    plie

    sa

    BL

    AS

    Tse

    arch

    afte

    rM

    UM

    mm

    er.

    AB

    AC

    AS

    (Ass

    efa

    etal.

    ,200

    9):

    http

    ://a

    baca

    s.so

    urce

    forg

    e.ne

    t/B

    ioP

    ytho

    n(P

    ytho

    npa

    ckag

    e):

    http

    ://b

    iopy

    thon

    .org

    /B

    LA

    ST

    +(A

    ltsc

    hul

    etal.

    ,19

    90;

    Cam

    acho

    et

    al.

    ,20

    09):

    ftp:

    //ft

    p.nc

    bi.n

    lm.n

    ih.g

    ov/b

    last

    /M

    UM

    mer

    (Kur

    tzet

    al.

    ,20

    04):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    Pri

    mer

    3(K

    ores

    saar

    and

    Rem

    m,2

    007;

    Unt

    erga

    sser

    etal.

    ,20

    12):

    http

    ://p

    rim

    er3.

    ut.e

    e/

    (Gal

    ardi

    niet

    al.

    ,20

    11)

    http

    ://c

    onti

    guat

    or.s

    ourc

    efor

    ge.n

    et/

    Tab

    le1

    -co

    ntin

    ued

  • Finishing microbial genomes 557

    Cat

    egor

    yT

    ool

    Mai

    nfe

    atur

    esD

    epen

    denc

    es*

    Ref

    eren

    ceD

    ownl

    oad

    link

    /w

    ebse

    rver

    Sca

    ffol

    ding

    Mau

    ve-

    Sin

    gle

    refe

    renc

    e-ba

    sed

    scaf

    fold

    ing.

    -C

    anbe

    used

    both

    thro

    ugh

    aco

    mm

    andl

    ine

    inte

    rfac

    e(C

    LI)

    and

    agr

    aphi

    cal

    user

    inte

    rfac

    e(G

    UI)

    .-

    All

    ows

    the

    iden

    tifi

    cati

    onof

    geno

    mic

    inve

    rsio

    nsan

    dtr

    ansl

    ocat

    ions

    .-

    Not

    opti

    miz

    edfo

    rba

    cter

    iaw

    ith

    two

    orm

    ore

    repl

    icon

    s/ch

    rom

    osom

    es.

    Java

    :ht

    tps:

    //w

    ww

    .jav

    a.co

    m/

    (Dar

    ling

    etal.

    ,200

    4;R

    issm

    anet

    al.

    ,20

    09)

    http

    ://d

    arli

    ngla

    b.or

    g/m

    auve

    /ma

    uve.

    htm

    l

    Sca

    ffol

    ding

    Fil

    lSca

    ffol

    ds-

    Sin

    gle

    refe

    renc

    e-ba

    sed

    scaf

    fold

    ing.

    -N

    otop

    tim

    ized

    for

    bact

    eria

    wit

    htw

    oor

    mor

    ere

    plic

    ons/

    chro

    mos

    omes

    .-

    Res

    ults

    may

    requ

    ire

    post

    -pro

    cess

    ing

    tore

    cons

    truc

    tth

    ese

    -qu

    ence

    ofth

    esc

    affo

    ld.

    Java

    :ht

    tps:

    //w

    ww

    .jav

    a.co

    m/

    (Muñ

    ozet

    al.

    ,20

    10)

    Sup

    plem

    enta

    ryda

    taof

    Muñ

    ozet

    al.

    (201

    0).

    http

    ://d

    x.do

    i.or

    g/10

    .118

    6/14

    71-

    2105

    -11-

    304

    Sca

    ffol

    ding

    SIS

    -S

    ingl

    ere

    fere

    nce-

    base

    dsc

    affo

    ldin

    g.-

    All

    ows

    the

    iden

    tifi

    cati

    onof

    geno

    mic

    inve

    rsio

    ns.

    -N

    otop

    tim

    ized

    for

    bact

    eria

    wit

    htw

    oor

    mor

    ere

    plic

    ons/

    chro

    mos

    omes

    .

    MU

    Mm

    er(K

    urtz

    etal.

    ,200

    4):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    (Dia

    set

    al.

    ,201

    2)ht

    tp:/

    /mar

    te.i

    c.un

    icam

    p.br

    :874

    7.

    Sca

    ffol

    ding

    CA

    R-

    Sin

    gle

    refe

    renc

    e-ba

    sed

    scaf

    fold

    ing.

    -A

    llow

    sth

    eid

    enti

    fica

    tion

    ofge

    nom

    icin

    vers

    ions

    and

    tran

    sloc

    atio

    ns.

    -A

    lso

    avai

    labl

    eas

    aw

    ebse

    rver

    .-

    Not

    opti

    miz

    edfo

    rba

    cter

    iaw

    ith

    two

    orm

    ore

    repl

    icon

    s/ch

    rom

    osom

    es.

    MU

    Mm

    er(K

    urtz

    etal.

    ,200

    4):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    PH

    P:

    http

    s://

    php.

    net/

    (Lu

    etal.

    ,20

    14)

    http

    ://g

    e-no

    me.

    cs.n

    thu.

    edu.

    tw/C

    AR

    /

    Sca

    ffol

    ding

    RA

    CA

    -M

    ulti

    ple

    refe

    renc

    e-ba

    sed

    scaf

    fold

    ing.

    -O

    ptim

    ized

    for

    larg

    ege

    nom

    esan

    dw

    ith

    mul

    tipl

    ech

    rom

    o-so

    mes

    .-

    Can

    also

    use

    pair

    ed-e

    ndda

    ta.

    Non

    e(K

    imet

    al.

    ,20

    13):

    http

    ://b

    ioen

    -com

    pbio

    .bio

    en.i

    lli-

    nois

    .edu

    /RA

    CA

    /

    Sca

    ffol

    ding

    Rag

    out

    -M

    ulti

    ple

    refe

    renc

    e-ba

    sed

    scaf

    fold

    ing.

    -U

    ses

    phyl

    ogen

    etic

    info

    rmat

    ion

    toid

    enti

    fyth

    em

    ost

    prob

    a-bl

    eor

    ient

    atio

    nof

    the

    cont

    igs

    /sc

    affo

    lds.

    Net

    wor

    kx(P

    ytho

    npa

    ckag

    e):

    http

    ://n

    etw

    orkx

    .git

    hub.

    io/

    New

    ick

    (Pyt

    hon

    pack

    age)

    :ht

    tp:/

    /ww

    w.d

    aim

    i.au

    .dk/

    ~m

    ailu

    nd/n

    ewic

    k.ht

    ml

    Sib

    elia

    :ht

    tp:/

    /git

    hub.

    com

    /bio

    inf/

    Sib

    elia

    (Kol

    mog

    orov

    et

    al.

    ,20

    14)

    http

    s://

    gith

    ub.c

    om/f

    ende

    rgla

    ss/

    Rag

    out

    Sca

    ffol

    ding

    MeD

    uSa

    -M

    ulti

    ple

    refe

    renc

    e-ba

    sed

    scaf

    fold

    ing.

    -A

    ccep

    tsbo

    thfi

    nish

    edan

    ddr

    aft

    geno

    mes

    asre

    fere

    nce.

    Bio

    Pyt

    hon

    (Pyt

    hon

    pack

    age)

    :ht

    tp:/

    /bio

    pyth

    on.o

    rg/

    Java

    :ht

    tps:

    //w

    ww

    .jav

    a.co

    m/

    MU

    Mm

    er(K

    urtz

    etal.

    ,200

    4):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    (Bos

    iet

    al.

    ,20

    15)

    http

    s://

    gith

    ub.c

    om/c

    ombo

    geno

    mic

    s/m

    edus

    a

    Ass

    embl

    yin

    -te

    grat

    ion

    Min

    imus

    -C

    anbe

    easi

    lyin

    tegr

    ated

    wit

    has

    sem

    bly

    proj

    ects

    that

    are

    buil

    ton

    top

    ofth

    eA

    MO

    Spa

    ckag

    e.-

    Req

    uire

    sex

    peri

    ence

    wit

    hth

    eA

    MO

    Spa

    ckag

    ean

    dit

    sda

    tafo

    rmat

    s.

    AM

    OS

    pack

    age

    (Tre

    ange

    net

    al.

    ,20

    11):

    http

    ://a

    mos

    .sou

    rcef

    orge

    .net

    /(S

    omm

    eret

    al.

    ,20

    07)

    http

    s://

    sour

    cefo

    rge.

    net/

    pro-

    ject

    s/am

    os/

    Tab

    le1

    -co

    ntin

    ued

  • 558 Kremer et al.

    Cat

    egor

    yT

    ool

    Mai

    nfe

    atur

    esD

    epen

    denc

    es*

    Ref

    eren

    ceD

    ownl

    oad

    link

    /w

    ebse

    rver

    Ass

    embl

    yin

    -te

    grat

    ion

    Rec

    onci

    liat

    or-

    Cor

    rect

    sth

    em

    isas

    sem

    bled

    regi

    ons

    ina

    targ

    etas

    sem

    bly

    byco

    mpa

    ring

    toan

    alte

    rnat

    ive

    asse

    mbl

    yfo

    rth

    esa

    me

    geno

    me.

    -Id

    enti

    fies

    repe

    titi

    vere

    gion

    sth

    atsu

    ffer

    edco

    mpr

    essi

    ons

    orex

    pans

    ions

    .

    MU

    Mm

    er(K

    urtz

    etal.

    ,20

    04):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    (Zim

    inet

    al.

    ,20

    08)

    http

    ://w

    ww

    .gen

    ome.

    umd.

    edu/

    Ass

    embl

    yin

    -te

    grat

    ion

    MA

    IA-

    All

    ows

    the

    inte

    grat

    ion

    oftw

    oor

    mor

    eas

    sem

    blie

    s.-

    Acc

    epts

    refe

    renc

    ege

    nom

    eto

    perf

    orm

    scaf

    fold

    ing,

    wha

    tis

    usef

    ulfo

    rth

    ose

    cont

    igs

    wit

    hout

    corr

    espo

    nden

    cein

    the

    othe

    ras

    sem

    blie

    s.

    Mat

    lab:

    http

    s://

    ww

    w.m

    athw

    orks

    .com

    /M

    UM

    mer

    :ht

    tp:/

    /mum

    mer

    .sou

    rcef

    orge

    .net

    /G

    AIM

    C(M

    atla

    bto

    olbo

    x):

    http

    ://g

    ithu

    b.co

    m/d

    glei

    ch/g

    aim

    c

    (Nij

    kam

    pet

    al.

    ,20

    10)

    http

    ://b

    ioin

    form

    atic

    s.tu

    delf

    t.nl

    Ass

    embl

    yin

    -te

    grat

    ion

    CIS

    A-

    All

    ows

    the

    inte

    grat

    ion

    ofth

    ree

    orm

    ore

    asse

    mbl

    ies.

    -C

    orre

    cts

    mis

    asse

    mbl

    edre

    gion

    san

    dco

    mpr

    esse

    d/

    expa

    nded

    repe

    ated

    regi

    ons.

    BL

    AS

    T+

    (Alt

    schu

    let

    al.

    ,199

    0;C

    amac

    hoet

    al.

    ,20

    09):

    ftp:

    //ft

    p.nc

    bi.n

    lm.n

    ih.g

    ov/b

    last

    /M

    UM

    mer

    (Kur

    tzet

    al.

    ,20

    04):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    (Lin

    and

    Lia

    o,20

    13)

    http

    ://s

    b.nh

    ri.o

    rg.t

    w/C

    ISA

    /

    Ass

    embl

    yin

    -te

    grat

    ion

    GA

    A-

    Use

    sth

    eal

    ignm

    ent

    betw

    een

    the

    diff

    eren

    tco

    ntig

    sin

    the

    set

    ofas

    sem

    blie

    sto

    gene

    rate

    anas

    sem

    bly

    grap

    h,w

    hich

    isex

    -pl

    ored

    toid

    enti

    fyto

    min

    imal

    set

    ofin

    depe

    nden

    tpa

    ths.

    BL

    AT

    (Ken

    t,20

    02):

    http

    s://

    geno

    me.

    ucsc

    .edu

    /G

    SM

    appe

    r:ht

    tp:/

    /454

    .com

    /

    (Yao

    etal.

    ,20

    12)

    http

    ://s

    ourc

    efor

    ge.n

    et/p

    ro-

    ject

    s/ga

    a-w

    ugi/

    Ass

    embl

    yin

    -te

    grat

    ion

    Mix

    -G

    ener

    ate

    anex

    tens

    ion

    grap

    hth

    atre

    pres

    ents

    the

    conn

    ecti

    onbe

    twee

    nth

    eco

    ntig

    s.-

    Fil

    ters

    the

    alig

    nmen

    tto

    redu

    ceth

    eam

    bigu

    itie

    sca

    used

    byre

    peti

    tive

    sequ

    ence

    s.

    Net

    wor

    kx(P

    ytho

    npa

    ckag

    e):

    http

    ://n

    etw

    orkx

    .lan

    l.go

    v/B

    ioP

    ytho

    n(P

    ytho

    npa

    ckag

    e):

    http

    ://b

    iopy

    thon

    .org

    /M

    UM

    mer

    (Kur

    tzet

    al.

    ,20

    04):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    (Sou

    eida

    net

    al.

    ,20

    13)

    http

    s://

    gith

    ub.c

    om/c

    bib/

    MIX

    Ass

    embl

    yin

    -te

    grat

    ion

    GA

    M/

    GA

    M-N

    GS

    -R

    equi

    res

    the

    read

    file

    sto

    perf

    orm

    the

    asse

    mbl

    yin

    tegr

    atio

    n.-

    One

    ofth

    eas

    sem

    blie

    sto

    bem

    erge

    dis

    defi

    ned

    as“m

    as-

    ter”

    ,whi

    leth

    eot

    hers

    are

    defi

    ned

    as“s

    lave

    s”.

    -A

    llow

    sth

    eid

    enti

    fica

    tion

    ofm

    isas

    sem

    bled

    regi

    ons

    inth

    em

    aste

    r,w

    hich

    are

    corr

    ecte

    dbe

    fore

    the

    gene

    rati

    onof

    the

    fina

    las

    sem

    bly.

    cmak

    e:ht

    tps:

    //cm

    ake.

    org/

    zlib

    libr

    ary:

    http

    ://w

    ww

    .zli

    b.ne

    t/bo

    ost

    libr

    arie

    s:w

    ww

    .boo

    st.o

    rg/

    spar

    se-h

    ash

    libr

    ary:

    http

    ://g

    oog-

    spar

    seha

    sh.s

    ourc

    efor

    ge.n

    et/

    (Cas

    agra

    nde

    et

    al.

    ,20

    09;

    Vic

    edom

    ini

    etal.

    ,20

    13)

    http

    s://

    gith

    ub.c

    om/v

    ice8

    7/ga

    m-

    ngs

    Ass

    embl

    yin

    -te

    grat

    ion

    Zor

    ro-

    Req

    uire

    sth

    ere

    adfi

    les

    tope

    rfor

    mth

    eas

    sem

    bly

    inte

    grat

    ion.

    -R

    emap

    sth

    ere

    ads

    back

    toth

    eco

    ntig

    san

    did

    enti

    fies

    mis

    asse

    mbl

    edan

    dre

    peti

    tive

    regi

    ons

    base

    don

    the

    cove

    rage

    .-

    Spl

    its

    the

    mis

    asse

    mbl

    edco

    ntig

    san

    dpe

    rfor

    ms

    the

    asse

    mbl

    yin

    tegr

    atio

    nus

    ing

    Min

    imus

    .

    AM

    OS

    (Tre

    ange

    net

    al.

    ,20

    11):

    http

    ://a

    mos

    .sou

    rcef

    orge

    .net

    /B

    ioP

    erl

    (Per

    lm

    odul

    e):

    http

    ://b

    iope

    rl.o

    rgB

    owti

    e(L

    angm

    ead

    etal.

    ,20

    09):

    http

    ://b

    owti

    e-bi

    o.so

    urce

    forg

    e.ne

    t/M

    UM

    mer

    (Kur

    tzet

    al.

    ,20

    04):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    (Arg

    ueso

    etal.

    ,20

    09)

    http

    ://l

    ge.i

    bi.u

    nica

    mp.

    br/z

    orro

    /

    Tab

    le1

    -co

    ntin

    ued

  • Finishing microbial genomes 559

    Cat

    egor

    yT

    ool

    Mai

    nfe

    atur

    esD

    epen

    denc

    es*

    Ref

    eren

    ceD

    ownl

    oad

    link

    /w

    ebse

    rver

    Gap

    clos

    ing

    Gap

    Clo

    ser

    -G

    ap-c

    losi

    ngfe

    atur

    eal

    read

    yin

    tegr

    ated

    inth

    eS

    OA

    Pde

    novo

    de

    novo

    asse

    mbl

    ypi

    peli

    ne-

    Per

    form

    sa

    loca

    lre

    asse

    mbl

    yin

    the

    gap

    regi

    onus

    ing

    the

    read

    slo

    cate

    din

    the

    edge

    sof

    the

    surr

    ound

    ing

    cont

    igs.

    Non

    e(L

    iet

    al.

    ,20

    10)

    http

    ://s

    oap.

    geno

    mic

    s.or

    g.cn

    /

    Gap

    clos

    ing

    IMA

    GE

    -It

    erat

    ivel

    ype

    rfor

    ms

    are

    map

    ping

    ofth

    ere

    ads

    toth

    eco

    ntig

    s,fo

    llow

    edby

    the

    sele

    ctio

    nof

    thos

    eth

    atov

    erla

    pth

    ega

    pre

    gion

    and

    alo

    cal

    reas

    sem

    bly.

    Non

    e(T

    sai

    etal.

    ,20

    10)

    http

    s://

    sour

    cefo

    rge.

    net/

    pro-

    ject

    s/im

    age2

    Gap

    clos

    ing

    Gap

    Fil

    ler

    -It

    erat

    ivel

    ype

    rfor

    ms

    are

    map

    ping

    ofth

    ere

    ads

    toth

    eco

    ntig

    s,fo

    llow

    edby

    the

    sele

    ctio

    nof

    thos

    eth

    atov

    erla

    pth

    ega

    pre

    gion

    and

    alo

    cal

    reas

    sem

    bly.

    -R

    equi

    res

    info

    rmat

    ion

    abou

    tth

    epa

    ired

    -end

    libr

    ary,

    incl

    ud-

    ing

    mea

    nsi

    zeof

    the

    inse

    rt,i

    tsst

    anda

    rdde

    viat

    ion

    and

    the

    rel-

    ativ

    eor

    ient

    atio

    nof

    the

    mat

    es.

    Non

    e(B

    oetz

    eran

    dP

    irov

    ano,

    2012

    )ht

    tp:/

    /ww

    w.b

    asec

    lear

    .com

    /gen

    om

    ics/

    bioi

    nfor

    mat

    ics/

    base

    tool

    s

    Gap

    clos

    ing

    Enl

    y-

    Iter

    ativ

    ely

    perf

    orm

    sa

    rem

    appi

    ngof

    the

    read

    sto

    the

    cont

    igs,

    foll

    owed

    byth

    ese

    lect

    ion

    ofth

    ose

    that

    over

    lap

    the

    gap

    regi

    onan

    da

    loca

    lre

    asse

    mbl

    y.-

    Ifa

    refe

    renc

    ege

    nom

    eis

    prov

    ided

    ,ane

    wsc

    affo

    ldin

    gst

    epca

    nbe

    perf

    orm

    edto

    impr

    ove

    the

    asse

    mbl

    y.

    Bio

    Pyt

    hon

    (Pyt

    hon

    pack

    age)

    :ht

    tp:/

    /bio

    pyth

    on.o

    rg/

    BL

    AS

    Tan

    dB

    LA

    ST

    +(A

    ltsc

    hul

    etal.

    ,19

    90;

    Cam

    acho

    etal.

    ,200

    9):

    ftp:

    //ft

    p.nc

    bi.n

    lm.n

    ih.g

    ov/b

    last

    /C

    dbfa

    sta/

    cdby

    ank:

    http

    ://c

    ompb

    io.d

    fci.

    harv

    ard.

    edu/

    tgi/

    soft

    war

    e/E

    MB

    OS

    S:

    http

    ://e

    mbo

    ss.s

    ourc

    efor

    ge.n

    et/

    Min

    imo

    asse

    mbl

    er(T

    rean

    gen

    etal.

    ,20

    11):

    http

    ://a

    mos

    .sou

    rcef

    orge

    .net

    /M

    UM

    mer

    (Kur

    tzet

    al.

    ,20

    04):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    Phr

    ap:

    http

    ://w

    ww

    .phr

    ap.o

    rg/p

    hred

    phra

    pcon

    sed.

    htm

    l

    (Fon

    diet

    al.

    ,201

    4)ht

    tp:/

    /enl

    y.so

    urce

    forg

    e.ne

    t/

    Gap

    clos

    ing

    FG

    AP

    -U

    ses

    alte

    rnat

    ive

    asse

    mbl

    ies

    ofth

    eta

    rget

    geno

    me

    toid

    enti

    fyre

    gion

    sth

    atov

    erla

    pth

    ega

    p.M

    atla

    b:ht

    tps:

    //w

    ww

    .mat

    hwor

    ks.c

    om/

    (Pir

    oet

    al.

    ,20

    14)

    http

    ://w

    ww

    .bio

    info

    .ufp

    r.br

    /fga

    p/

    Gap

    clos

    ing

    Sea

    ler

    -P

    erfo

    rms

    alo

    cal

    re-a

    ssem

    bly

    ofth

    ega

    pre

    gion

    sus

    ing

    dif-

    fere

    ntse

    ttin

    gsof

    k-m

    er,w

    hat

    may

    help

    inth

    eso

    lvin

    gof

    re-

    gion

    sw

    ith

    repe

    titi

    vese

    quen

    ces.

    boos

    tli

    brar

    ies:

    ww

    w.b

    oost

    .org

    /sp

    arse

    -has

    hli

    brar

    y:ht

    tp:/

    /goo

    g-sp

    arse

    hash

    .sou

    rcef

    orge

    .net

    /O

    pen

    MP

    I:ht

    tp:/

    /ww

    w.o

    pen-

    mpi

    .org

    (Pau

    lino

    etal.

    ,201

    5)ht

    tps:

    //gi

    thub

    .com

    /bcg

    sc/a

    byss

    /tre

    e/se

    aler

    -rel

    ease

    Tab

    le1

    -co

    ntin

    ued

  • 560 Kremer et al.

    Cat

    egor

    yT

    ool

    Mai

    nfe

    atur

    esD

    epen

    denc

    es*

    Ref

    eren

    ceD

    ownl

    oad

    link

    /w

    ebse

    rver

    Gap

    clos

    ing

    GM

    CL

    oser

    -M

    ayus

    ebo

    thpa

    ired

    -end

    read

    san

    dal

    tern

    ativ

    eas

    sem

    blie

    sto

    perf

    orm

    the

    gap-

    clos

    ing.

    -A

    ppli

    esa

    like

    liho

    odan

    alys

    isto

    avoi

    dth

    eef

    fect

    ofm

    isas

    sem

    blie

    sin

    the

    alte

    rnat

    ive

    asse

    mbl

    ies.

    MU

    Mm

    er(K

    urtz

    etal.

    2004

    ):ht

    tp:/

    /mum

    mer

    .sou

    rcef

    orge

    .net

    /B

    LA

    ST

    +(A

    ltsc

    hul

    etal.

    ,199

    0;C

    amac

    hoet

    al.

    ,200

    9):

    ftp:

    //ft

    p.nc

    bi.n

    lm.n

    ih.g

    ov/b

    last

    /B

    owti

    e(L

    angm

    ead

    etal.

    ,200

    9):

    http

    ://b

    owti

    e-bi

    o.so

    urce

    forg

    e.ne

    t/Y

    AS

    S(N

    oéan

    dK

    uche

    rov,

    2005

    ):ht

    tp:/

    /bio

    info

    .lif

    l.fr

    /yas

    s

    (Kos

    ugi

    etal.

    ,20

    15)

    http

    s://

    sour

    cefo

    rge.

    net/

    pro-

    ject

    s/gm

    clos

    er/

    Gap

    clos

    ing

    Map

    Rep

    eat

    -P

    erfo

    rms

    are

    fere

    nce-

    base

    dsc

    affo

    ldin

    gus

    ing

    acl

    osel

    y-re

    late

    dge

    nom

    epr

    ovid

    edby

    the

    user

    .-

    Use

    sa

    refe

    renc

    e-gu

    ided

    asse

    mbl

    yto

    perf

    orm

    the

    gap-

    clos

    ing

    proc

    ess.

    BL

    AS

    T+

    (Alt

    schu

    let

    al.

    ,19

    90;

    Cam

    acho

    et

    al.

    ,200

    9):

    ftp:

    //ft

    p.nc

    bi.n

    lm.n

    ih.g

    ov/b

    last

    /B

    ioP

    ytho

    n(P

    ytho

    npa

    ckag

    e):

    http

    ://b

    iopy

    thon

    .org

    /M

    IRA

    :ht

    tp:/

    /mir

    a-as

    sem

    bler

    .sou

    rcef

    orge

    .net

    MU

    Mm

    er(K

    urtz

    etal.

    ,200

    4):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    (Mar

    iano

    etal.

    ,201

    5)ht

    tp:/

    /git

    hub.

    com

    /dcb

    mar

    iano

    /m

    apre

    peat

    Gap

    clos

    ing

    Gap

    Bla

    ster

    -A

    llow

    sa

    man

    ual

    gap-

    clos

    ing

    usin

    gan

    alte

    rnat

    ive

    asse

    mbl

    yof

    the

    targ

    etge

    nom

    e.B

    LA

    ST

    and

    BL

    AS

    T+

    (Alt

    schu

    let

    al.

    ,199

    0;C

    amac

    hoet

    al.

    ,20

    09):

    ftp:

    //ft

    p.nc

    bi.n

    lm.n

    ih.g

    ov/b

    last

    /M

    UM

    mer

    (Kur

    tzet

    al.

    ,200

    4):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    (de

    etal.

    ,201

    6)ht

    tps:

    //so

    urce

    forg

    e.ne

    t/pr

    o-je

    cts/

    gapb

    last

    er20

    15/

    Ass

    embl

    yev

    alua

    tion

    RE

    AP

    R-

    Cal

    cula

    tes

    the

    accu

    racy

    ofth

    eas

    sem

    bly

    base

    don

    the

    cov-

    erag

    eaf

    ter

    rem

    appi

    ngth

    ere

    ads

    back

    toth

    esc

    affo

    lds.

    -M

    isas

    sem

    bled

    regi

    ons

    can

    beid

    enti

    fied

    asth

    eyus

    uall

    ypr

    esen

    ta

    disc

    repa

    ntco

    vera

    ge.

    -A

    new

    set

    ofsc

    affo

    lds

    isge

    nera

    ted

    bysp

    litt

    ing

    the

    regi

    ons

    iden

    tifi

    edas

    mis

    asse

    mbl

    ed.

    Fil

    e::B

    asen

    ame,

    Fil

    e::C

    opy,

    Fil

    e::S

    pec,

    Fil

    e::S

    pec:

    :Lin

    k,G

    etop

    t::L

    ong

    and

    Lis

    t::U

    til

    (Per

    lm

    odul

    es):

    http

    ://w

    ww

    .cpa

    n.or

    g/R

    :ht

    tps:

    //w

    ww

    .r-p

    roje

    ct.o

    rg/

    (Hun

    tet

    al.

    ,20

    13)

    http

    ://w

    ww

    .san

    ger.

    ac.u

    k/sc

    i-en

    ce/t

    ools

    /rea

    pr

    Ass

    embl

    yev

    alua

    tion

    QU

    AS

    T-

    Cal

    cula

    tese

    vera

    las

    sem

    bly

    met

    rics

    ,suc

    has

    C+

    G%

    ,N50

    and

    L50

    .-

    Can

    beus

    edto

    com

    pare

    diff

    eren

    tas

    sem

    blie

    sfo

    rth

    esa

    me

    geno

    me,

    and

    /or

    com

    pare

    then

    toa

    refe

    renc

    ege

    nom

    e.

    boos

    tli

    brar

    ies:

    ww

    w.b

    oost

    .org

    /Ja

    va:

    http

    s://

    ww

    w.j

    ava.

    com

    /M

    atpl

    otli

    b(P

    ytho

    npa

    ckag

    e):

    http

    ://m

    atpl

    otli

    b.or

    gT

    ime:

    :HiR

    es(P

    erl

    mod

    ule)

    :ht

    tp:/

    /ww

    w.c

    pan.

    org/

    (Gur

    evic

    het

    al.

    ,201

    3)ht

    tp:/

    /bio

    inf.

    spba

    u.ru

    /qua

    st

    Tab

    le1

    -co

    ntin

    ued

  • Finishing microbial genomes 561

    Cat

    egor

    yT

    ool

    Mai

    nfe

    atur

    esD

    epen

    denc

    es*

    Ref

    eren

    ceD

    ownl

    oad

    link

    /w

    ebse

    rver

    Ass

    embl

    yev

    alua

    tion

    AL

    E-

    Cal

    cula

    tes

    the

    accu

    racy

    ofth

    eas

    sem

    bly

    base

    don

    the

    k-m

    ers

    and

    C+

    G%

    dist

    ribu

    tion

    alon

    gth

    esc

    affo

    lds.

    -D

    oesn

    ’tre

    quir

    ea

    refe

    renc

    ege

    nom

    e.

    Mat

    plot

    lib

    (Pyt

    hon

    pack

    age)

    :ht

    tp:/

    /mat

    plot

    lib.

    org

    Mpm

    ath

    (Pyt

    hon

    pack

    age)

    :ht

    tp:/

    /mpm

    ath.

    org

    Num

    py(P

    ytho

    npa

    ckag

    e):

    http

    ://w

    ww

    .num

    py.o

    rgP

    ymix

    (Pyt

    hon

    pack

    age)

    :ht

    tp:/

    /ww

    w.p

    ymix

    .org

    /pym

    ixS

    etup

    tool

    s(P

    ytho

    npa

    ckag

    e):

    http

    s://

    gith

    ub.c

    om/p

    ypa/

    setu

    ptoo

    ls

    (Cla

    rket

    al.

    ,20

    13)

    http

    ://w

    ww

    .ale

    scor

    e.or

    g

    Ass

    embl

    yev

    alua

    tion

    CG

    AL

    -C

    alcu

    late

    sth

    eac

    cura

    cyof

    the

    asse

    mbl

    yba

    sed

    onth

    eco

    v-er

    age

    afte

    rre

    map

    ping

    the

    read

    sba

    ckto

    the

    scaf

    fold

    s.N

    one

    (Rah

    man

    and

    Pac

    hter

    ,201

    3)ht

    tp:/

    /bio

    .mat

    h.be

    rke-

    ley.

    edu/

    cgal

    /

    Ass

    embl

    yev

    alua

    tion

    GM

    valu

    e-

    Ali

    gns

    the

    asse

    mbl

    yto

    are

    fere

    nce

    geno

    me

    (or

    alte

    rnat

    ive

    asse

    mbl

    y)to

    iden

    tify

    mis

    asse

    mbl

    edre

    gion

    s.-

    Ane

    wse

    tof

    scaf

    fold

    sis

    gene

    rate

    dby

    spli

    ttin

    gth

    ere

    gion

    sid

    enti

    fied

    asm

    isas

    sem

    bled

    .

    MU

    Mm

    er(K

    urtz

    etal.

    ,20

    04):

    http

    ://m

    umm

    er.s

    ourc

    efor

    ge.n

    et/

    BL

    AS

    T+

    (Alt

    schu

    let

    al.

    ,19

    90;

    Cam

    acho

    et

    al.

    ,200

    9):

    ftp:

    //ft

    p.nc

    bi.n

    lm.n

    ih.g

    ov/b

    last

    /B

    owti

    e(L

    angm

    ead

    etal.

    ,200

    9):

    http

    ://b

    owti

    e-bi

    o.so

    urce

    forg

    e.ne

    t/Y

    AS

    S(N

    oéan

    dK

    uche

    rov,

    2005

    ):ht

    tp:/

    /bio

    info

    .lif

    l.fr

    /yas

    s

    (Kos

    ugi

    etal.

    ,20

    15)

    http

    s://

    sour

    cefo

    rge.

    net/

    pro-

    ject

    s/gm

    clos

    er/

    Ass

    embl

    yco

    r-re

    ctio

    niC

    OR

    N-

    Req

    uire

    spa

    ired

    -end

    read

    s.-

    Inte

    ract

    ivel

    yid

    enti

    fies

    and

    corr

    ects

    shor

    tm

    isas

    sem

    blie

    s,su

    chas

    base

    -sub

    stit

    utio

    nsan

    dsh

    ort

    IND

    EL

    s.

    SN

    P-o

    -mat

    ic(M

    ansk

    ean

    dK

    wia

    tkow

    ski,

    2009

    ):ht

    tps:

    //sn

    pom

    atic

    .svn

    .sou

    rcef

    orge

    .net

    /svn

    root

    /snp

    om

    atic

    SS

    AH

    AP

    ileu

    p(N

    ing

    etal.

    ,200

    1):

    ftp:

    //ft

    p.sa

    nger

    .ac.

    uk/p

    ub/z

    n1/s

    saha

    _pil

    eup/

    (Ott

    oet

    al.

    ,20

    10)

    http

    ://i

    corn

    .sou

    rcef

    orge

    .net

    /

    Ass

    embl

    yco

    r-re

    ctio

    nS

    EQ

    uel

    -R

    equi

    res

    pair

    ed-e

    ndre

    ads.

    -In

    tera

    ctiv

    ely

    iden

    tifi

    esan

    dco

    rrec

    tssh

    ort

    mis

    asse

    mbl

    ies,

    such

    asba

    se-s

    ubst

    itut

    ions

    and

    shor

    tIN

    DE

    Ls.

    -P

    erfo

    rms

    alo

    cal

    reas

    sem

    bly

    ofth

    em

    isas

    sem

    bled

    regi

    ons

    usin

    gin

    form

    atio

    nfr

    omk-

    mer

    san

    dpa

    ired

    -end

    read

    s.

    Java

    :ht

    tps:

    //w

    ww

    .jav

    a.co

    m/

    JGra

    phT

    (Jav

    ali

    brar

    y):

    http

    ://j

    grap

    ht.o

    rg/

    (Ron

    enet

    al.

    ,201

    2)ht

    tp:/

    /bix

    .ucs

    d.ed

    u/S

    EQ

    uel/

    Ass

    embl

    yco

    r-re

    ctio

    nG

    Fin

    ishe

    r-

    Doe

    sn’t

    requ

    ire

    pair

    ed-e

    ndre

    ads.

    -In

    tegr

    ates

    are

    fere

    nce-

    guid

    edsc

    affo

    ldin

    gst

    epan

    dga

    p-cl

    osin

    gpr

    oced

    ures

    ,alo

    ngw

    ith

    the

    asse

    mbl

    yco

    rrec

    tion

    proc

    ess.

    -Id

    enti

    fies

    mis

    asse

    mbl

    edre

    gion

    sba

    sed

    onth

    eG

    C-S

    kew

    dist

    ribu

    tion

    .

    Java

    :ht

    tps:

    //w

    ww

    .jav

    a.co

    m/

    (Gui

    zeli

    niet

    al.

    ,20

    16)

    http

    ://g

    fini

    sher

    .sou

    rcef

    orge

    .net

    /

    *=

    Con

    side

    ring

    aco

    mpu

    terr

    unni

    ngU

    NIX

    ,Lin

    uxor

    Mac

    OS

    oper

    atin

    gsy

    stem

    s(O

    Ss)

    .As

    Mak

    e,se

    d,aw

    k,G

    CC

    ,Per

    l,B

    ash,

    Pyt

    hon

    and

    the

    GN

    U/U

    nix

    stan

    dard

    util

    ity

    seta

    real

    read

    yin

    clud

    edin

    mos

    toft

    hedi

    stri

    -bu

    tion

    s/

    vers

    ions

    ofth

    ese

    OS

    s,th

    ese

    prog

    ram

    sw

    ere

    not

    list

    edas

    depe

    nden

    ces.

    Tab

    le1

    -co

    ntin

    ued

  • Scaffolding

    By definition, a contig consists of a contiguous se-

    quence has no unknown regions or assembly gaps (but may

    contain “N” that represent base-calling errors) (Staden,

    1979). On the other hand, a scaffold consists of two or morecontigs that have been joined according to some linkage in-formation (e.g., paired-end reads, genome maps) (Huson etal., 2002). Paired-end or mate-pair libraries can be veryuseful in de novo genome assembly, and several tools usethe relative position information to connect contigs intoscaffolds (Hunt et al., 2014). In a similar way, with the in-crease in the availability of genomic sequences from a widevariety of species, other scaffolding alternatives were de-veloped to use one or multiple genomes as reference to or-der the contigs.

    Paired-end scaffolding

    Most of the de novo genome assemblers usually inte-grate scaffolding steps after the contig constructions, al-though it is also possible to use third-party tools aiming amore reliable result. The A5 assembly pipeline (Tritt et al.,2012), for example, uses de novo assembler IDBA (Peng etal., 2010) to construct the contigs and SSPACE (Boetzer etal., 2011) to generate scaffolds. The scaffolding withpaired-end reads usually consist of the alignment of readsto the contigs, followed by the identification of connectionsbetween different contigs using the relative-orientation in-formation and the estimated insert-size. ABySS (Simpsonet al., 2009), SOPRA (Dayarian et al., 2010), SOAPdenovo(Li et al., 2010), Bambus 2 (Koren et al., 2011), MIP(Salmela et al., 2011), Opera (Gao et al., 2011), SSPACE(Boetzer et al., 2011), SLIQ (Roy et al., 2012), SGA (Sim-pson and Durbin 2012), SCARPA (Donmez and Brudno2013), WiseScaffolder (Farrant et al., 2015) andScaffoldScaffolder (Bodily et al., 2016) are examples ofscaffolding tools based on paired-end information. Morerecently, the use of long-reads was also incorporated intoscaffolding tools such as AHA (Bashir et al., 2012) andSSPACE-LongRead (Boetzer and Pirovano, 2014).

    ABySS (Simpson et al., 2009): The program abyss-scaffold, which comes with the ABySS assembly package(Simpson et al., 2009), uses the estimated mate-distancedistribution in paired-end reads to connect contigs and gen-erate scaffolds. The distance distribution can be calculatedby DistanceEst, that is also part of the package, and is alsoused by other assembly pipelines, such as SGA (Simpsonand Durbin, 2012). ABySS was developed to be used bothwith small and large genomes, and can be executed in acomputer clustering by using the Message Passing Inter-face (MPI), this being useful also in case of high-coveragedata and when dealing with multiple libraries. Like mostscaffolding programs, abyss-scaffold can be used also inthe scaffolding of contigs generated by third-party pro-grams. Finally, ABySS also supports scaffolding withlong-reads by using BWA-MEM for read alignment (Liand Durbin, 2009). The source code of the ABySS packageis available at the address http://www.bcgsc.ca/platform/bioinfo/software/abyss, and is developed to work on theLinux operating system.

    562 Kremer et al.

    Figure 1 - A flowchart demonstrating how and when the different genomefinishing approaches can be combined according to the data that is avail-able for the user. (a) Scaffolding using paired-end reads or long-reads,which is directly dependent on the way the genome was sequencing (plat-form, library), and sometimes performed as part of the de novo assemblyprocess. (b) Assembly integration, which consists in the combination ofdifferent de novo assemblies and generation of a consensus/extended as-sembly. Some programs use only the assemblies as input, while others usealso the sequencing reads. (c) The standard contig-ordering approachbased on a single reference genome, which consists in the identification ofsynteny blocks that guide the orientation of the contigs in the draft ge-nome, without taking into count the occurrence of genome inversionsother rearrangements. (d) The rearrangement-aware contig-ordering, thatidentifies potential sites of inversion and translocations based on signa-tures on the alignment against the reference genome. (e) The multi-ple-reference contig ordering, that may be more appropriate in those caseswhere there is no finished reference genome, but there is a relatively highnumber of close-related drafts, or when there are no apparent closest refer-ence to be used. (f) Assembly correction, which consists in the removing ofshort misassemblies, including base-substitutions and short insertions anddeletions. (g) Gap-closing, which consists in the joining of adjacentcontigs that used to be spaced by a gap. (h) Assembly evaluation, whichmay provide help to access the reliability of the assembly.

  • SOPRA (Dayarian et al., 2010): This scaffolding toolwas designed to improve assemblies generated by Velvet(Zerbino and Birney, 2008) and SSAKE (Warren et al.,2007), and targets the earlier sequencing platforms fromIllumina and ABI SOLiD. The program parses the read-placing file generated by these assemblers and extracts in-formation of paired-end/mate-pair reads, that is used to cal-culate the mean distance between pairs and the correctorientation. Based on this file, SOPRA also infers the con-nections between contigs by searches of those pairs of readswhere mates are in different contigs. The program is notfully automated, so each step of data processing must be ex-ecuted by a different script before the main scaffolding pro-cess. Another drawback is the limited support for differentde novo assemblers, as it requires read-placing files in AFGformat, and this is only produced by a few assemblers now-adays (e.g., Velvet, Ray and AMOS). SOPRA can be ob-tained from the website http://www.physics.rutgers.edu/~anirvans/SOPRA/.

    Bambus 2 (Koren et al., 2011): It is part of the AMOSpackage (Treangen et al., 2011) and is both a genome andmetagenome scaffolding tool and an updated version of theSanger-based program Bambus targeting NGS data (Pop etal., 2004). The program requires read-placing informationto construct a contig-graph, and explore the graph to findconsistent connections between the contigs. As Bambus2can also be used to scaffold metagenomic assemblies, dif-ferent from other programs, it considers the effect of DNAsamples containing mixes of closely related organisms inthe assembly processes and reduces the chance of fragmen-tation and miss-joining by analyzing the molecular vari-ants. However, the use of Bambus 2 is not as simple as forother scaffolding tools, as it requires some experience withthe AMOS tools to generate its input file and processes theoutputs (Treangen et al., 2011). The AMOS package can beobtained from the SourceForge repository:https://sourceforge.net/projects/amos/.

    MIP (Salmela et al., 2011): uses the concept of mixedinteger programming to generate a set of scaffolds from agenome assembly and a set paired-ends/mate-pair reads.First, readaligner (Mäkinen et al., 2010) is used to map theread-pairs back to the contigs. Then, the pairs are filtered toremove inconsistent connections, and the distances be-tween the contigs are estimated based on the mean distancebetween the mates, which is calculated for each library usedin the assembly. The connections in the generated scaffoldgraph have a minimum and maximum estimated length, de-rived from the library information. The MIP source code,along with usage instructions, is available athttps://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/.

    Opera (Gao et al., 2011): takes as inputs a collectionof contigs and mapped reads and generates a scaffold graphbased on the paired-end information. Frist, the program fil-ters the connections between contigs to remove possiblemiss-joining errors caused by chimeric pairs. The graph is

    contracted, and the optimum orientation of the contigsinside the scaffolds is inferred by a dynamic programmingalgorithm that explores the search space. The algorithm canalso infer the occurrence of repeated genomic regions, usu-ally assembled into a single contig in case of short-reads. Inthis case, repeated regions are identified by comparing thecoverage of the contigs to the mean coverage of the wholegenome and selection those with value greater than 1.5times the genomic mean. The identification of these regionsallows a contig to be used in more than one scaffolds, whichcan provide a better assembly of repeated regions, but canalso result in misassemblies. Opera can be obtained from itsSourceForge repository https://sourceforge.net/projects/operasf/.

    SGA (String Graph Assembler) (Simpson and Durbin2012): is a de novo genome assembler developed for thememory-efficient assembly of small and large genomes byapplying the method proposed by Myers (2005). As part ofits assembly pipeline, SGA also provide a scaffolding toolthat uses information from read alignment (in .BAM for-mat), that can be generated by a wide variety of mappingtools (Li et al., 2008; Langmead et al., 2009; Lunter andGoodson, 2011), and estimated distance between mates,generated by DistanceEst, from the ABySS package, toconnect contigs into scaffolds. SGA also supports scaffold-ing from multiple libraries, with different insert sizes, andwas optimized to work with Illumina data. SGA is availablefrom the GitHub repository https://github.com/jts/sga.

    SCARPA (Donmez and Brudno, 2013): usespaired-end information to generate scaffolds, but takes intoaccount that not only chimeric reads may be to responsiblefor inconsistent linkages between contigs but also mi-sassembled sequences. It estimates the mean and standarddeviation of the distance between the mates, but only usesinformation from those contigs with length greater than theassembly N50. The connections between the contigs are es-timated based on the mate information and the calculatedmetrics, and if more than one library is provided, SCARPAprocess the scaffolding iteratively starting from the librarywith smaller insertion size. The program can be obtainedfrom the URL http://compbio.cs.toronto.edu/hapsembler/scarpa.html.

    SSPACE (Boetzer et al., 2011) and SSPACE-LongRead (Boetzer and Pirovano, 2014): these scaffoldingprograms are currently distributed by BaseClear(http://www.baseclear.com), which also distributes thegap-closing program GapFiller (Boetzer and Pirovano,2012). SSPACE requires information about paired-end li-brary, including mean and standard deviation of distancebetween the mates and the expected orientation, whose val-ues can be predicted with the script “estimate_in-sert_size.pl”, distributed along with the program. The usermay choose between BWA (Li and Durbin, 2009) andBowtie (Langmead et al., 2009) for read mapping, the mini-mum number of connections to link two contigs, the num-

    Finishing microbial genomes 563

  • ber of bases that will be removed from the border of thecontigs (as they usually contain errors), and the number ofiterations. For SSPACE-LongRead, the target assembly isaligned to a collection of long-reads using BLASR (Chai-sson and Tesler, 2012) and the alignments are filtered andrefined to find the best orientation. Both SSPACE andSSPACE-LongRead can be requested from the BaseClearwebsite http://www.baseclear.com/genomics/bioinformatics/basetools/.

    Hunt el al. (2014) have performed an extensive com-parison of the scaffolding tools and demonstrated that thequality of the resulting scaffolds is directly affected by theread-mapping program and the complexity of the genome.For the tested datasets, the best results were obtained bySGA, SOPRA and SSPACE, although all tested tools pre-sented a certain percentage of miss-joined scaffolds in theiroutputs. Some scaffolding tools (e.g., SGA) use pre-aligned reads as input, so the user is able to test and choosethe read mapper. In this case, it is important to try differentread mappers, taking into account that platform-specificbias and read quality may have a drastic effect on the qual-ity of the alignment (Hatem et al., 2013; Caboche et al.,2014). When using mate-pair libraries (long-insert paired-

    end reads) it is also important to check if the scaffolder wasdesigned to support it, or if it was just designed for stan-dard, short-insert, paired-end reads. As mate-pairs maypresent a relatively high rate of “false mates”, special caremay be taken when working with this type of data.

    Single reference-based scaffolding

    In many cases the pairing information is not enoughto generate a reliable reconstruction of the genome’s struc-ture, or simply, the genome was not sequenced usingpaired-reads, but with single-end sequencing. In order toovercome this, some tools were developed to use a refer-ence genome as a template to perform the contig orderingand relative positioning (Figure 2). Software like MUMmer(Kurtz et al., 2004), ABACAS (Assefa et al., 2009),CONTIGuator (Galardini et al., 2011) and Mauve (Darlinget al., 2004; Rissman et al., 2009) are able to identify themost probable orientation of the contigs, but may generateincorrect results in the case of genome inversions of trans-locations. On the other hand, SIS (Dias et al., 2012), CAR(Lu et al., 2014), and FillScaffolds (Muñoz et al., 2010)consider the occurrence of changes in the genomic struc-ture and take these phenomena into account during the

    564 Kremer et al.

    Figure 2 - Reference-based contig ordering. (a) The program takes a set of contigs (or scaffolds) and (b) aligns these to a reference genome to identify themost probable relative orientation of the sequences in the draft genome. (c) Regions not covered by the contigs represent gaps and may be sequencing/as-sembling artifacts or natural deletions. Based on the relative position of each contig, a scaffold is created.

  • analysis, generating a more accurate reconstruction. Allthese tools use information from a single genome as refer-ence, however more recently, some tools, such as Ragout(Kolmogorov et al., 2014) and MeDuSa (Bosi et al., 2015),were developed to use information from multiple referencegenomes, allowing an evolutionary-based inference ofstructural re-arrangements. These multiple reference-basedcontig ordering tools will be discussed in the next section.

    MUMmer (Kurtz et al., 2004): is a genome-scale se-quence alignment tool which can be applied to perform thealignment of a set of contigs/scaffolds to a reference ge-nome, allowing a wide variety of applications in genomicanalysis and NGS data processing, including reference-guided scaffolding. The two main algorithms of theMUMmer package are NUCmer, which performs a stan-dard DNA-DNA alignment, and PROmer, which performsan alignment of the six reading frames of both sequences(leading to a more sensitive result, especially in the case ofhighly divergent organisms). The package also includesother tools, such as delta-filter, that can be used to removethe ambiguities in the alignments and select those that aremore relevant for the analysis. Many scaffolding tools, likeABACAS (Assefa et al., 2009), CONTIGuator (Galardiniet al., 2011) and MeDuSa (Bosi et al., 2015), are built ontop of MUMmer and take advatange of its performance, butalso add new features to improve the output. MUMmer it-self does not provide the sequence of the scaffold, just thepositions of the alignments. Therefore, it is necessary toperform a post-processing of the results to obtain the se-quence of the scaffolds. MUMmer can be obtained from itsSourceForge repository http://mummer.sourceforge.net/.

    ABACAS (Algorithm-based Automatic Contigua-tion of Assembled Sequences) (Assefa et al., 2009): can useNUCmer or PROmer from the MUMmer (Kurtz et al.,2004) package to align the contigs against a reference ge-nome. The regions that do not have an equivalent sequencein the contig set are filled with Ns, indicating gaps.ABACAS can also be used to design PCR primers to am-plify the unknown regions by integrating Primer3 (Ko-ressaar and Remm, 2007; Untergasser et al., 2012).ABACAS can be obtained from its SourceForge repositoryhttp://abacas.sourceforge.net/, and as part of the PAGITpackage (Swain et al., 2012), available at http://www.sang-er.ac.uk/science/tools/pagit.

    CONTIGuator (Galardini et al., 2011): usesABACAS (Assefa et al., 2009) to perform the contig order-ing, but adds support to multiple references, which may beuseful in the case of organisms that have more than onechromosome. BLAST (Altschul et al., 1990; Camacho etal., 2009) is used to align the contigs used as input with thereference sequences to identify the correct reference foreach sequence. Then, ABACAS (Assefa et al., 2009) isused, and its results are integrated with the BLAST align-ment to generate a final assembly. CONTIGuator can beobtained from its SourceForge repository

    http://contiguator.sourceforge.net/, and is also available asa webserver http://combo.dbe.unifi.it/contiguator.

    Mauve (Darling et al., 2004; Rissman et al., 2009): isan alignment tool that can handle and align multiple ge-nomes and identify regions of high similarity called Lo-cally Collinear Blocks (LCBs). One of the program’s fea-tures, Mauve Contig Mover, performs contig orderingusing the same algorithm (Rissman et al., 2009). The pro-gram runs in an iterative mode, generating and optimizingthe contig orientations based on the reference until nochange is possible that can improve the model. A directoryis generated for each iteration that contains inputs to visual-ize the genome in Mauve and a FASTA file with the sortedcontigs. The Mauve aligner can be obtained from the URLhttp://darlinglab.org/mauve/mauve.html.

    FillScaffolds (Muñoz et al., 2010): analyzes the ge-nomic distance between the contig set and a reference ge-nome and generates an ordered sequence throughidentifying orthologous genes. It considers the effects ofthe evolutionary distance in the case of missing genes, andthen uses the position of the orthologos present in the refer-ence to order the contigs. The source code of FillScaffoldsis available as a supplementary data of the Muñoz et al.(2010) paper at:http://bmcbioinformatics.biomedcentral.com/arti-cles/10.1186/1471-2105-11-304.

    SIS (Scaffolds from Inversion Signatures) (Dias etal., 2012): takes as input a set of contigs in FASTA formatand a coordinate file generated by NUCmer or PROmer(Kurtz et al., 2004) after these contigs have been alignedwith the reference sequence. Using the coordinates, theprogram searches for inversion signatures and generates acollection of orientations of the sequences that can be usedto construct the scaffolds. The source code of SIS can beobtained from the URL http://marte.ic.unicamp.br:8747.

    CAR (Contig Assembly using Rearrangements) (Luet al., 2014): uses NUCmer and PROmer in combination,unlike ABACAS (Assefa et al., 2009) and SIS (Dias et al.,2012), that use the result of only one. Based on the coordi-nates, CAR uses a block permutation model to generate thecontig order by considering not only the effect of the ge-nomic inversions, but also the occurrence of transpositions(Li et al., 2013). CAR can be used from the webserverhttp://genome.cs.nthu.edu.tw/CAR/, where the source codeis also available for download.

    Considering the main algorithm of each program, it isimportant to keep in mind that the most appropriate tool fora given task will depend on the organism and the availabil-ity of the reference genomes. ABACAS (Assefa et al.,2009) is very useful if the reference genome is larger thanthe target genome (considering the sum of the length of allcontigs, and that all contigs have a homologous region inthe reference), and the primer designing tools might behelpful in some cases; however, its sensibility decreases incases of structural divergence. In such cases, other tools,

    Finishing microbial genomes 565

  • like CONTIGuator (Galardini et al., 2011) and Mauve(Darling et al., 2004; Rissman et al., 2009), may be moreeffective. Finally, SIS (Dias et al., 2012) and CAR (Lu etal., 2014) are indicated if the draft genome may presentgenomic inversions or transpositions. For most of the appli-cations, these tools usually provide reliable results, espe-cially for organisms that do not show a very variable ge-nomic organization, and/or when there are enough finishedgenomes to properly choose the best reference. However,in some situations it may be necessary to evaluate differenttools and references to check which one provides the bestresults. Finally, a single-reference may also lead to an“overfitted” ordering, especially when the reference issmaller, or in case of genomic inversions and translo-cations.

    Multiple reference scaffolding

    Sometimes it is very difficult to identify the most ap-propriate reference genome to use for contig ordering, es-pecially when structural rearrangements are commonevents in the genus/species of interests. Additionally, whenusing BLAST to identify the most “close-related” strainfrom a database of already finished genomes, it is not usualto find different strains as best hit for each contig. Finally,there are also those cases where no finished genome isavailable, but there are draft genomes of related strains. Inthese cases it would not be appropriate to use programs thattake into account alignments against only one reference, sodata from multiple organisms should be considered. Theuse of multi-references is relatively recent and another con-sequence of the advent of NGS, as there is more draftgenomes than finished ones available in public databases.Examples of algorithms and programs that use this ap-proach are RACA (Kim et al., 2013), Ragout (Kolmogorovet al., 2014) and MeDuSa (Bosi et al., 2015).

    RACA (Reference-Assisted Chromosome Assem-bly) (Kim et al., 2013): uses local sequence alignment toidentify co-linear synteny blocks. The synteny blocks arefiltered using a length threshold, and based on the referencegenomes, the probability of each synteny block adjacent tothe others is calculated. This probability can also be com-bined with paired-end information to identify the mostprobable set of scaffold. The source code of RACA can beobtained from the URL http://bioen-compbio.bioen.illi-nois.edu/RACA/.

    Ragout (Kolmogorov et al., 2014): uses phylogeneticinformation and synteny blocks to order a set of contigsfrom a target genome using multiple genome references.First, Sibelia (Minkin et al., 2013) is used to identifysynteny blocks shared by the target and the reference se-quences. Based on the synteny, the nucleotides of the ge-nomes are represented as sequences of blocks, and the best“block orientation” is identified by a maximum parsimony,taking into account the block order in the reference ge-nomes. The source code of Ragout can be obtained from its

    repository at GitHub https://github.com/fenderglass/Ra-gout.

    MeDuSa (Multi-Draft based Scaffolder) (Bosi et al.,2015): is a graph-based scaffolder that uses informationfrom multiples references, which can be finished or draftgenomes. The program uses NUCmer to alignment the tar-get genomes to the references and construct a weightedgraph based on the alignments were the nodes of the graphare connected by identifying those contigs that aligned tothe same sequence in the references. In the next step, theorientation of each contig is assigned based on the align-ment information and the most-probable ordered identifiedin the graph. The source code of MeDuSa can be obtainedfrom the repository at GitHub https://github.com/combogenomics/medusa.

    Assembly integration

    Different assemblers, or even the same assembler ex-ecuted with different configurations, may produce differentresults. Minimum coverage, coverage cut-offs, minimumcontig length, and k-mer size are examples of just some pa-rameters that can affect the decisions of the assembler dur-ing the construction of the contigs (Baker, 2012). The waylow-quality reads are treated, or how the correct paths in theassembly graph are constructed, is also different for eachprogram. As different assemblies may present differentrepresentations of a given region in the genome, the con-struction of a “consensus assembly” can be an effectivemethod of reducing assembly errors and generating an opti-mized set of contigs. This process, which is sometimescalled “assembly reconciliation,” “assembly merging,” or“assembly integration,” can receive as input only a set ofassemblies, as implemented in Minimus (Sommer et al.,2007), Reconciliator (Zimin et al., 2008), MAIA (Nijkampet al., 2010), CISA (Lin and Liao 2013), GAA (Yao et al.,2012) and Mix (Soueidan et al., 2013), or both a set of as-semblies and the reads used for the assembly, as is the casewith GAM-NGS (Vicedomini et al., 2013) and Zorro(Argueso et al., 2009).

    Minimus (Sommer et al., 2007): is an assembly toolfrom the AMOS package (Treangen et al., 2011). Initiallyconceived to perform assembly of small genomes, it wasposteriorly adapted for assembly integration. The main al-gorithm is based on the overlap-layout-consensus paradigm(Peltola et al., 1984), which involves taking a set of se-quences and performing several alignments to identifyoverlaps. The information provided by the alignments isused to construct a graph that is minimized by a combina-tion of algorithms (Myers, 1995, 2005) to generate a finalassembly. Minimus is available as part of the AMOS pack-age, which can be obtained from the SourceForge reposi-tory https://sourceforge.net/projects/amos/.

    Reconciliator (Zimin et al., 2008): uses NUCmer,from the MUMmer package (Kurtz et al., 2004), to identifyassembly errors by comparing a template with a secondary

    566 Kremer et al.

  • assembly. With the alignments, the tool is able to identifythe regions that have possibly suffered compression or ex-pansion due to assembly errors in repetitive DNA se-quences. The source code of Reconciliator is available fromthe URL http://www.genome.umd.edu/.

    MAIA (Multiple Assembly Integrator) (Nijkamp etal., 2010): uses the overlap-layout-consensus paradigm in asimilar way to Minimus to construct a graph based on theoverlaps identified by MUMmer (Kurtz et al., 2004). Theconnections in the graph are used to construct a new assem-bly, and contigs that have no connection can be integratedwith the assembly using a reference genome as a template.MAIA was implemented on top of the Matlab program-ming language and is available as a package for it that canbe obtained from the URL http://bioinformatics.tudelft.nl.

    GAA (Graph Accordance Assembly) (Yao et al.,2012): is an assembly integration software that is based on ahomonymous data structure. Taking a set of contigs as in-put, the tool uses BLAT (Kent, 2002) to generate align-ments, identify overlaps and then generate a graph thatrepresents the connections between the contigs. GAA isavailable from the SourceForge repositoryhttp://sourceforge.net/projects/gaa-wugi/.

    CISA (Contig Integrator for Sequence Assembly)(Lin and Liao, 2013): uses a four-step algorithm to generatethe merged assembly. First, a set of representative contigsis chosen from the individual assemblies. Assembly errorsare identified by aligning all sets to one another, and any re-gions that are present in only one sequence are consideredto be erroneous. In the event of errors, the contigs are bro-ken in the incorrect portion into smaller sequences. Thethird step consists of generating several alignments usingBLAST (Altschul et al., 1990; Camacho et al., 2009), andNUCmer (Kurtz et al., 2004) to identify the optimal lengthof repetitive sequences. The information generated in thethird step is used to construct the merged assembly in the fi-nal stage of the program. CISA can be obtained from theURL http://sb.nhri.org.tw/CISA/.

    Mix (Soueidan et al., 2013): uses alignments gener-ated by NUCmer (Kurtz et al., 2004) to generate an exten-sion graph where the contigs are connected by theirborders. The alignments are filtered to remove repetitivesequences, and this information is used to generate a graph.Finally, the algorithm parses the graph to identify the Maxi-mal Independent Longest Path Set (MILPS) that representsthe final assembly. The Mix source code is available at theGitHub repository https://github.com/cbib/MIX.

    GAM (Genomic Assemblies Merger) (Casagrande etal., 2009) and GAM-NGS (Genomic Assemblies Mergerfor Next Generation Sequencing) (Vicedomini et al.,2013): GAM takes an assembly as a template, which is re-ferred to as the “master”, and extends it using one or moresets of auxiliary assemblies (called “slaves”)