SUPPLEMENTARY INFORMATION doi:10.1038/nature25473

Table of Contents

S1: Genomic DNA Isolation
    S1.1 Genomic DNA preparation
    S1.2 Mucus removal from planarian surface
    S1.3 HMW DNA isolation for PacBio sequencing and Chicago libraries
    S1.4 Post purification of DNA using CTAB
S2: PacBio Long Read Sequencing
    S2.1 Large insert libraries
S3: Assembly Pipeline
    3.1 MARVEL assembler
    3.2 MARVEL - Determining read quality
    3.3 MARVEL - Annotation of repeat regions
    3.4 MARVEL - Read correction
    3.5 MARVEL - Computational requirements
S4: Chicago/HiRise Scaffolding
    S4.1 Chicago library preparation and sequencing
    S4.2 HiRise scaffolding of PacBio-MARVEL assembly
    S4.3 Analysis of HiRise changes to PacBio-MARVEL assembly
    S4.4 Scaffold gap closure using PBJelly
S5: Assembly Polishing
    S5.1 Final Error Correction with Quiver
    S5.2 Redundancy filtering
    S5.3 Consensus polishing using Pilon
S6: RNAseq and Transcriptomes
    S6.1 RNA extraction and sequencing
    S6.2 Smed_v6 transcriptome assembly: Raw data, assembly parameters, filtering, annotations



    S6.3 Smes_v1 transcriptome assembly: Raw data, assembly parameters, filtering, annotations
S7: Assembly Quality Control
    S7.1 Transcriptome back-mapping
    S7.2 Genome completeness analysis
    S7.3 Transcript-based genome assembly quality control
    S7.4 Contamination check
S8: UCSC Genome Browser
    S8.1 GEM-Mappability track
    S8.2 RepeatMasker – Repbase track
    S8.3 RepeatMasker – RepeatModeler track
    S8.4 Alternative Scaffolds track
    S8.5 Quiver track
    S8.6 PacBio coverage track
    S8.7 Transcripts track
    S8.8 Links to PlanMine
S9: Repetitive Element Prediction
    S9.1 De novo prediction of repetitive elements
S10: LTR Analysis
    S10.1 LTR consensus construction
    S10.2 Gypsy LTR element annotation
    S10.3 The Burro family
    S10.4 LTR element expression analysis
    S10.5 LTR element divergence analysis
    S10.6 Solo/paired LTR element abundance analysis
    S10.7 Scaffold end analysis
S11: AAT-Repeat Analysis
    S11.1 AAT-microsatellite prediction and read alignment break analysis
    S11.2 Microsatellite repeat length estimation by Circular Consensus Sequencing (CCS)
    S11.3 Overlap length variation in CCS and P6/C4 reads
    S11.4 Genomic context of AT-rich regions
S12: Assembly Heterozygosity
    S12.1 Drosophila assembly
    S12.2 Dotplot analysis
    S12.3 Coverage analysis alt vs. main contigs
S13: Gene Annotation
    S13.1 Gene annotation and gene number estimate
    S13.2 Gene structure
    S13.3 Orthology analysis
S14: Divergence Analysis
    S14.1 Divergence analysis
S15: Synteny Analysis
    S15.1 Synteny comparisons
S16: Planarian Specific Gene Sequences
    S16.1 Identification of planarian-specific genes
    S16.2 Domain predictions
    S16.3 Expression analysis
S17: Gene Loss in Planarians
    S17.1 Lost genes identification and verification
    S17.2 Essentiality assessment of lost genes
    S17.3 Mad1/Mad2 multiple sequence alignments
S18: Planarian SAC Function
    S18.1 Quantification of mitotic index in dissociated cell preparations
S19: Miscellaneous Experimental Methods
    S19.1 Animal husbandry
    S19.2 Transcript cloning
    S19.3 Whole mount in situ hybridization
    S19.4 Karyotyping
Supplementary Tables
References

S1: Genomic DNA Isolation

S1.1 Genomic DNA preparation

Our genomic DNA isolation protocol improves upon a previously developed DNA isolation procedure for planarians¹, making it compatible with single-molecule real-time (SMRT) sequencing² on a PacBio RSII and with the Chicago method³. Important modifications include planarian external mucus removal, the choice of salt and alcohol for DNA precipitation enabling pigment removal, and a post-purification step that removes likely remaining mucopolysaccharides, which tend to co-isolate with gDNA.

S1.2 Mucus removal from planarian surface

This step makes use of the mucolytic agent N-acetyl-L-cysteine (NAC) to strip the animals of surface mucus, which can be excreted in copious amounts by large animals. Since NAC is acidic in aqueous solution and the highest mucolytic activity is observed between pH 7 and 9, the solution was neutralized while the pH was monitored using the common pH indicator dye phenol red⁴.

The animals were washed in 10 ml freshly prepared NAC stripping solution (0.5 w/v N-acetyl-L-cysteine, 20 mM HEPES-NaOH, pH 7.25, 5 µl phenol-red solution (0.5% w/v), pH to ~7 using 1 M NaOH) for 10-15 min at room temperature with strong agitation (e.g. rotator). The worms were rinsed briefly in distilled water prior to further processing.

S1.3 HMW DNA isolation for PacBio sequencing and Chicago libraries

All procedures were carried out at room temperature (RT) unless otherwise specified. For handling HMW DNA, only wide-bore tips were used and the sample was mixed only by careful inversion. For DNA isolation, 20 mucus-stripped animals (~1 cm; starved for 1-3 weeks) were transferred to a 50 ml tube and residual liquid was removed. The animals were lysed in 15 ml of cold GTC buffer (4 M guanidinium thiocyanate, 25 mM sodium citrate, 0.5% (w/v) N-Lauroylsarcosine, 7% v/v β-mercaptoethanol) for 30 min on ice. The tube was carefully inverted every 10 min to help dissociate the tissue. Then, 1 volume of phenol/chloroform (phenol/chloroform/isoamyl alcohol (25:24:1), 10 mM Tris (pH 8.0), 1 mM EDTA) was added to the lysate and mixed by carefully inverting the tube 10 to 15 times. Phases were separated by centrifugation for 20 min at 4,000 x g at 4 °C and the upper phase was carefully transferred to a fresh tube. The phenol/chloroform extraction was repeated 1-2 times, until the interphase disappeared. Residual phenol was removed by extracting the aqueous phase once with 1 volume of chloroform, mixing by inversion 10 to 15 times, followed by another round of centrifugation. The upper aqueous phase was transferred to a fresh tube and 1 volume of ice-cold 5 M NaCl was added and mixed well by inverting the tube several times. The tube was placed on ice for 15 min to salt out additional contaminants, which were pelleted by centrifugation for 10 min at maximum speed at 4 °C. The supernatant containing the nucleic acids was carefully decanted or pipetted off to a fresh tube without disturbing the pellet. The nucleic acids were precipitated using 0.7-1 volumes of isopropanol and centrifugation at 2,000 x g for 30-45 min at RT. The pellet was carefully washed with 70% ethanol, spun at 2,000 x g for 5 min at RT, briefly air-dried and re-suspended in 50 µl TE buffer overnight at 4 °C.

S1.4 Post purification of DNA using CTAB

All procedures were carried out at room temperature, as cetyltrimethylammonium bromide (CTAB) precipitates at temperatures lower than 15 °C. This step allowed the removal of contaminants (e.g. polysaccharides) from DNA preparations by complex formation through tuning the NaCl concentration⁵.

The isolated and re-suspended nucleic acids were treated with 1 µl of 4 mg/ml RNase A and incubated for 1 h at 37 °C to remove RNA. The NaCl concentration was adjusted by adding 0.3 volumes of 5 M NaCl and the volume was adjusted to 500 µl with 2% CTAB (w/v) / 1.4 M NaCl solution. After careful and thorough mixing by inversion, 1 vol of chloroform was added and carefully mixed by inversion until the sample turned milky. The phases were separated by centrifugation for 15 min at 12,000–16,000 x g at RT. The upper clear phase was recovered without disturbing the white interphase using a wide-bore pipette tip and transferred to a new tube. The DNA was precipitated using 0.7-1 volumes of isopropanol and centrifugation at 2,000 x g for 30-45 min at RT. The pellet was carefully washed with 70% ethanol, spun at 2,000 x g for 5 min at RT, briefly air-dried and re-suspended in 50 µl TE buffer overnight at 4 °C. Quality of the DNA was checked by agarose gel electrophoresis and the DNA was quantified using the Qubit™ fluorometer (dsDNA BR assay).

S2: PacBio Long Read Sequencing

Quality control of the high molecular weight genomic DNA was performed by spectrophotometric determination of 260/280 and 260/230 ratios using the Nanodrop device (Thermo Scientific). The 260/280 ratios were in the range of 1.66 to 2.08 and the 260/230 ratios between 1.84 and 2.01. The concentration was determined in triplicate via a Qubit™ fluorometer with the high sensitivity standard. Integrity of the HMW gDNA was determined by pulsed-field gel electrophoresis using the Pippin Pulse™ device (Sage Science). Before starting any PacBio library preparation, planarian gDNA samples were purified with 1X AMPure XP beads. Note that the AMPure method cannot recover planarian DNA out of crude lysates, possibly due to resorptive competition of the abundant contaminants.

S2.1 Large insert libraries

To prepare large insert libraries, AMPure-purified genomic DNA was fragmented to sizes of 12-30 kbp with the Megaruptor device (Diagenode). For shearing, either 20 kbp shearing settings for medium insert SMRT bell libraries or 35 kbp shearing settings for large insert SMRT bell libraries were used. PacBio SMRT bell libraries were prepared as described in the company-supplied “Procedure and Checklist – 20 kbp Template Preparation Using BluePippin™ Size-Selection System”. SMRT bell libraries were size selected with the BluePippin™ system (Sage Science) with a minimum fragment length cutoff of 7.5 and 10 kbp for medium insert libraries and of 15 and 17.5 kbp for large insert libraries.

PacBio RSII SMRT cells were loaded with SMRT bell libraries after primer annealing and polymerase binding using MagBead loading. A total of 197 SMRT cells loaded with the medium insert libraries were sequenced on the PacBio RSII instrument using the P4 polymerase and the C2 sequencing chemistry (P4/C2). In addition, a total of 38 SMRT cells loaded with large insert libraries were sequenced on the RSII instrument with the P6 polymerase and C4 sequencing chemistry (P6/C4). Movie times for all P4/C2 SMRT cells were 240 minutes. Movie times for P6/C4 SMRT cells were 240 minutes (20 SMRT cells) and 360 minutes (23 SMRT cells).


S3: Assembly Pipeline

3.1 MARVEL assembler

At the time of the initial sequencing of Smed with the P4/C2 chemistry, the standard workflow for assembling noisy long reads from PacBio was to clean the reads, either with Illumina sequencing data or by reducing the PacBio native error rate of 12-15% to under 1% by high coverage sequencing (PacBioToCA). After this correction step, the original Celera assembler⁶ was used to assemble the long reads. However, initial cleaning of the reads can have one of two disadvantages, especially in a genome with a high repeat content such as Smed. Either the repeats within the reads are not cleaned and the assembly of such regions by the Celera assembler is hindered due to an error rate higher than expected (< 1%), or the repeats are incorrectly cleaned due to the ambiguous character of overlaps between repeats. Both cases lead to incorrect and fragmented assemblies with possible misjoins. Therefore, MARVEL was developed to deal with notoriously complex genomes that contain a high fraction of repetitive genomic information and are sequenced with noisy long-read technologies (Novojilov et al., in preparation). For comparisons with the Canu assembler⁷ (see Table 1, main text), we used Canu v1.3 with standard parameters and the same uncorrected PacBio reads as for the MARVEL assembly.

The MARVEL assembler currently consists of three major phases, namely the setup phase, the patch phase and the assembly phase. In the setup phase, depending on the sequencing technology, the reads are extracted from the raw sequencing data and saved in an internal database. The patch phase detects and corrects read artefacts including untrimmed adapters, polymerase strand jumps, and ligation chimeras that are the primary impediments to long contiguous assemblies; the patched reads are then used for the final assembly phase. The assembly phase stitches short alignment artefacts resulting from bad sequencing segments within overlapping read pairs. The default parameter is generally 50 bp in length, but for Smed 300 bp were used. This step is followed by repeat annotation and the generation of the overlap graph, which is subsequently toured in order to generate the final contigs.


A genome with a skewed GC content, such as the Smed genome, has a different k-mer distribution than those with a uniform distribution of the GC content. This can be used to weigh certain k-mers and their expected distribution within the genome. Therefore, the AT bias option (-b) was used in DALIGNER⁸ (the noisy read aligner used in MARVEL to find the overlaps between reads) as well as in DBdust, to avoid masking AT-rich regions as low complexity and thereby excluding them from the k-mer seeding.
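To illustrate why a skewed base composition shifts the k-mer distribution, the following minimal Python sketch counts overlapping k-mers and reports the fraction that consists only of A and T. It is not DALIGNER or DBdust code; the function names `kmer_counts` and `at_only_fraction` and the toy sequence are assumptions made purely for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping k-mers in an uppercase DNA string."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def at_only_fraction(counts):
    """Fraction of k-mer occurrences made up solely of A and T."""
    total = sum(counts.values())
    at_only = sum(n for kmer, n in counts.items() if set(kmer) <= {"A", "T"})
    return at_only / total if total else 0.0


# Hypothetical toy sequence; an AT-rich genome yields a high AT-only fraction,
# so such k-mers are far more frequent than a uniform base model would predict.
toy_sequence = "AATTAACG"
print(at_only_fraction(kmer_counts(toy_sequence, 2)))
```

In an AT-rich genome a seeding step that treats all k-mers as equally likely would flag many genuine AT-rich seeds as uninformative, which is the situation a composition-aware option is meant to avoid.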

In addition, alignments were forced through variable AT microsatellites (< 300 bp) despite often significant length discrepancies and non-uniformity of their instances in the reads. This allowed for a more contiguous assembly. The criteria for repeat module identification in overlaps were relaxed, allowing for better recognition of repeat modules and therefore resulting in improved resolution of repetitive elements.

Since the Smed dataset combined sequencing data generated by two different chemistries (P4/C2 and P6/C4), each chemistry batch was patched individually prior to combining the datasets as input for the assembly phase. In the assembly phase, the overlap of the combined dataset was calculated and assembled. After all tours were calculated, the contigs were corrected and a consensus sequence was generated by remapping the patched reads.

3.2 MARVEL - Determining read quality

In order to determine the quality of a read, the consensus of the aligned supporting reads was used. MARVEL does not determine the quality of individual bases, but rather the quality of the subread within the trace points (usually consisting of 100 bp). The end result of the process is the determination of a consensus base supported by the aligned read overlaps between the trace points.

3.3 MARVEL - Annotation of repeat regions

The repeat regions were initially annotated by the masking server and later during the scrubbing phase by DBrepeat. While both use the alignments to determine repeat regions, the masking server annotates the repeat region dynamically, and once the repeat coverage threshold is exceeded (in most cases twice the sequencing coverage) the region is masked and not used for subsequent k-mer seeding.

DBrepeat runs sequentially over the reads to annotate the repeat regions. Once a base with a coverage that exceeds the repeat coverage is identified, a repeat region is opened. A repeat region is closed once a base at or below the non-repeat coverage is found. The minimum overlap length of 1 kbp results in a complete lack of repeat annotation for the 1 kbp regions at the start and end of the reads. DBhomogenize can be used to transitively transfer an existing repeat annotation based on the computed alignments, thereby properly annotating the tips of the reads.
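The open/close rule described for DBrepeat amounts to thresholding with two coverage levels. The following Python sketch shows one way to express that rule on a per-base coverage array. It is an illustration only, not the DBrepeat implementation; the function name `annotate_repeats`, its parameters and the toy coverage values are assumptions.

```python
def annotate_repeats(coverage, repeat_cov, non_repeat_cov):
    """Return half-open (start, end) intervals of putative repeat regions.

    A region opens at the first base whose coverage exceeds repeat_cov and
    closes at the next base whose coverage is at or below non_repeat_cov.
    """
    regions = []
    start = None  # start of the currently open region, if any
    for pos, cov in enumerate(coverage):
        if start is None:
            if cov > repeat_cov:
                start = pos
        elif cov <= non_repeat_cov:
            regions.append((start, pos))
            start = None
    if start is not None:
        # Region still open at the end of the read
        regions.append((start, len(coverage)))
    return regions


# Hypothetical per-base coverage of a single read
toy_coverage = [10, 10, 45, 50, 30, 25, 20, 10, 60, 60]
print(annotate_repeats(toy_coverage, repeat_cov=40, non_repeat_cov=20))
# [(2, 6), (8, 10)]
```

Using two thresholds instead of one keeps a region open while the coverage fluctuates between the two levels, so a single repeat is not split into many short annotations.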

3.4 MARVEL - Read correction

Despite the noisy nature of PacBio reads, MARVEL does not clean the reads to an error rate of less than 1%, as is common practice in other assemblers when dealing with noisy long reads. The reads are first patched after the initial overlap phase, removing the major sequencing-related insertions/deletions (indels), chimeras and additional artefacts, which is necessary for quality overlaps that will be used for the assembly. After a second round of overlaps, MARVEL improves the alignments that are corrupt due to short sequencing errors. Finally, Quiver can be used after the assembly phase in order to generate a high-quality consensus sequence, but this often results in long run times, especially with large genomes. This can be improved by circumventing the use of BLASR⁹ by transforming the previously calculated read overlaps into the position of the reads within the contigs.

3.5 MARVEL - Computational requirements

For anything beyond bacterial samples, which can be assembled rather quickly on a standard laptop, a cluster environment with sufficient CPU and storage is recommended. We recommend having at least 8 GB of RAM per CPU core for the cluster nodes and a fast interconnect to the cluster-attached storage. RAM per core demand can be lessened by further increasing the partitioning of the data into blocks, at the price of increased IO demand. Example CPU times can be found in Supplementary Table 1 and have been measured on our Sandy Bridge and Haswell architecture based cluster. Storage requirements are highly dependent on coverage and the repetitiveness of the genome. Also, storage requirements for a single phase of MARVEL can be found in Supplementary Table 1. For peak storage requirements, this number needs to be multiplied by two.

S4: Chicago/HiRise Scaffolding

S4.1 Chicago library preparation and sequencing

A Chicago library was prepared as described previously³. Briefly, ~500 ng of HMW gDNA (> 150 kbp mean fragment size) was reconstituted into chromatin in vitro and fixed with formaldehyde. Fixed chromatin was then digested with DpnII, the 5’ overhangs were filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, crosslinks were reversed and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was sheared to a mean fragment size of ~350 bp, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were then isolated using streptavidin beads before PCR enrichment of the library. The libraries were sequenced on an Illumina HiSeq 2500 in rapid run mode to produce 286 million 2x100 bp read pairs, amounting to 93x physical coverage (1-50 kbp spanning pairs) of the Smed genome.
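Physical coverage is commonly defined as the summed genomic span of the read pairs divided by the genome size, here restricted to pairs spanning 1-50 kbp. The short Python sketch below shows that calculation under this assumed definition. It is not the calculation script used for the value reported above; the function name, parameter names and toy spans are hypothetical.

```python
def physical_coverage(pair_spans_bp, genome_size_bp,
                      min_span_bp=1_000, max_span_bp=50_000):
    """Summed span of read pairs within the span window, divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    # Keep only pairs whose mapped span lies inside the window (inclusive)
    kept = [s for s in pair_spans_bp if min_span_bp <= s <= max_span_bp]
    return sum(kept) / genome_size_bp


# Hypothetical spans: 350 bp and 75 kbp fall outside the 1-50 kbp window
toy_spans = [350, 1_000, 12_000, 50_000, 75_000]
print(physical_coverage(toy_spans, genome_size_bp=63_000))  # 1.0
```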

S4.2 HiRise scaffolding of PacBio-MARVEL assembly

A draft genome assembly of Smed (PacBio-MARVEL, 782.1 Mbp with a contig N50 of 708.7 kbp), Illumina shotgun sequence data (Illumina HiSeq 2500 rapid run mode, 2x150 bp read pairs, SRR5408395), and Chicago library read pairs in FASTQ format were used as input data for HiRise, a software pipeline designed specifically for using Chicago data for error-correcting and scaffolding genome assemblies³. Shotgun and Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu/). The separations of Chicago read pairs mapped within draft scaffolds were analysed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify putative misjoins and score prospective joins. After scaffolding, shotgun sequences were used to close gaps between contigs.
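The contig N50 quoted for the draft assembly is the length of the contig at which, going from the longest contig downwards, the cumulative length first reaches half of the total assembly size. A minimal Python sketch of this standard statistic follows; the function name `n50` and the toy lengths are assumptions and do not correspond to any tool used here.

```python
def n50(lengths):
    """Length L such that contigs of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths: total is 20, and 6 + 5 = 11 covers half of it
print(n50([2, 3, 4, 5, 6]))  # 5
```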


S4.3 Analysis of HiRise changes to PacBio-MARVEL assembly

MARVEL raw contigs were mapped onto HiRise scaffolds with DALIGNER to find the exact break positions within the raw contigs. Then, each position was visually inspected using the layout of the toured overlap graph (similar to Figure 2e, left graph, main text), aided by the direct visualization of read overlaps with the MARVEL tool LAexplorer.

S4.4 Scaffold gap closure using PBJelly

Following HiRise scaffolding, we used PBJelly¹⁰ to fill or reduce as many scaffolding gaps as possible. All 632 HiRise scaffolds were used as input for PBJelly, as well as all patched P4/C2 and P6/C4 chemistry reads longer than 8 kbp. The 632 scaffolds contained 1,256 gaps of 100 bp (HiRise default gap size). PBJelly completely closed 482 gaps, partially closed 176 gaps, and overfilled 176 gaps, meaning that the gap was actually larger than 100 bp. In addition, PBJelly joined 32 scaffolds into bigger scaffolds, reducing the overall scaffold number to 600. Overlap analysis between the scaffolds using DALIGNER (see below) showed that 119 PBJelly scaffolds were almost completely contained within other scaffolds and consisted of >95% repeat bases. These were therefore moved into the alternative contig set (_alt, see below), leaving 481 scaffolds as the final haploid dd_Smed_g4 assembly.
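The scaffold counts reported in this section can be retraced with simple bookkeeping. The short Python sketch below assumes that each of the 32 joined scaffolds lowers the scaffold count by one (which matches the reported total of 600); the variable names are illustrative and not part of any tool.

```python
# Scaffold bookkeeping for S4.4 (illustrative variable names).
hirise_scaffolds = 632      # HiRise scaffolds used as PBJelly input
joined_by_pbjelly = 32      # scaffolds merged into bigger scaffolds (assumed: -1 each)
moved_to_alt = 119          # contained, repeat-rich scaffolds moved to the _alt set

after_pbjelly = hirise_scaffolds - joined_by_pbjelly
final_haploid = after_pbjelly - moved_to_alt

print(after_pbjelly, final_haploid)
```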

S5: Assembly Polishing

S5.1 Final Error Correction with Quiver

The final consensus sequence was generated first by a round of the PacBio tool Quiver. Due to the runtime of BLASR⁸, which is used by Quiver, only reads that were part of the overlap graph of each contig (i.e. the true non-repeat induced overlaps) were used to polish the contig sequences. This restriction was motivated mostly by the AT-bias and high repeat content of the Smed genome. In order to enrich the coverage and improve the final polished sequence, additional subreads were extracted from the Zero Mode Waveguides and used for polishing (if available). This brought the coverage to approximately 40x.

S5.2 Redundancy filtering

DALIGNER was used to identify the haplotype content of the Smed PacBio-MARVEL assembly, by mapping of the contigs against each other. If at least 70% of the bases in the two contigs as well as 70% of the reads used to build the contigs were part of both contigs, they were considered alternate haplotypes and separated from the main assembly into a second set of alternative contigs, designated by a _alt suffix to the contig name. For contigs shorter than 100 kbp that terminated in long repetitive sequences, these thresholds were reduced to 50%. This removed 1,509 alternative contigs with a total length of 168.1 Mbp (note that this step was performed prior to scaffolding).
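The threshold rule above can be written as a small decision function. This is a minimal sketch, not the MARVEL implementation: it assumes that the shared-base and shared-read fractions have already been derived from the DALIGNER contig-versus-contig overlaps, and the function and parameter names are hypothetical.

```python
def is_alternate_haplotype(shared_base_frac, shared_read_frac,
                           contig_length_bp, ends_in_long_repeat):
    """Apply the 70% / 50% redundancy thresholds described in S5.2.

    shared_base_frac and shared_read_frac are fractions in [0, 1] that are
    assumed to have been computed upstream from the contig overlaps.
    """
    threshold = 0.70
    # Relaxed threshold for short contigs that terminate in long repeats.
    if contig_length_bp < 100_000 and ends_in_long_repeat:
        threshold = 0.50
    return shared_base_frac >= threshold and shared_read_frac >= threshold
```

For example, a 80 kbp contig ending in a long repeat with 60% shared bases and reads would be moved to the _alt set under this rule, whereas the same fractions on a 500 kbp contig would not.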

S5.3 Consensus polishing using Pilon

For fine-polishing of residual PacBio base pair inaccuracies, we used Pilon¹¹. This tool uses the evidence from short-read sequencing data in order to correct single base differences, small indels, block substitution events, and gaps. Illumina 2x150 bp shotgun HiSeq 2500 paired-end sequencing data (SRR5408395; see S4.2 “HiRise scaffolding of PacBio-MARVEL assembly”) was aligned with bowtie2¹² against the scaffolds of dd_Smed_g4 and the corresponding alignments were provided as input to Pilon. Since Pilon was designed mainly for smaller prokaryotic genomes, we ran it over each of the assembled scaffolds separately. To assess the assembly improvements, variants were called from the scaffolding PacBio data provided by Dovetail Genomics with samtools/bcftools before and after Pilon correction. Correction was assessed by counting InDels and SNPs per kbp, and independent rates were calculated for exons, genic regions and polished (dd_Smed_g4) vs. raw assembly (PacBio-MARVEL). After Pilon polishing there was a 2.24-fold reduction of indels in intergenic, a 3.25-fold reduction in exonic and a 1.94-fold reduction in genic regions. The changes in SNP counts were minor, indicating that the detected differences correspond mostly to true variants.
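The fold reductions quoted above are ratios of variant densities before and after polishing. The sketch below shows that calculation with hypothetical helper names; the example counts in the usage note are invented purely for illustration and are not taken from the study.

```python
def variants_per_kbp(variant_count, region_length_bp):
    """Variant density in events per kbp for one region class."""
    if region_length_bp <= 0:
        raise ValueError("region length must be positive")
    return variant_count / (region_length_bp / 1000.0)


def fold_reduction(rate_before, rate_after):
    """Ratio of the pre-polishing to the post-polishing variant density."""
    if rate_after <= 0:
        raise ValueError("post-polishing rate must be positive")
    return rate_before / rate_after
```

With made-up counts of 448 indels before and 200 indels after polishing in the same 100 kbp of sequence, `fold_reduction(variants_per_kbp(448, 100_000), variants_per_kbp(200, 100_000))` evaluates to roughly 2.24.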

S6: RNAseq and Transcriptomes

S6.1 RNA extraction and sequencing

For total RNA isolation, 30 sexual animals (~1 cm, 2 weeks starved) of the inbred genome sequencing Smed strain were homogenized in 20 ml Trizol reagent (Invitrogen) on ice using a Miccra D-8 homogenizer (Miccra) and processed using the Trizol reagent protocol. Following phase separation by chloroform addition and isopropanol precipitation, the RNA pellet was dissolved in 350 µl RLT buffer and purified according to the RNeasy Mini Kit protocol, including on-column DNase I digestion. The RNA was eluted in nuclease-free water and stored at -80 °C. The total RNA was quantified using a NanoDrop 1000 spectrophotometer and RNA integrity was checked by capillary electrophoresis on a Bioanalyzer using the RNA 6000 Nano Kit. The sample was poly-A selected and sequenced on an Illumina NextSeq 500 sequencer.

S6.2 Smed_v6 transcriptome assembly: Raw data, assembly parameters, filtering, annotations

The transcriptome assembly of the asexual strain has been published before¹³. Briefly, we used Trinity¹⁴ to assemble RNA-Seq data from a mixed set of publicly available data sources (Dresden, Berlin, Muenster). Transcripts were filtered for contaminants, directionality-corrected and subsequently annotated and filtered by ORF-content, domains, and RefSeq-protein BLAST hits to define a likely Protein Coding Fraction (PCF). For many applications, we further sub-filter the longest isoform of a Trinity graph component. Such uses of the transcriptome are indicated by a “PCFL” suffix, e.g. “Smed_v6.PCFL”. Details about the dataflow and the companion website to access the data are documented in the PlanMine¹³ publication.

S6.3 Smes_v1 transcriptome assembly: Raw data, assembly parameters, filtering, annotations

The transcriptome assembly Smes_v1 of the sexual strain of Smed was assembled from SRR955511 using the same pipeline as for Smed_v6, except that the strand correction step and ordering of transcript numbers according to expression level were omitted. ORF directionality is therefore randomized in the Smes_v1 version. The PCFL-suffix, e.g., “Smes_v1.PCFL” designates the longest graph component subset of the protein coding fraction, as defined above.

S7: Assembly Quality Control

S7.1 Transcriptome back-mapping

De novo assembled transcriptomes of the asexual (Smed_v6) and sexual strains (Smes_v1) available at the PlanMine¹³ resource were used to assess the completeness of the dd_Smed_g4 assembly. For this purpose, the longest isoform per Trinity graph component of the protein-coding fraction (PCFL) was selected and mapped to the dd_Smed_g4 assembly using the GMAP aligner¹⁵, with a minimum query coverage cut-off of 0.6 and a minimum identity cut-off of 0.6. The resulting mappings were then used to investigate uniquely mapping, multi-mapping, and non-mapping transcripts, along with putative chimeric transcripts.
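The three mapping categories can be derived by counting, per transcript, the alignments that pass both cut-offs. The following is a minimal sketch under the assumption that the alignments have already been parsed into `(transcript_id, query_coverage, identity)` tuples; the function name and input layout are hypothetical, and the detection of chimeric transcripts is not covered.

```python
from collections import defaultdict


def classify_transcripts(all_ids, alignments, min_cov=0.6, min_ident=0.6):
    """Label each transcript as 'unique', 'multi-mapping' or 'non-mapping'.

    alignments: iterable of (transcript_id, query_coverage, identity) tuples,
    with coverage and identity given as fractions in [0, 1].
    """
    passing = defaultdict(int)
    for tid, cov, ident in alignments:
        if cov >= min_cov and ident >= min_ident:
            passing[tid] += 1

    classes = {}
    for tid in all_ids:
        n = passing.get(tid, 0)
        if n == 0:
            classes[tid] = "non-mapping"
        elif n == 1:
            classes[tid] = "unique"
        else:
            classes[tid] = "multi-mapping"
    return classes
```

Calling the same function with `min_cov=0.9, min_ident=0.9` would correspond to the stricter sub-filter applied to the high confidence subset in S7.3.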

S7.2 Genome completeness analysis

The two back-mapped transcriptomes were used to estimate dd_Smed_g4 assembly completeness. The fact that 98.7% of transcripts of the sexual (Smes_v1) and 97.3% of the asexual (Smed_v6) transcriptome mapped to the dd_Smed_g4 assembly under our mapping criteria demonstrates that the genome is largely complete. The remaining non-mapping transcripts (538 or 667 in the Smes_v1 and Smed_v6 transcriptomes, respectively) could represent true positives, i.e. genes genuinely missing from the dd_Smed_g4 assembly. Alternatively, they could represent transcriptome contaminants (e.g., from commensal organisms or other sources of sequencing noise) and thus false positives. To distinguish between these two possibilities, we BLAST searched (blastp) all non-mapping Smes_v1 and Smed_v6 transcript ORFs against NCBI RefSeq and recorded the species from which the best blastp hit of each transcript originated. We then checked if any species was specifically over-represented for non-mapping transcripts using the enricher function from the ClusterProfiler R package¹⁶. In this regard we found over-representation of the protist species Tetrahymena thermophila, Ichthyophthirius multifiliis, and Paramecium tetraurelia in Smes_v1, whilst Caenorhabditis elegans and Bos taurus were over-represented in Smed_v6. After the annotation of these transcripts as likely contaminants, we further classified the remaining genes as likely missing genes if they uniquely mapped to the SmedSxl v4.0¹⁷ assembly and were annotated in PlanMine¹³ as having orthologues in at least 5 other planarian species. The remaining transcripts that did not fit the contaminant or likely missing gene criteria were marked as “unknown” and are likely enriched for assembly noise.

Altogether, our analysis identified 46 out of a total of 31,966 Smed transcripts as genuinely missing from the dd_Smed_g4 assembly, amounting to a completeness of 99.9985%. Further, all missing genes could be identified on small contigs that had been removed by our initial assembly filtering. We therefore conclude that based on the mapping of previous transcriptome assemblies, the genome is of high quality. In contrast, if we consider Smes_v1 transcripts that map to our assembly (dd_Smed_g4), but do not map to the published Sanger genome assembly (SmedSxl v4.0)¹⁷, then we identify 2,193 such transcripts of which 1,229 also were annotated in PlanMine as having homologues in at least 5 other planarian species. Overall, this analysis therefore demonstrates conclusively that the dd_Smed_g4 assembly is the most complete Smed genome assembly to date.

S7.3 Transcript-based genome assembly quality control

We also used transcriptome backmapping to further probe the quality of the genome assembly. Since an entire de novo assembled transcriptome cannot provide an error-free “ground truth” data set, we first sub-selected a high confidence set of Smed transcripts from our Smes_v1 de novo transcriptome assembly on PlanMine¹³. Specifically, we aligned primary ORFs (i.e. the longest non-overlapping open reading frame of each transcript) against the 7 other in-house assembled planarian proteomes currently available in PlanMine. The search was performed with blastp using the BLOSUM45 similarity matrix and an e-value cut-off of 1E-3. Blastp results were filtered for subject and query coverage of at least 90 % in at least 5 other transcriptomes. Altogether, this analysis identified 1,509 Smed ORFs that had clear full-length homologues in multiple other planarian species and therefore represented high confidence cDNAs.
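The coverage-and-species filter can be sketched as follows. This assumes that the blastp output (already restricted by the e-value cut-off) has been parsed into `(orf_id, species, query_coverage, subject_coverage)` tuples; the function name and tuple layout are hypothetical and chosen only for illustration.

```python
def high_confidence_orfs(hits, min_cov=0.90, min_species=5):
    """Return ORFs with full-length hits in at least `min_species` species.

    hits: iterable of (orf_id, species, query_coverage, subject_coverage),
    with coverages as fractions in [0, 1].
    """
    support = {}
    for orf_id, species, qcov, scov in hits:
        # Require both query and subject to be covered almost end to end.
        if qcov >= min_cov and scov >= min_cov:
            support.setdefault(orf_id, set()).add(species)
    return {orf for orf, species_set in support.items()
            if len(species_set) >= min_species}
```

Counting distinct species rather than hits ensures that several paralogous hits within one proteome do not inflate the support for an ORF.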

To stringently probe the quality of the dd_Smed_g4 assembly using the high confidence cDNA (HC-cDNA) subset, we further sub-filtered the previous GMAP transcriptome mapping results (see section S7.1) for a minimum query coverage of 0.9 and a minimum identity cut-off of 0.9 and categorized the HC-cDNA as uniquely mapping, non-mapping, or multi-mapping. As shown in Extended Data Fig. 2a, 1,432 out of the 1,509 (94.90 %) transcripts mapped uniquely to the dd_Smed_g4 assembly, despite the high stringency and coverage filter criteria. Therefore, this provides a strong indication that the underlying gene sequences were correctly assembled. Of the 10 non-mapping transcripts (0.66 %), only 1 represented a genuine assembly mistake located at a contig end, whereby 4 of the 5’-exons were inverted and appended to the 3’-end of the gene (Extended Data Fig. 2b, c). Also, 2 of the non-mapping transcripts represented GMAP false positives (both were BLAT-mappable with >0.9 query coverage and identity) and the other 7 were transcripts already identified as “missing gene (4)”, “unknown (2)” or “putative contaminant (1)”. Together, the high confidence gene analysis therefore confirms both the high assembly quality and completeness of the dd_Smed_g4 assembly.

Furthermore, 67 out of the 1,509 HC-cDNA (4.44 %) were classified as multi-mapping. Manual inspection demonstrated that all of these genes indeed had multiple (usually 2) “perfect” map locations in the dd_Smed_g4 assembly. Some, e.g. dd_Smes_v1_25458 (Extended Data Fig. 2d), mapped on opposite strands within non-repetitive and high-quality regions of the genome, thus representing likely biological gene duplications. However, other multi-mapping locations, like that of dd_Smes_v1_28816, were situated in low-confidence regions close to assembly gaps (identified by bedtools closest¹⁸), indicated by lack of a respective Quiver track in that region (Extended Data Fig. 2e). Indeed, we found that the 67 multi-mapping transcripts mapped closer to contig ends than uniquely mapping high confidence genes (Extended Data Fig. 2f, g) and the size of the duplicated genomic region was below 5 kbp in all cases. The size of the duplicated regions was determined by mapping transcripts containing intronic sequences to the dd_Smed_g4 assembly using GMAP with parameters --chimera-margin=200, --min-trimmed-coverage=0.9 and --min-identity=0.9. If the transcript with intronic sequences still mapped to more than one locus in the genome, a flanking window of 2 kbp (1 kbp up- and 1 kbp downstream) was added to the transcript sequence and GMAP was run again. This process was repeated until only one location was found for a given transcript.
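The iterative window extension can be outlined as a simple loop. In this sketch the aligner is abstracted as a caller-supplied function `count_mappings(start, end)`, a hypothetical stand-in for extracting the genomic sequence of the window and re-running GMAP; the round limit and the handling of contig boundaries are likewise assumptions.

```python
def resolve_locus(start, end, contig_length, count_mappings,
                  flank=1000, max_rounds=50):
    """Grow [start, end) by `flank` bp per side until it maps to one locus.

    count_mappings(start, end) is a placeholder for re-running the aligner
    on the extended sequence and returning the number of loci found.
    Returns (start, end, rounds) on success, or None if the window cannot
    be extended further or the round limit is reached.
    """
    for rounds in range(max_rounds + 1):
        if count_mappings(start, end) <= 1:
            return start, end, rounds
        new_start = max(0, start - flank)
        new_end = min(contig_length, end + flank)
        if (new_start, new_end) == (start, end):
            break  # window already spans the whole contig
        start, end = new_start, new_end
    return None
```

The length of the last window that still mapped to multiple loci gives an upper bound on the size of the duplicated region.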

Altogether, this indicates that the majority of multi-mapping transcripts represent over-assembly artefacts in the dd_Smed_g4 assembly that are too small to be detected by the Chicago/HiRise scaffolding method. Likely, these contig end mistakes arise as a consequence of duplicated assembly of haplotypes and represent a known problem in assembling highly heterozygous genomes¹⁹.

However, their low overall proportion minimally impacts the quality of the dd_Smed_g4 assembly, and the various quality control tracks provided in the PlanMine genome browser (e.g., gap locations, mappability, quivered intervals) provide tools for the interpretation of dubious cases.

Inter and intra-contig similarity in Extended Data Figure 2c-e was visualized using miropeats²⁰ with the described thresholds.

S7.4 Contamination check

Although all animals used for genome sequencing were exposed to a stringent antibiotic treatment prior to DNA extraction, the sequencing of whole animal extracts always comes at the risk of co-sequencing parasitic or commensal organisms. To assay for the putative presence of non-Smed contigs in our dd_Smed_g4 assembly, we screened all contigs for GC-content and coverage (this method was recently used to identify contamination in the original Hypsibius dujardini tardigrade genome assembly)²¹. Coverage of dd_Smed_g4 contigs was computed by first aligning reads from SRR954965 and SRR849810 to dd_Smed_g4 using bowtie2¹², and estimating coverage per scaffold with genomeCoverageBed¹⁸. GC content was determined using infoseq from the EMBOSS tools suite²². We could not detect major contig clusters that deviated from the Smed GC-content or haploid/diploid read coverage, thus indicating lack of bacterial or other contamination in the assembly (not shown).
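A GC-content versus coverage screen of this kind can be approximated in a few lines. The sketch below is a simplified stand-in rather than the procedure used here: it replaces the inspection of contig clusters with fixed, illustrative thresholds around the median, and it assumes per-contig mean coverage has already been computed. All names and threshold values are hypothetical.

```python
from statistics import median


def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_outliers(contigs, gc_tol=0.10, cov_fold=4.0):
    """Flag contigs whose GC or coverage deviates from the assembly median.

    contigs: dict mapping name -> (sequence, mean_coverage).
    gc_tol and cov_fold are illustrative thresholds, not published values.
    """
    gcs = {name: gc_fraction(seq) for name, (seq, _) in contigs.items()}
    med_gc = median(gcs.values())
    med_cov = median(cov for _, cov in contigs.values())
    flagged = []
    for name, (_, cov) in contigs.items():
        gc_off = abs(gcs[name] - med_gc) > gc_tol
        cov_off = cov > med_cov * cov_fold or cov < med_cov / cov_fold
        if gc_off or cov_off:
            flagged.append(name)
    return flagged
```

In practice, plotting GC fraction against coverage for all contigs and looking for separate clusters is more informative than hard thresholds, since contaminant genomes typically form their own cloud.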

S8: UCSC Genome Browser

To facilitate access to the dd_Smed_g4 assembly, we built a custom version of the “UCSC Genome Browser”. Further, we integrated the browser into PlanMine, the planarian transcriptome platform maintained by our group¹³. The genome enhancement thus further improves the utility of PlanMine towards its mission objective as community platform for the planarian model system.

The UCSC Genome Browser²³ source code was downloaded from GitHub (https://github.com/ucscGenomeBrowser/kent) and several tracks were made available, which are briefly described in the following.

S8.1 GEM-Mappability track

Overall genome mappability was determined by the GEM mappability program²⁴ (GEM-binaries-Linux-x86_64-core_i3-20130406-045632). Briefly, using the fast mapping-based algorithm, it produces mappability scores for different lengths of NGS reads on the genome. A higher GEM-score at a given position indicates a higher chance of uniquely mapping a read of the respective length to that position of the genome. Regions with lower scores are less unique and will likely produce ambiguous mappings for the given read length. Generally, duplicated and repetitive regions generate lower mappability scores. One exception is repetitive regions in the assembly that were not error-corrected by the PacBio Quiver tool (see S8.5). These regions appear more unique due to the higher error rate and tend to have high mappability scores, but can easily be spotted by the absence of a Quiver track.

To generate the mappability track, standard parameters along with a read length of 200 were applied. After that, the output files were converted to bigwig using standard parameters. The “mappability_200” track is available under “Mapping and Sequencing Tracks” at the custom UCSC Genome Browser.
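To illustrate what a mappability score expresses, the toy function below scores each k-mer start position as the reciprocal of the number of times that k-mer occurs in the sequence. This is only a sketch of the concept: it considers exact matches on the forward strand, keeps everything in memory, and is not the GEM algorithm, which uses an index and tolerates mismatches.

```python
from collections import Counter


def kmer_mappability(genome, k):
    """Score each k-mer start position as 1 / (exact occurrence count).

    A score of 1.0 marks a k-mer that is unique in `genome`; repeated
    k-mers receive proportionally lower scores.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]
```

For the toy sequence `"ACGTACGT"` and `k=4`, the first and last positions share the k-mer `ACGT` and therefore score 0.5, while the three positions in between score 1.0.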

S8.2 RepeatMasker – Repbase track

In order to provide information about known repeats (i.e. RepBase), a RepeatMasker run using parameters -species “schmidtea mediterranea” and -s (for sensitive) was performed on the dd_Smed_g4 assembly. Also, RepeatMasker²⁵ was configured using rmblastn (v2.2.28+) and TRF (v4.09)²⁶. The output information is available at the custom UCSC Genome Browser under the “Variation and Repeats” section, in the track named “RepeatMasker Repbase”.

S8.3 RepeatMasker – RepeatModeler track

In order to also provide de novo repeat prediction, RepeatModeler (v1.0.8)²⁷ was run on the dd_Smed_g4 assembly. Standard parameters were used and the tool was configured with the RepeatMasker patched version of RECON (v1.08)²⁸, RepeatScout (v1.0.5)²⁹, TRF (v4.09)²⁶, and rmblastn (v2.2.28+). The resulting track is available in the “Variation and Repeats” section under “RepeatMasker-denovo”.

S8.4 Alternative Scaffolds track

The alternative scaffolds track highlights genomic regions for which an alternative haplotype exists. Each alternative haplotype (indicated with “_alt” suffix) is available in the browser (described in detail in section S12.3).

S8.5 Quiver track

Quiver status, described in detail in section S5.1, is represented genome wide on this track. Lack of Quiver coverage indicates that a region was not error corrected and that the underlying sequence is to be interpreted with caution.

S8.6 PacBio coverage track

PacBio raw data (P4/C2 and P6/C4) were mapped to dd_Smed_g4 using bwa-mem³⁰ with the -x pacbio option. Using samtools³¹ and BEDTools genomeCoverage¹⁸, coverage information was generated and displayed on this track.

S8.7 Transcripts track

PlanMine¹³ Smed assembled transcripts (Smes_v1 and Smed_v6) were mapped to the dd_Smed_g4 assembly using GMAP (described in section S7.1).

Coverage information was obtained by mapping the raw data used for the PlanMine transcriptome assemblies to dd_Smed_g4 using STAR³² with default parameters; samtools³¹ and BEDTools genomeCoverage¹⁸ were used to build the track.

S8.8 Links to PlanMine

The custom UCSC Genome Browser version is embedded in PlanMine, available at http://planmine.mpi-cbg.de. Also, each Smed transcript is linked between PlanMine and the UCSC Genome Browser.

S9: Repetitive Element Prediction

S9.1 De novo prediction of repetitive elements

An initial prediction of repetitive element classes was performed using RepeatModeler (v1.0.8)²⁷ and the resulting library was classified using RepeatMasker (v4.0.6)²⁵ using libraries from RepBase (20160829 snapshot)³³. The RepeatModeler library contained large numbers of models that could not be classified (classification ‘Unknown’) and which, together, masked 25.29 % of the assembly. Upon manual inspection, high proportions of the models were chimeric or contained elements on varying strands. Removal of these models from the library increased masking ascribed to other repeat classes, also suggesting conflict between the models produced by RepeatModeler.

A library was manually created to ameliorate these issues by compiling repeat element classes from different sources, or from RepBase where no differences were noted between masking with RepBase entries in comparison to generated models. Simple repeats and DNA elements were taken directly from RepBase using the RepeatMasker “queryRepeatsDatabase.pl” script. Masking of LINE elements with the RepBase entries was poor in comparison to using models produced by RepeatModeler (4.39% versus 10.98% genome masking, respectively) and the latter were included in the library. LTR retroelement models were produced through a separate pipeline (see S10.1 “LTR element consensus construction”).

A hard-masked assembly was generated using this library and this assembly was again used as input into RepeatModeler. Masking from models classified as ‘Unknown’ was reduced to 14.39 %. Again, manual inspection revealed these contained high proportions of fragmentary and chimeric models (1,502 of 1,768 models) and these were excluded from downstream analysis.

For all masking, RepeatMasker was configured with rmblastn (v2.2.27+) and TRF (v4.09)²⁶ and was run in sensitive (-s) mode.

S10: LTR Analysis

S10.1 LTR consensus construction

LTR elements were identified through an approach combining outputs from 4 prediction programs: LTR_FINDER (v1.0.6)³⁴ (run with and without inbuilt masking of highly repetitive sequences), LTRharvest (GenomeTools v1.5.9)³⁵, MGEScan-LTR³⁶, and Red³⁷. Where programs expose the options, maximum element size constraints were set to 50 kbp and minimum LTR sizes to 150 bp. As this was program-dependent, however, all identified features were further filtered on the presence of >150 bp LTRs and to be between 2 and 50 kbp in length. Passing features overlapping >=50% were merged using BEDOPS (v2.4.20)³⁸ to give the locations of final predictions.
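The size filter and the overlap-based merge can be sketched in plain Python. This is an illustration rather than the BEDOPS-based procedure: the text does not state how the 50% overlap is measured, so the sketch assumes it refers to the shorter of the two features, and all function names and the input layout are hypothetical.

```python
def passes_filter(start, end, ltr5_len, ltr3_len):
    """Keep elements with both LTRs > 150 bp and a total length of 2-50 kbp."""
    length = end - start
    return ltr5_len > 150 and ltr3_len > 150 and 2000 <= length <= 50000


def merge_predictions(features, min_overlap=0.5):
    """Merge predictions from different programs that overlap sufficiently.

    features: list of (contig, start, end) with half-open coordinates.
    Overlap is measured relative to the shorter feature (an assumption).
    """
    merged = []
    for contig, start, end in sorted(features):
        if merged:
            m_contig, m_start, m_end = merged[-1]
            overlap = min(end, m_end) - max(start, m_start)
            shorter = min(end - start, m_end - m_start)
            if (contig == m_contig and shorter > 0
                    and overlap / shorter >= min_overlap):
                merged[-1] = (m_contig, m_start, max(end, m_end))
                continue
        merged.append((contig, start, end))
    return merged
```

Filtering first and merging afterwards mirrors the order described above, so that only features satisfying the LTR and length criteria contribute to the final locations.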

Predicted elements were translated in all frames and scanned using hmmsearch (HMMER v3.1b2)³⁹ for the 'rve' (PF00665.21), 'RVP' (PF00077.15), and 'RVT_1' (PF00078.22) Pfam HMMs⁴⁰, representing integrase (IN), protease (PR), and reverse transcriptase (RT), respectively. Matching regions were extracted for all elements and separate alignments built for IN, PR, and RT using MUSCLE (v3.8.31)⁴¹. For elements containing all three HMM features, alignments were concatenated and a neighbour-joining tree built with MUSCLE. Clusters were generated using tree2clusters (USEARCH v9.0.2132)⁴² using a minimum identity threshold of 80%. Not all elements contained all three HMM features and were represented in the tree, thus clusters were manually refined based on the individual trees to maximise inclusion and to exclude elements with sizes >=2 SDs from the mean.

Nucleotide sequences for all elements of each cluster were extracted from the assembly and aligned with MAFFT (v7.271)⁴³. Neighbour-joining trees were built using MUSCLE and any sub-clusters identified using USEARCH were separated. Levitsky consensus sequences were calculated and annotated using an automated pipeline within UGENE (v1.24.2)⁴⁴. Concatenated IN, PR, and RT sequences from the consensuses were added to the previous alignment and the tree inspected to ensure representation of all elements. Finally, this was confirmed by using the consensus sequences as a custom library to mask the assembly using RepeatMasker and by comparison to the initial predictions.

To streamline further analysis, the LTRs were replaced with consensus sequences formed for each of the (sub-) clusters individually. To match the methods and nomenclature used in RepBase, the LTR and internal sequences were separated for the RepeatMasker library.

S10.2 Gypsy LTR element annotation

Consensus elements were annotated with the locations associated with Pfam models used in their discovery ('rve', 'RVP', and 'RVT_1') and additionally with 'RNase_H' (PF00075.19). tRNA models predicted by tRNAscan-SE (v1.4)⁴⁵ were scanned against the consensus elements using blastn (v2.2.30+)⁴⁶ and the best matching region annotated. HMMs compiled from the GyDB 2.0 collection⁴⁷ were also tested, alongside the entire PfamA and PfamB HMM sets. CCHC-type zinc finger annotations, known to be associated with retroviral gag nucleocapsid (NC) proteins, could be identified; however, further hits were of low confidence and were omitted. Additional putative domains were predicted using ScanProsite⁴⁸ and CD-search⁴⁹.

S10.3 The Burro family

We detected 14 classes of gypsy LTR elements in the Smed genome, all of which share the central protease (PR), reverse transcriptase (RT), ribonuclease H (RNase) and integrase (INT) module of Ty3-gypsy-like retroelements. However, a phylogenetic analysis of the conserved PR/RT/INT/RNase module could not assign any of the 14 classes to known retroelement classes. The most striking structural feature of the Smed retroelements is the massive sequence expansion of groups 4 (Burro-1), 5 (Burro-2) and 13 (Burro-3), which results in LTR elements of 25 kbp flanked by 5 kbp LTRs (compared to the 5-7 kbp of typical vertebrate LTR elements). The Ogre elements found in plants so far provide the only example of similarly-sized LTRs, but priming by tRNA(Pro) instead of tRNA(Arg) in Ogre elements and the lack of an open reading frame upstream of the likely nucleocapsid protein (zinc knuckle domains) indicate that the Smed elements are likely not related to Ogre elements⁵⁰. Further, the presence of putative nuclease and BIR-domains that are not part of the typical domain complement of Metaviridae indicates that the giant Smed gypsy LTR elements likely represent novel branches of the Metaviridae, the unusual complexity of which harbours the potential for complex life cycles or interactions with the host transposon defence machinery.

S10.4 LTR element expression analysis

LTR element expression was estimated by TETranscripts⁵¹. Briefly, TETranscripts estimates the expression of both transposable elements (TE) and gene transcripts, taking as inputs TE and gene annotations and alignment files (i.e. bam files). RNA-Seq reads (SRR5408394) were mapped to the genome using STAR (v2.5.2a_modified)³² with --winAnchorMultimapNmax 100, --outFilterMultimapNmax 100.

A custom GTF containing coordinates of Smed_v6 transcripts that uniquely mapped to the genome by GMAP¹⁵ (--chimera-margin=200, --min-trimmed-coverage=0.6 and --min-identity=0.6), in addition to the genome, was used as input to build the STAR index. After the read alignment, a table with raw counts was generated using TETranscripts (default parameters).

S10.5 LTR element divergence analysis

Kimura substitution levels of masked regions against their corresponding model are automatically calculated by RepeatMasker and output by default. These were compiled for each (sub-) cluster and plotted as a means of assessing their divergence. LTR element age was calculated using mutation rates estimated for C. elegans⁵² (per generation: 2.70E-09 / per year 2.46375E-07). Assuming C. elegans mutation rates, LTR element invasion was calculated to have occurred ~10 to 60 million years ago (Supplementary Table 1).

S10.6 Solo/paired LTR element abundance analysis

We have estimated the proportion of solo LTRs and complete or nearly complete LTR/Gypsy elements. Solo LTRs were defined as events that lack any trace of internal region and have a length of at least 60% of the specific consensus sequence. Complete or nearly complete repeat elements were defined as events that have LTRs flanking an internal region and an overall length (LTRs plus internal region) of at least 60% of the consensus sequence.
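The two definitions translate into a short classification rule. The sketch below assumes that each event has already been summarised by whether an internal region was detected, how many LTRs flank it, and its total length; the function name, the argument layout, and the extra "fragment" label for events meeting neither definition are illustrative choices.

```python
def classify_ltr_event(has_internal, ltr_count, event_length,
                       ltr_consensus_len, full_consensus_len, min_frac=0.60):
    """Classify a masked event as 'solo', 'complete' or 'fragment'.

    'fragment' is a catch-all label introduced here for events that meet
    neither of the two definitions.
    """
    if not has_internal:
        # Solo LTR: no internal region, at least 60% of the LTR consensus.
        if event_length >= min_frac * ltr_consensus_len:
            return "solo"
        return "fragment"
    # Complete or nearly complete: LTRs on both sides of an internal
    # region, at least 60% of the full element consensus.
    if ltr_count >= 2 and event_length >= min_frac * full_consensus_len:
        return "complete"
    return "fragment"
```

For instance, with an illustrative 5 kbp LTR consensus and 25 kbp full consensus, a 3.2 kbp event without internal sequence would count as a solo LTR, and a 16 kbp event with two flanking LTRs as a nearly complete element.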

S10.7 Scaffold end analysis

To examine the effect of different repeat classes on the existing assembly, their occurrence-frequency at scaffold ends was compared to the expected chance occurrence of the class. For each scaffold in the assembly, the terminal 1 kbp ends were examined for overlap with repeat annotations using BEDTools¹⁸ and scored as repetitive if more than 600 bp were covered by repeat annotations. For calculating the sum of repetitive sequence length, only the actual bases overlapping the repeat annotation were counted. Chance occurrence was determined by examining a set of 1,000 random genomic regions (1 kbp each) for repeat overlap annotation as above. Scaffold termini were excluded from the random genomic regions set.
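The scoring of scaffold ends can be sketched without BEDTools as a count of window bases covered by at least one repeat annotation. The helper names and the half-open coordinate convention below are assumptions made for illustration; the same `covered_bases` function could be applied to randomly drawn 1 kbp windows to estimate chance occurrence.

```python
def covered_bases(window_start, window_end, repeats):
    """Count bases in [window_start, window_end) covered by any repeat.

    repeats: iterable of (start, end) half-open intervals on the scaffold.
    Overlapping annotations are counted only once.
    """
    clipped = sorted((max(s, window_start), min(e, window_end))
                     for s, e in repeats
                     if s < window_end and e > window_start)
    total, cur_end = 0, window_start
    for s, e in clipped:
        s = max(s, cur_end)
        if e > s:
            total += e - s
            cur_end = e
    return total


def scaffold_end_calls(scaffold_length, repeats, window=1000, min_covered=600):
    """Return (left_is_repetitive, right_is_repetitive) for one scaffold."""
    left = covered_bases(0, min(window, scaffold_length), repeats)
    right = covered_bases(max(0, scaffold_length - window),
                          scaffold_length, repeats)
    return left > min_covered, right > min_covered
```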

S11: AAT-Repeat Analysis

S11.1 AAT-microsatellite prediction and read alignment break analysis

The analysis of microsatellites was based on a tandem repeat finder annotation of the dd_Smed_g4 assembly. All variations of AAT* patterns were converted into a MARVEL interval track and incorporated into a MARVEL dd_Smed_g4 contig database. Patched P6/C4 reads (coverage 20x) were mapped with DALIGNER on the dd_Smed_g4 assembly. For each of the AAT* regions, the alignments of P6/C4 reads were analyzed and the fraction breaking within the AAT* region versus all reads spanning 500 bases left and right of the AAT* region was determined. Breaking and spanning reads were binned (bin size: 180 bp) according to length of the AAT* repeat within the interval [100, 1000]. Bin 1: 100-280 bp, avg 166 bp; bin 2: 280-460 bp, avg 351 bp; bin 3: 460-640 bp, avg 536 bp; bin 4: 640-820 bp, avg 719 bp; bin 5: 820-1000 bp, avg 900 bp.
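The length binning can be expressed as a small helper. Because the listed bin ranges share their boundaries (e.g. 280 bp appears in bins 1 and 2), the sketch assumes that a boundary value belongs to the higher bin, except for 1000 bp, which is kept in bin 5; the function name is hypothetical.

```python
def aat_length_bin(repeat_length, low=100, high=1000, bin_size=180):
    """Return the 1-based length bin of an AAT* repeat, or None if outside.

    Bins are [100, 280), [280, 460), [460, 640), [640, 820), [820, 1000];
    the handling of boundary values is an assumption.
    """
    if not (low <= repeat_length <= high):
        return None
    n_bins = (high - low) // bin_size
    index = (repeat_length - low) // bin_size
    # Clamp so that the upper limit (1000 bp) falls into the last bin.
    return min(index, n_bins - 1) + 1
```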

S11.2 Microsatellite repeat length estimation by Circular Consensus Sequencing (CCS)

In order to determine the origin of the length variations causing the read alignment breaks over AT-regions, circular consensus sequencing (CCS)⁵³ reads with a minimal genome coverage (<1x) were generated. For the 2 kbp CCS library, 2 µg of AMPure XP bead purified high molecular weight genomic DNA was sheared to 2 kbp fragments using clear miniTUBEs on the Covaris S2 instrument according to the manufacturer's conditions. The majority of fragments were about 3.9 kbp in size. Of this, 750 ng of sheared genomic DNA went into PacBio SMRTbell library preparation as described in “Procedure and Checklist – 2 kbp template preparation and sequencing” with the exception that the blunt sequencing adapter was used in a final concentration of 5 µM. Primer dimers and short fragments were removed after library preparation by 2x 0.6X AMPure XP bead washing steps. SMRTbell libraries were loaded onto PacBio RSII SMRT cells after primer annealing and polymerase binding, applying the MagBead loading protocol. Sequencing was done on the RSII instrument with P6 polymerase and C4 sequencing chemistry and 360 min movies to generate polymerase reads that lead to circular consensus sequences (CCS). The sequences were submitted to the SRA under accession SRR5408393.

S11.3 Overlap length variation in CCS and P6/C4 reads

CCS raw reads and patched P6/C4 reads were both mapped to the dd_Smed_g4 assembly with DALIGNER. The CCS consensus reads were generated by pbccs with the use of the default parameters and were only used to detect unique mapping locations for CCS reads, while ambiguously mapped P6/C4 reads were excluded from the analysis.

From this, AAT repeats were predicted as follows:

1. AAT-regions were selected by choosing regions that can be spanned (~100 bp anchor right and left of AAT region) by uniquely mapping (see S11.3) circular consensus sequences (CCS). This resulted in 1,141 unique regions in the dd_Smed_g4 assembly, with length bins ranging from 300–1000 bp.

2. To eliminate sequencing coverage differences between different regions of the genome, 5 random CCS and 5 random P6/C4 reads were selected for each of those 1,141 dd_Smed_g4 regions.

3. Boxplots for the fraction of CCS overlap length reads vs overlap length in CCS consensus read and fraction of P6/C4 overlap length vs dd_Smed_g4 overlap length were created.

4. The plot includes only length bins in the range of 400-800, because

a. There are only a small number of events in bins of 900-1,000

b. For bin 300 the difference between unique and AAT regions is very small as AAT regions can have up to 200 anchor bases.

The variance between the CCS reads and unique segments of the genome, compared to the variance between standard P6/C4 reads and unique segments of the genome, was below the error rate of standard P6/C4 PacBio reads. The same was the case for the variance between the CCS reads and AAT* regions compared to the variance between standard P6/C4 reads and AAT* regions. However, if the AAT* loci within the dd_Smed_g4 assembly did vary, it would be expected that this variance was outside the expected error rate of PacBio reads. Single regions were found that did have a high variance. However, at this point we cannot exclude that these reflect haplotype differences. Therefore, a biological source for the variance within AAT* loci could not be found and it can only be concluded that the variance is of technical origin, resulting from pairwise alignment of repeat regions.

S11.4 Genomic context of AT-rich regions

To characterise the genomic location of AT-rich microsatellite repeats, genomic features were defined on the basis of the mapping results of the Smes_v1 transcriptome13 against the dd_Smed_g4 assembly. Then, locations of the previously defined AT-rich microsatellite repeats were extracted using IntersectBed18. An overlap of at least 60 % of the query (>100 bp events) to a given genomic feature was required to be scored as either intergenic, intronic or exonic.
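The 60 % overlap rule can be illustrated with a small pure-Python sketch. It does not reproduce the IntersectBed call used in the study; the function names, the half-open coordinate convention and the tie-breaking by largest overlap are assumptions made for illustration.

```python
def overlap_fraction(query, feature):
    """Fraction of the query interval (start, end) covered by the feature interval.

    Coordinates are assumed to be half-open and on the same scaffold.
    """
    q_start, q_end = query
    f_start, f_end = feature
    overlap = max(0, min(q_end, f_end) - max(q_start, f_start))
    return overlap / (q_end - q_start)


def classify_repeat(repeat, features, min_fraction=0.6, min_len=100):
    """Return the label of the feature covering >= min_fraction of the repeat, else None.

    features: list of (start, end, label), label in {"intergenic", "intronic", "exonic"}.
    Only repeats longer than min_len bp are scored (">100 bp events").
    """
    if repeat[1] - repeat[0] <= min_len:
        return None
    best = None
    for start, end, label in features:
        frac = overlap_fraction(repeat, (start, end))
        if frac >= min_fraction and (best is None or frac > best[0]):
            best = (frac, label)
    return best[1] if best else None
```

Because the threshold exceeds 50 %, at most one non-overlapping feature can qualify for a given repeat, so the tie-breaking only matters if features overlap each other.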

S12: Assembly Heterozygosity

S12.1 Drosophila assembly

For the comparison between the dd_Smed_g4 assembly and a D. melanogaster assembly, we used 50x coverage of the publicly available SRA dataset (SRX4999318) with a read length cut-off of 14 kbp. The resulting standard MARVEL assembly had a contig N50 of 20 Mbp and the longest contig was 28 Mbp. The assembly displayed almost no alternative paths (bubbles) in the overlap graph, indicating a high level of homozygosity within the dataset. A representative 1.7 Mbp portion of the longest contig was used to illustrate the differences between D. melanogaster and Smed in Fig. 2e.
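For readers unfamiliar with the contig N50 statistic quoted above: it is the length of the contig at which the cumulative sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch (the function name is hypothetical and this is not part of MARVEL):

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # First contig at which the cumulative length reaches half of the total.
        if 2 * running >= total:
            return length
    return 0
```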

S12.2 Dot plot analysis

The analysis and corresponding dot plot in Fig. 2f were generated by MUMmer (v3.2.3)54.

S12.3 Coverage analysis alt vs. main contigs

To quantify genome-wide coverage differences between alternative and main contigs, we only considered indels as regions of unambiguous difference between alt and main contigs. For each indel region, the fraction of breaking read alignments versus the coverage of spanning reads was quantified. Only positions of non-AT-rich indels of length >99 bases were considered. The coverage of breaking and spanning alignments was limited to [5, 25] and indel lengths had to be consistent at each position. The analysis was based on MARVEL assembled contigs (no correction, no scaffolding), and on the trimmed P6/C4 reads. 1,509 alternative contigs (bubbles) that mapped to 1,495 unique locations on 695 contigs in dd_Smed_g4 were considered (14 alternative regions (~1 %) represented non-uniform structures such as stacked alternative contigs). Alternative contigs had an average of 13 break events (including AAT* patterns and inserts of any size) with a median spacing of 6,676 bp.

S13: Gene Annotation

S13.1 Gene annotation and gene number estimate

The transcriptome of the sexual strain (Smes_v1, PCFL) was mapped to the dd_Smed_g4 assembly using GMAP15 with a minimum query coverage of 0.6 and minimum identity of 0.6. Our transcriptome contains ~31,000 non-redundant and likely protein-coding transcripts. This number corresponds well to a previous estimate of gene numbers in the planarian genome54. However, the actual number of Smed genes is likely substantially lower due to fragmented transcripts or transcribed repeat elements.

S13.2 Gene structure

Gene structure statistics were calculated for a set of high-confidence transcripts, identified on the basis of i) unique mapping location in the dd_Smed_g4 assembly and ii) having a homologue in at least 4 other species. In total, 12,413 transcripts fulfilled all criteria. The median / mean transcript size was 1,545 / 1,875 bp and the mapped length 1,534 / 1,866 bp, respectively. The median / mean gene size was 7,473 / 15,395 bp, the number of introns 4 / 5.76, and the intron length 1,329 / 2,694 bp, respectively.

S13.3 Orthology analysis

To establish an orthology model of Smed genes, we performed a pair-wise BLAST of (a) 20 selected Ensembl reference proteomes and (b) the proteome of Smed. To define the latter, we used the Smes_v1.PCFL transcriptome of the sexual strain and extracted just the primary Open Reading Frame (ORF) (i.e. longest non-overlapping ORF) of each transcript. The BLAST search was performed using blastp with a BLOSUM45 similarity matrix and an e-value cutoff of 1E-3. BLAST results were pre-filtered for subject and query coverage of at least 10 % and post-filtered for reciprocal best hits. Resulting data were clustered into orthology groups (i.e. graph components) using the igraph package in R56. Orthology groups were assessed with respect to presence/absence of a given species and/or combination of species.
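The coverage pre-filter, reciprocal-best-hit post-filter and clustering into graph components can be sketched as below. The study used the igraph package in R; this Python version with a union-find structure is a stand-in for illustration only. The input tuple layout, the use of bit score to rank hits and the exclusion of within-species hits are assumptions.

```python
from collections import defaultdict


def reciprocal_best_hits(hits, min_cov=0.10):
    """hits: iterable of (query, subject, bitscore, query_cov, subject_cov).

    query and subject are (species, protein_id) tuples. Returns a set of
    frozenset({a, b}) pairs that are each other's best hit between two species.
    """
    best = {}
    for q, s, score, qcov, scov in hits:
        if qcov < min_cov or scov < min_cov or q[0] == s[0]:
            continue  # coverage pre-filter; within-species hits ignored (assumption)
        key = (q, s[0])  # best hit of q within the subject's species
        if key not in best or score > best[key][1]:
            best[key] = (s, score)
    pairs = set()
    for (q, _species), (s, _score) in best.items():
        back = best.get((s, q[0]))
        if back is not None and back[0] == q:
            pairs.add(frozenset((q, s)))
    return pairs


def orthology_groups(pairs):
    """Cluster reciprocal-best-hit pairs into connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair in pairs:
        a, b = tuple(pair)
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(g) for g in groups.values())
```

Each returned group is one connected component of the reciprocal-best-hit graph, which corresponds to the "graph components" mentioned in the text.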

S14: Divergence Analysis

S14.1 Divergence analysis

Orthologue groups of 52 highly conserved proteins across 22 species were generated using the OrthoMCL pipeline (for details see S16.1) (Supplementary Table 2). Briefly, orthogroups that contained all species and where only one or two species had more than 1 orthologue were selected. Ambiguous amino acid residues (letter “X”) were first removed from each of the sequences. Each orthologue set was then aligned using PRANK57 (v10603, parameters: 10 iterations, +F) and trimmed using trimAl58 (v1.4rev15, parameters: gappyout). The orthologue sets were then concatenated to form a supermatrix using a Perl script. Maximum likelihood analysis was performed using RAxML (v8.2.9)59 with the PROTCATLG model. The topology of non-Platyhelminthes species was constrained using a guide tree (based on the phylogeny by Cannon et al. 2016)60.
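The concatenation step was done with a Perl script that is not shown in this document. The Python sketch below only illustrates the general idea of building a supermatrix from per-gene alignments; the function names are hypothetical, and padding a missing species with gap characters is an assumption (the selected orthogroups contained all species, so padding may never have been needed).

```python
def strip_ambiguous(sequence):
    """Remove ambiguous residues (letter 'X') from an unaligned protein sequence."""
    return sequence.replace("X", "")


def concatenate_alignments(alignments, gap="-"):
    """alignments: list of dicts {species: aligned_sequence}, one dict per orthologue set.

    Returns {species: concatenated_sequence}; all rows have equal length.
    """
    species = sorted({sp for aln in alignments for sp in aln})
    supermatrix = {sp: [] for sp in species}
    for aln in alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = widths.pop()
        for sp in species:
            # Assumption: a species missing from one alignment is padded with gaps.
            supermatrix[sp].append(aln.get(sp, gap * width))
    return {sp: "".join(parts) for sp, parts in supermatrix.items()}
```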

S15: Synteny Analysis

S15.1 Synteny comparisons

We used the dd_Smed_g4 assembly as the reference genome and computed pairwise genome alignments to other Platyhelminthes (query) genomes using lastz61 (v1.03.54; parameters K=2400 L=3000 Y=3400 H=2000, HoxD55 scoring matrix) and the chain/net pipeline62 (parameters chainMinScore 1000, chainLinearGap loose). In addition, we used highly-sensitive local alignments with lastz parameters K=1500 L=2500 and W=5 to find co-linear alignments in the un-aligning regions that are spanned by local alignments (gaps in the chains)63. The same procedure was applied to align the M. lignano genome64 to other Platyhelminthes including Smed. For comparison, we also aligned the human genome against frog (xenTro7 assembly), zebrafish (danRer10) and lamprey (petMar2).

S16: Planarian Specific Gene Sequences

S16.1 Identification of planarian-specific genes

To search for putative novel or planarian-specific genes in Smed, we made use of OrthoMCL, which groups putative orthologues and paralogues using a Markov Chain algorithm65. Here, we applied OrthoMCL (v2.0.9) to the analysis of a set of 27 species (Supplementary Table 3) representing the taxonomic groups Annelida (n=1), Arthropoda (n=4), Chordata (n=3), Cnidaria (n=1), Ctenophora (n=1), Mollusca (n=2), Nematoda (n=2) and Platyhelminthes (n=13, which included the transcriptomes of n=6 planarian species). To generate the input similarity matrix for OrthoMCL, we carried out an all-against-all reciprocal blastp analysis amongst the predicted protein sets of the 27 species with -matrix “BLOSUM45” and -evalue 1e-5 parameters. OrthoMCL was run with a parameter of “-I 1.5”. Only OrthoMCL orthologue/paralogue groups comprised exclusively of flatworm entries were considered for further analysis.

Next, the Smed sequences (derived from Smes_v1.PCFL ORFs) were extracted from flatworm-specific orthogroups and further sub-filtered according to the following criteria:

1. ORF length equal or greater than 160 amino acids.

2. Similarity (blastp -evalue lower or equal to 1e-15) to at least one other distant planarian species (Dlac, Pten, Pnig or Ptor).

3. No RefSeq BLAST hit (-evalue lower or equal to 1e-3) in non-Platyhelminthes species (RefSeq BLAST database snapshot from May 24th 2016).

4. Mapping location in the dd_Smed_g4 assembly has low overlap with repeats (less than 60 % of ORF length).

5. Mapping location in the dd_Smed_g4 assembly has no other transcripts mapped within 50 bp upstream and 150 bp downstream in order to exclude fragmented transcripts. Manual inspection confirmed that the latter two filtering criteria greatly reduced the presence of transcript fragments representing poorly conserved protein regions, yet the presence of some such sequences in the final dataset is possible.
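The five criteria above amount to a conjunction of simple thresholds, which can be written as one predicate. The record layout and all field names below are hypothetical; only the numeric thresholds are taken from the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    # Hypothetical per-transcript summary; the fields mirror criteria 1-5 in order.
    orf_length_aa: int
    distant_planarian_evalue: Optional[float]    # best blastp hit in Dlac/Pten/Pnig/Ptor, None if no hit
    non_flatworm_refseq_evalue: Optional[float]  # best RefSeq hit outside Platyhelminthes, None if no hit
    repeat_overlap_fraction: float               # fraction of ORF length overlapping repeats
    has_neighbouring_transcript: bool            # transcript within 50 bp upstream / 150 bp downstream


def passes_filters(c: Candidate) -> bool:
    """Return True if the candidate satisfies all five criteria."""
    return (
        c.orf_length_aa >= 160
        and c.distant_planarian_evalue is not None
        and c.distant_planarian_evalue <= 1e-15
        and (c.non_flatworm_refseq_evalue is None or c.non_flatworm_refseq_evalue > 1e-3)
        and c.repeat_overlap_fraction < 0.60
        and not c.has_neighbouring_transcript
    )
```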

Altogether, 1,165 transcripts passed the filtering and we therefore refer to this data set as flatworm-specific genes (Supplementary Table 4). Given the focus on transcribed sequences and the additional application of ORF length criteria, these largely represent actively transcribed, protein-coding genes that have no detectable sequence similarity outside the phylum by standard BLAST criteria (step 3). Our analysis underestimates the content of “new” genes in the Smed genome, since recently repeat-derived sequences are excluded by filtering (step 4) and the requirement for similarity in distant planarian species (step 2) removes recently generated or rapidly evolving gene sequences. Given the significant sequence divergence between planarian species at the transcriptome level (not shown), it is in fact very likely that many such genes exist and that planarians therefore constitute an interesting model system for studying the evolution of new genes.


S16.2 Domain predictions

Domains of flatworm-specific genes were predicted using SMART66 (“Normal mode” with the checked options “Outlier homologues and homologues of known structure”, “PFAM domains”, “signal peptides” and “internal repeats”) and InterProScan67 with SUPERFAMILY and Pfam applications.

S16.3 Expression analysis

To check the expression of flatworm-specific genes in Smed, transcript expression data sets for embryogenesis (SRA accession No.: SRR3629913 - SRR3629944), regeneration (SRR2051616 - SRR2051633) and stem cells (SRR2051616 - SRR2051633) were fetched from the NCBI SRA database.

All data were mapped to the dd_Smed_g4 assembly using STAR32 (standard parameters) and a custom GTF file with genomic coordinates of mapped ORFs. Differential expression analysis was carried out using the DESeq268 package for datasets with replicates (stem cells and embryogenesis). A log fold change greater than 0.5 or lower than -0.5 was used as cut-off for the regeneration dataset.
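The fold-change cut-off for the dataset without replicates can be illustrated as follows. The source does not state the logarithm base or whether a pseudocount was used; base 2 and a pseudocount of 1 are assumptions of this sketch, and both function names are hypothetical.

```python
import math


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of two expression values; the pseudocount avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


def call_change(lfc, cutoff=0.5):
    """Classify a log fold change as 'up', 'down' or 'unchanged' using +/- cutoff."""
    if lfc > cutoff:
        return "up"
    if lfc < -cutoff:
        return "down"
    return "unchanged"
```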

S17: Gene Loss in Planarians

S17.1 Lost genes identification and verification

The detection of high confidence gene loss events in Smed was performed using an in-house custom pipeline. Briefly, we performed a pair-wise reciprocal BLAST search against (a) Ensembl reference proteomes (see Supplementary Table 3 for database versions) and (b) 7 proteomes predicted from assembled planarian transcriptomes (including both Smed_v6 and Smes_v1). The inclusion of other planarian transcriptomes provided an important guard against false positive loss events due to accidental absence from one transcriptome. Orthology group clusters lacking any planarian species were then further filtered and validated using the dd_Smed_g4 genome assembly, as described in detail below:

Step 1 - Pairwise reciprocal BLAST and coverage filtering

For the pairwise reciprocal BLAST search, just the primary ORF (i.e. longest non-overlapping open reading frame) of the longest transcript of each Trinity graph component was used as input. BLAST searches were performed using blastp with a BLOSUM45 substitution matrix and an e-value cutoff of 1e-3.

Step 2 - Orthology group construction

BLAST results were pre-filtered for subject and query coverage of at least 10 % and post-filtered for reciprocal best hits. Resulting data were clustered into orthology groups (i.e. graph components) using the igraph package in R56.

Orthology groups were assessed with respect to species presence/absence, and groups containing at least 9 non-flatworm species, no Smed and no other planarian species were considered to be putatively lost in Smed.
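The presence/absence rule of Step 2 can be expressed as a short set operation. The function name and the species labels in the usage example are hypothetical; the threshold of at least 9 non-flatworm species follows the sentence above.

```python
def is_putatively_lost(group_species, flatworms, planarians, min_non_flatworm=9):
    """Return True if an orthology group qualifies as putatively lost in Smed.

    group_species: species present in the orthology group.
    flatworms: all Platyhelminthes species in the analysis.
    planarians: the planarian subset of flatworms (including Smed).
    """
    species = set(group_species)
    non_flatworm = species - set(flatworms)
    has_planarian = bool(species & set(planarians))
    return len(non_flatworm) >= min_non_flatworm and not has_planarian
```

For example, a group with nine non-flatworm species plus a parasitic flatworm would qualify, whereas the same group with any planarian member would not.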

Step 3 - Genomic loss verification using exonerate

To independently verify the loss of such genes in the Smed genome, all other protein members of planarian-deficient orthology groups were mapped onto the dd_Smed_g4 assembly using exonerate (--showvulgar no --showalignment no --model protein2genome -s 40 --showtargetgff yes --percent 2). Genes were considered as genuinely lost if no point in the genome was covered by overlapping exonerate alignments of more than 2 different non-planarian species. This threshold was necessary in order to filter for spurious alignments, and residual 2-hit positions invariably had neither RNAseq coverage nor coding potential, as evidenced by a lack of blastx hits against the NCBI nr database of those intervals.
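The "no point covered by more than 2 different species" rule can be checked with a sweep over alignment start and end coordinates. This is an illustrative sketch, not the in-house pipeline: the input tuple layout, half-open coordinates and function names are assumptions.

```python
from collections import Counter, defaultdict


def max_species_depth(alignments):
    """alignments: iterable of (scaffold, start, end, species), half-open coordinates.

    Returns the largest number of distinct species whose alignments overlap
    at any single genomic position.
    """
    events = defaultdict(list)
    for scaffold, start, end, species in alignments:
        events[scaffold].append((start, 1, species))
        events[scaffold].append((end, -1, species))
    deepest = 0
    for scaffold_events in events.values():
        active = Counter()
        # At equal positions, ends (-1) sort before starts (+1), so intervals
        # that merely touch are not counted as overlapping.
        for _pos, delta, species in sorted(scaffold_events, key=lambda e: (e[0], e[1])):
            active[species] += delta
            if active[species] == 0:
                del active[species]
            deepest = max(deepest, len(active))
    return deepest


def is_genuinely_lost(alignments, max_species=2):
    """True if no position is covered by alignments of more than max_species species."""
    return max_species_depth(alignments) <= max_species
```

Counting distinct species rather than raw alignments means that several overlapping hits from the same proteome do not push a locus over the threshold.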

Overall, our pipeline bases the definition of gene loss on the absence of detectable sequence similarity both in 6 independently assembled planarian transcriptomes and in the Smed genome. Further, the limitation to orthogroups with >9 non-flatworm members generally restricts the analysis to genes that are highly conserved at the sequence level. Although false positives are consequently highly unlikely, we cannot rule out the presence of homologues that, only in planarians, have diverged beyond the point of detectable homology by conventional sequence comparisons.

Step 4 - Validation of lost gene candidates

The protein translations of putatively lost genes were additionally manually searched against Annelid, Molluscan and Platyhelminth transcriptomes using tblastn (parameters: -wordsize 2 -matrix BLOSUM45 -soft_masking TRUE -evalue 100). Hits were inspected manually to verify absence in the respective species (Supplementary Table 5). The fact that we did not recover the previously reported loss of the centrosome proteins69 indicates that our focus on deeply conserved genes with a high degree of sequence conservation likely underestimates the extent of gene loss in Smed.

S17.2 Essentiality assessment of lost genes

The human and/or mouse homologues of genes lost in Smed were used to query the Database of Essential Genes (DEG v13.3)70. Positive hits were annotated as “essential” in Supplementary Table 5.

S17.3 Mad1/Mad2 multiple sequence alignments

Multiple sequence alignments for Mad1 and Mad2 were generated using the Constraint-based Multiple Alignment Tool (COBALT)71 (standard parameters) and resulting alignments were rendered using the BoxShade website with standard parameters (http://www.ch.embnet.org/software/BOX_form.html). Furthermore, the COBALT alignments were imported into Geneious72 (v9.1.5 Build 2016-06-11 21:21, http://www.geneious.com) and sequence similarity matrices between species were computed based on the BLOSUM62 similarity matrix (threshold 0). The similarity matrices were rendered using matrix2png73.

S18: Planarian SAC Function

S18.1 Quantification of mitotic index in dissociated cell preparations

RNAi and nocodazole treatment

RNA interference was performed as previously described74 with some minor modifications. In vitro synthesized dsRNA was diluted 1:20 in H2O and quantified against a column purified standard eGFP control dsRNA (1 µg/µl) on a 0.7 % TAE agarose gel using ImageStudio Lite (v5.2.5, LI-COR Biosciences). The quantified dsRNA was then added to liver paste to a final concentration of 1 µg/µl and frozen at -80 °C in small aliquots. The addition of red food dye as tracer was omitted, because it generated strong fluorescent background in the rhodamine channel (see “Imaging and data analysis”).

For each experimental replicate, 20 animals (~0.5 mm, asexual CIW4 strain) were fed 3x with dsRNA food every third day. On the eighth day, worms were split into 2 batches of 10 animals and either transferred to dishes with 50 ng/ml nocodazole + 1 % DMSO (v/v) in planarian water (+ gentamycin) or 1 % DMSO (v/v) in planarian water (+ gentamycin) as a vehicle control. The worms were treated for 24 hours and then macerated.

Animal maceration, proteinase K treatment and formaldehyde fixation

Animals were passively macerated in 4 ml of maceration solution (glycerol:glacial acetic acid:H2O in a ratio of 1:1:13 (w/v/v))75 for 10 min on the bench at RT, followed by 10 min on a vertical rotator at RT until the tissue was completely dissociated. The lysate was filtered through CellTrics® 50 µm filters (Sysmex, Cat. No.: 04-0042-2317). Depending on the size of the animals, the cell suspension was diluted 1:4 (small, 7-8 mm) to 1:10 (large, 16-18 mm) using maceration solution and 120 µl of the diluted cell suspension was transferred into each well of a black 96-well glass bottom plate (Greiner Bio-One, Cat. No.: 655090). For each condition, five or six separate wells were used as technical replicates. The cells were pelleted for 5 min at 2,000 x g at RT. Then 44 µl of 16 % formaldehyde (PFA) in 1X PBS were added and the cells were fixed for 20-25 min at RT and rinsed once with 200 µl 1X PBS, followed by a wash in 1X PBS for 5 min. After fixation, the cells were treated with 2 µg/ml Proteinase K in 1X PBS for 10 min at RT, rinsed once with 1X PBS and post-fixed with 4 % PFA in 1X PBS for 10 min at RT. Then, the cells were rinsed once in 1X PBS and washed in PBSTx0.1 (1X PBS; 0.1 % (v/v) Triton X-100) for 5 min.
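As a quick plausibility check of the first fixation step, the final formaldehyde concentration follows from the dilution relation C_final = C_stock x V_stock / (V_stock + V_sample). The sketch below assumes that the full 120 µl of cell suspension remained in the well when the 44 µl of 16 % stock were added; the text does not state whether supernatant was removed after pelleting. The function name is hypothetical.

```python
def final_concentration(stock_conc, stock_volume, sample_volume):
    """Concentration after adding stock_volume of stock to sample_volume (same volume units)."""
    return stock_conc * stock_volume / (stock_volume + sample_volume)


# 44 µl of 16 % formaldehyde added to 120 µl of cell suspension (assumed volume in the well)
fixative_percent = final_concentration(16.0, 44.0, 120.0)  # approximately 4.3 %
```

Under this assumption the first fixation is carried out at roughly 4.3 % formaldehyde, close to the 4 % used for the post-fixation step.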

Riboprobe hybridization, washing

The riboprobe hybridization buffers were prepared according to Pearson et al.76. The cells were incubated 5 min in 1:1 (v/v) PBSTx0.1:WashHyb and subsequently blocked in PreHyb for 45-60 min at 56 °C. PreHyb was removed and replaced with pre-warmed Hyb containing DIG-labeled riboprobe against smedwi-1 and hybridized over 16 h at 56 °C. Hyb was removed and the cells were briefly rinsed with pre-warmed WashHyb. The cells were then successively washed once 5 min with pre-warmed 1:1 SSC/WashHyb, followed by 2x 15 min 2X SSC + 0.1 % (v/v) Triton X-100, 2x 15 min 0.2X SSC + 0.1 % (v/v) Triton X-100. Then, the cells were brought to RT and rinsed and washed once with PBSTx0.1 for 10 min. Endogenous peroxidase activity was inactivated with 100 mM sodium azide in PBSTx0.1 for 15 min at RT and rinsed once and washed twice with PBSTx0.1 for 10 min each.

Tyramide signal amplification

The cells were blocked 1 h with 5 % (v/v) horse serum + 0.5 % (v/v) Roche Western Blocking Reagent (RWBR; Roche, Cat No.: 11921673001) in PBSTx0.1 at RT and then incubated with anti-DIG-POD Fab fragments (Roche, Cat No.: 11207733910) in 1 % (v/v) horse serum / 0.1 % (v/v) RWBR in PBSTx0.1 for 2 h at RT. After washing 3x 20 min with PBSTx0.1 the signal was developed with rhodamine-tyramide in TSA buffer (2 M NaCl, 0.1 M boric acid, pH 8.5) containing 0.006 % H2O2 for 10 min.

Antibody staining

Antibody staining was essentially performed as previously described with some modifications77. After removing the TSA buffer and washing 3x for 10 min with PBSTx0.1, primary antibody (1:1,000 anti-Histone H3 antibody (rabbit, phospho S10 + T11) [E173]; Abcam, Cat. No.: ab32107) in 1 % goat serum + 0.1 % RWBR was added and incubated 16 h at 4 °C. After washing 4x 10 min with PBSTx0.1, the solution was replaced with secondary antibody in 1 % (v/v) goat serum + 0.1 % (v/v) RWBR (1:1,000 goat anti-rabbit IgG (H+L), Alexa Fluor 647; ThermoFisher Scientific, Cat. No.: A-21245) and DAPI (1 µg/ml) and incubated 2-3 h at RT in the dark. The secondary antibody was washed off for 3x 20 min with PBSTx0.1 and the plates were stored in 0.02 % (w/v) sodium azide in 1X PBS at 4 °C in the dark.

Imaging and data analysis

The plates were imaged on an Operetta High-Content Imaging System (PerkinElmer) using a 20x lens and the following excitation/emission filters: DAPI (Ex. 360-400 / Em. 410-480 nm), Rhodamine (520-550 / 560-630 nm) and far red (620-640 / 650-760 nm). For each well, five separate locations were imaged. Image analysis was performed using the integrated Harmony High Content Imaging and Analysis Software and FIJI78.

Statistical analysis

To compare group means (± S.D.) for the effect of RNA interference of SAC components on the ability of the planarian neoblasts to elicit a SAC response upon nocodazole treatment, a two-way analysis of variance (ANOVA) was performed and, in cases of significance, followed by Dunnett’s test.

Two-way ANOVA revealed a significant effect of SAC component RNAi on the activation of the SAC by nocodazole treatment (F (4, 24) = 124.7, P < 0.0001). For the tested RNAi targets, rod, zwilch and zw10 yielded significant reductions of anti-H3ser10P-positive cells (M-phase marker) upon nocodazole treatment (P < 0.0001), while bub3 did not differ from control RNAi (egfp) (P = 0.78). RNAi itself did not significantly alter the baseline number of cells in M-phase in the DMSO controls.

All statistical analyses were performed using GraphPad Prism version 7.0c for Mac OS X, GraphPad Software, La Jolla California USA, www.graphpad.com.

S19: Miscellaneous Experimental Methods

S19.1 Animal husbandry

Sexual (S2) and asexual (CIW4) strains of Smed were kept in plastic containers in 1X Montjuïc salt water with 25 mg/L gentamycin sulfate. The animals were fed homogenized organic calf liver paste.

S19.2 Transcript cloning

Target cDNA fragments were PCR-amplified from Smed cDNA using KAPA HiFi HotStart DNA Polymerase (Kapa Biosystems), TAE agarose gel purified and cloned into pPR-T4P79. The cloned insert was used for preparation of riboprobe templates for in situ hybridization and dsRNA production for RNA interference. The PlanMine transcript IDs and gene specific forward and reverse primers used for cloning are provided in the Supplementary Table 7.

S19.3 Whole mount in situ hybridization

Whole mount in situ hybridization (WISH) was essentially performed as previously described76,80.

S19.4 Karyotyping

For karyotyping, animals (~5 mm) of sexual Smed were fed three times every third day using calf liver paste. The third day after the last feed, the planarian water was supplemented with 50 ng/ml nocodazole (Sigma Cat # M1404) and 1 % (v/v) DMSO (Sigma Cat # 276855) for a final 24 h incubation period. Subsequently, 15 animals were macerated in 5 ml 6.6 % (v/v) glacial acetic acid supplemented with 5 µg/ml Hoechst 34580 (ThermoFisher Scientific Cat # H21486). Single cell dispersion was achieved by carefully rotating the maceration solution for the last 10 min of the 20 min total incubation period. Subsequently, 50 µl of the cell suspension was pipetted as a single drop on a 0.17 mm thick / No. 1.5 borosilicate coverslip (Menzel). To assist chromosome spreading, 100 µl of 0.05 % (v/v) Triton X-100 solution was added drop-wise to the slide. Chromosome spreads were aged at 37 °C for 60 min until the liquid completely evaporated. Coverslips were mounted in 80 % (v/v) glycerol prior to imaging. Chromosomes were imaged on an Andor Revolution WD Borealis confocal spinning disc system: The Olympus IX83 stand was equipped with an Andor iXon Ultra 888 EMCCD camera and an Olympus 150x U Apochromat NA 1.45 Oil immersion objective. Hoechst 34580 stained chromosomes were excited with a 405 nm laser diode and a 452/45 bandpass filter was used to detect the emission light. Image stacks were 3D deconvolved using ImageJ/Fiji plugins78,81 and finally maximum projected for better visualization.

Supplementary Tables

Supplementary Table 1 - Calculated Kimura distances and age for Smed LTR elements

Supplementary Table 2 - MARVEL computational requirements & assembly stats

Supplementary Table 3 - Conserved genes used for divergence analysis

Supplementary Table 4 - Species and database versions used for orthology analysis

Supplementary Table 5 - Planarian specific genes

Supplementary Table 6 - Lost genes & essentiality analysis

Supplementary Table 7 - PlanMine identifiers for transcripts and primers used for cloning in this study

References

1. Garcia-Fernàndez, J., Baguñà, J. & Saló, E. Genomic organization and expression of the planarian homeobox genes Dth-1 and Dth-2. Development 118, 241–253 (1993).

2. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–8 (2009).

3. Putnam, N. H. et al. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res 26, 342–50 (2016).

4. Sheffner, A. L. The reduction in vitro in viscosity of mucoprotein solutions by a new mucolytic agent, N-acetyl-L-cysteine. Ann N Y Acad Sci 106, 298–310 (1963).

5. Murray, M. G. & Thompson, W. F. Rapid isolation of high molecular weight plant DNA. Nucleic Acids Res 8, 4321–5 (1980).

6. Adams, M. D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–95 (2000).

7. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27, 722–736 (2017).

8. Myers, G. Efficient Local Alignment Discovery amongst Noisy Long Reads. Algorithms Bioinformatics WABI 2014 Lect Notes Comput Sci 8701, 52–67 (2014).

9. Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238 (2012).


10. English, A. C. et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7, e47768 (2012).

11. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963 (2014).

12. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–9 (2012).

13. Brandl, H. et al. PlanMine--a mineable resource of planarian biology and biodiversity. Nucleic Acids Res 44, D764-73 (2016).

14. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–52 (2011).

15. Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–75 (2005).

16. Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16, 284–7 (2012).

17. Robb, S. M. C., Gotting, K., Ross, E. & Sánchez Alvarado, A. SmedGD 2.0: The Schmidtea mediterranea genome database. Genesis 53, 535–546 (2015).

18. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–2 (2010).

19. Kelley, D. R. & Salzberg, S. L. Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biol 11, R28 (2010).

20. Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput Appl Biosci 11, 615–619 (1995).

21. Koutsovoulos, G. et al. No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proc Natl Acad Sci 113, 5053–5058 (2016).

22. Rice, P., Longden, I. & Bleasby, A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet 16, 276–277 (2000).

23. Kent, W. J. et al. The human genome browser at UCSC. Genome Res 12, 996–1006 (2002).

24. Derrien, T. et al. Fast computation and applications of genome mappability. PLoS One 7, e30377 (2012).


25. Smit,A.,Hubley,R.&Green,P.RepeatMaskerOpen-4.0.Availableat:

http://www.repeatmasker.org.

26. Benson,G.Tandemrepeatsfinder:aprogramtoanalyzeDNAsequences.

NucleicAcidsRes27,573–80(1999).

27. Smit,A.&Hubley,R.RepeatModelerOpen-1.0.Availableat:

http://www.repeatmasker.org.

28. Bao,Z.&Eddy,S.R.Automateddenovoidentificationofrepeatsequence

familiesinsequencedgenomes.GenomeRes12,1269–76(2002).

29. Price,A.L.,Jones,N.C.&Pevzner,P.A.Denovoidentificationofrepeat

familiesinlargegenomes.Bioinformatics21Suppl1,i351-8(2005).

30. Li,H.Aligningsequencereads,clonesequencesandassemblycontigswith

BWA-MEM.0,2(2013).

31. Li,H.etal.TheSequenceAlignment/MapformatandSAMtools.

Bioinformatics25,2078–9(2009).

32. Dobin,A.etal.STAR:ultrafastuniversalRNA-seqaligner.Bioinformatics29,

15–21(2013).

33. Bao,W.,Kojima,K.K.&Kohany,O.RepbaseUpdate,adatabaseofrepetitive

elementsineukaryoticgenomes.MobDNA6,11(2015).

34. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 35, W265-8 (2007).

35. Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).

36. Lee, H. et al. MGEScan: a Galaxy-based system for identifying retrotransposons in genomes. Bioinformatics 32, 2502–4 (2016).

37. Girgis, H. Z. Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinformatics 16, 227 (2015).

38. Neph, S. et al. BEDOPS: high-performance genomic feature operations. Bioinformatics 28, 1919–20 (2012).

39. Eddy, S. R. Accelerated Profile HMM Searches. PLoS Comput Biol 7, e1002195 (2011).

40. Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44, D279-85 (2016).

41. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792–7 (2004).

42. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–1 (2010).

43. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30, 772–80 (2013).

44. Okonechnikov, K., Golosova, O., Fursov, M. & UGENE team. Unipro UGENE: a unified bioinformatics toolkit. Bioinformatics 28, 1166–7 (2012).

45. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25, 955–64 (1997).

46. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).

47. Llorens, C. et al. The Gypsy Database (GyDB) of mobile genetic elements: release 2.0. Nucleic Acids Res 39, D70–D74 (2011).

48. de Castro, E. et al. ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Res 34, W362-5 (2006).

49. Marchler-Bauer, A. & Bryant, S. H. CD-Search: protein domain annotations on the fly. Nucleic Acids Res 32, W327-31 (2004).

50. Macas, J. & Neumann, P. Ogre elements - A distinct group of plant Ty3/gypsy-like retrotransposons. Gene 390, 108–116 (2007).

51. Jin, Y., Tam, O. H., Paniagua, E. & Hammell, M. TEtranscripts: A package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics 31, btv422 (2015).

52. Denver, D. R. et al. A genome-wide view of Caenorhabditis elegans base-substitution mutation processes. Proc Natl Acad Sci 106, 16310–16314 (2009).

53. Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res 38, e159 (2010).

54. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol 5, R12 (2004).

55. Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18, 188–96 (2008).

56. Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Sy, 1695 (2006).

57. Löytynoja, A. & Goldman, N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320, 1632–5 (2008).

58. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).

59. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–3 (2014).

60. Cannon, J. T. et al. Xenacoelomorpha is the sister group to Nephrozoa. Nature 530, 89–93 (2016).

61. Harris, R. S. Improved pairwise alignment of genomic DNA. (Ph.D. Thesis. The Pennsylvania State University, 2007).

62. Kent, W. J., Baertsch, R., Hinrichs, A., Miller, W. & Haussler, D. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100, 11484–9 (2003).

63. Hiller, M. et al. Computational methods to detect conserved non-genic elements in phylogenetically isolated genomes: application to zebrafish. Nucleic Acids Res 41, e151 (2013).

64. Wasik, K. et al. Genome and transcriptome of the regeneration-competent flatworm, Macrostomum lignano. Proc Natl Acad Sci USA 112, 12462–7 (2015).

65. Li, L., Stoeckert, C. J. & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13, 2178–89 (2003).

66. Letunic, I., Doerks, T. & Bork, P. SMART: recent updates, new developments and status in 2015. Nucleic Acids Res 43, D257-60 (2015).

67. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–40 (2014).

68. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014).

69. Azimzadeh, J., Wong, M. L., Downhour, D. M., Alvarado, A. S. & Marshall, W. F. Centrosome Loss in the Evolution of Planarians. Science 335, 461–463 (2012).

70. Luo, H., Lin, Y., Gao, F., Zhang, C.-T. & Zhang, R. DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements. Nucleic Acids Res 42, D574-80 (2014).

71. Papadopoulos, J. S. & Agarwala, R. COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 23, 1073–9 (2007).

72. Kearse, M. et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28, 1647–9 (2012).

73. Pavlidis, P. & Noble, W. S. Matrix2png: a utility for visualizing matrix data. Bioinformatics 19, 295–296 (2003).

74. Rouhana, L. et al. RNA interference by feeding in vitro-synthesized double-stranded RNA to planarians: methodology and dynamics. Dev Dyn 242, 718–30 (2013).

75. David, C. N. A quantitative method for maceration of hydra tissue. Wilhelm Roux Arch Entwickl Mech Org 171, 259–268 (1973).

76. Pearson, B. J. et al. Formaldehyde-based whole-mount in situ hybridization method for planarians. Dev Dyn 238, 443–450 (2009).

77. Newmark, P. A. & Sánchez Alvarado, A. Bromodeoxyuridine Specifically Labels the Regenerative Stem Cells of Planarians. Dev Biol 220, 142–153 (2000).

78. Schindelin, J. et al. Fiji: an open-source platform for biological-image analysis. Nat Methods 9, 676–82 (2012).

79. Rink, J. C., Gurley, K. A., Elliott, S. A. & Sánchez Alvarado, A. Planarian Hh Signaling Regulates Regeneration Polarity and Links Hh Pathway Evolution to Cilia. Science 326, 1406–1410 (2009).

80. King, R. S. & Newmark, P. A. In situ hybridization protocol for enhanced detection of gene expression in the planarian Schmidtea mediterranea. BMC Dev Biol 13, 8 (2013).

81. Kirshner, H., Aguet, F., Sage, D. & Unser, M. 3-D PSF fitting for fluorescence microscopy: implementation and localization application. J Microsc 249, 13–25 (2013).
