The Worm Genome and Transcriptome Assembly
Genome and Transcriptome Annotation
Transcripts Involved in Regeneration in M. lignano
Additional Findings
RNAseq
0 3 6 12 24 48 72hours
RNAextraction
DEseqamputation
0h 3h 6h 12h 24h 48h 72h5 0 5
Value
Color Key
MAPK signalling pathwayLIF LIFR
JAK STAT3
Grb2 Mek
PI3K Akt
Erk1/2
Klf4
c-Myc
Tbx3Sox2
Nanog
Oct4
Jak-Stat signalling pathway
MAPK signalling pathway
ActivinNodal ACVRI/II SMAD2/3 Proliferation
DNA
DNA
Bmp4 BMPRI/II
SMADs
SMAD4
Erk1/2p38
ID
DUSP9
TGFβRTGFβ
TGFβ signalling pathway
Wnt Frizzled Dvl GSK-3β β-cateninAxin APC
Wnt signalling pathway
FGF2 FGFR Ras Raf Mek Erk1/2
IGF IGF-IR PI3K Akt
DNA
DNA
DNA
DNA
DNA
DNA
Core transcriptional network
c-MycMAPK signalling pathway
PI3K-Akt signalling pathway
Pluripotency pathways from human/murine stem cells:
Not found in M.lignano transcriptome
Found in M.lignano transcriptome
Factors with TGFβ -like domain found M.lignano transcriptome
STAT3 not found, other STATs identified in M.lignano transcriptome
KLF4 not found, other KLFs identified in M.lignano transcriptome
BMP4 not found, other BMPs identified in M.lignano transcriptome
DNA
DNA
DNA
DNA
DNA
Genes from human/murine stem cells:
0.3
max_hvu
diminutive_dme
5392611_gvu
4367212__psi
2235912_cle
8844611_ece
8320811_csp
1677112_pca
max_aqu
329761_mili
8264415_csp
6889311_pst
358631_mosp
232631_meli
1291201_ltr
3759811_pca
331811_mosp
3433911_mli
4337911_mfu
4338711_sst
max_hsa
1643511_psi
7137911_rsp
1674911_mfu
3098031_mfu
USF2_hsa
7573711_mli
4658511_mili
170531_meli
4991111_ltr9649121_mli
5204311_gvu
4338712_sst
401711_ece
1384111_pca
8002813_rsp
1535711_msc
1865211_psi
1677111_pca
myc2_hvu
S000209_sma
4247911_psi
133235_lgi
2472711_psi
9705812_mli
3981211_mfu
96815g22_mli
8721511_ltr
5733111_gvu
4658514_mili
S35429_sma
368781_mosp
mycl_hvu
6927111_pst
6637312_rsp
2694411_nco2289111_nco
7019111_pst
3029911_gvu
193181i_pca
Max_dme
500662_mili
myc_aqu
MXL1_cel
lmyc_hsa
1961211_msc
4186911_sst
1086232_mli
437061_meli
442451_meli
1131511_msc
2256811_cle
5046551_psi
173401_mosp
6012811_rsp
8769711_nco
785741_scma-785751_scma
4186914_sst
370851_mili
323911_mosp
5856311_mfu
1028284_mli
nmyc_hsa
699701_csp
1030111_mli
166474_cte
USF1_hsa
MXL3_cel
88480_lgi
740831_csp
myc1_hvu
4161311_mfu
301971_mosp
118760_cte
cmyc_hsa
8624611_mli
myc_pdu
Max_pdu
mycAl_hvu
0.837
0.891
0.9760.784
0.998
0.892
0.728
0.938
0.9180.769
0.8620.834
0.963
0.7430.695
0.923
0.861
0.81
0.9
0.974
0.727
0.789
0.98
0.918
0.995
0.930.767
0.836
0.953
0.9
0.815
0.771
0.996
1
1
0.974
0.822
0.931
0.770.907
0.783
0.903
0.998
0.787
0.919
0.996
0.962
0.991
0.828
0.969
1
0.926
0.951
0.928
1122 820
500
1439
402480
355
317
1747
535
262
281
218
581
1368
M. lig
nano
S. mediterranea C. elegans
D. m
elanogaste
r
Stephanostomum sp. -ACC----TATACGGTT---CTCT-GCCGTGTA------TATTAGT-C-ATGGT-AAGAA
Haematolechus sp. -ACC----TATACGGTT---CTCT-GCCGTGTA------TCAGTG--C-ATGGT-AAGAA
Fasciola sp. AACC----TTAACGGTT---CTCTTGCCCTGTA------TATTAGTGC-ATGGTAAAGAA
S.mansoni AACC----GTCACGGTT---TTAC--TCTTGTG------ATTTGTTGC-ATGGT-AAGAA
Echinococcus sp. CACCG--TTAATCGGTC---CTTA--CCTTGCA------ATTTTGT---ATGGT-GAGTA
M.lignano -GCCG--TAAAGACGGT---CTCTTACTGCGAAGACTCAATTTATTGC-ATGCT-CAGTA
S.med SL1 -GCCG--TTAGACGGTC---TTATCGAAATCTATAT---AAATCTTAT-ATGGT-ACGGA
S.med SL2 -GCCG--TTAGACGGTC---TTATCGAAATCTATAT---AAAAATTAT-ATGGT-GAGG
A
Stylochus sp. TGCCGTATTTGACGGTCTCAAAAATTTCGTGTTTATTGCAATAATTGCAATGGT-AAGCA
Notoplana sp. TGCCGTATTTGACGGTCTCAAAAATTTCGTGTTTATTGCAATAATTGCAATGGT-AAGCA
.** : . * : : : *** * .* *
Stephanostomum sp. TCGAA-----TTCGAC------CTATGGTCGAATAA-ATTCTTTGGCTAG-CCTCT----
Haematolechus sp. TCGAG-----TTCGACTCACATCGTTGGTCGAATAAGATTATTTGGCTAG-CCTCCACTC
Fasciola sp. TCG-------TTGGAC------CATCGGTCCAAACCCATTATTTGGCTAG-CCTCCATTC
S.mansoni CCG--------TCGAC------CAAGAATCGAAGTT--TTCTTTGGCAGC-CCTAACACA
Echinococcus sp. TCGATGCAGCTCAGGCTG-TGCCTACGGAGCTGACCCAGTATTTGGCTGGTCCTT-----
M.lignano TCGACCCAGCTTCATCAAAT-AAAAGAATGCGAATCGAATATACAGCCGAGCCCGACAAC
S.med SL1 CCG--------TTATC------CAACATTAGTTGGTTAATTTTTGACAGTCACTTGAATC
S.med SL2 CCG--------TTTGC------CAGCATTAGTTGGCTAATTTTTGACAGTAGCTTGCAT
-
Stylochus sp. TCAAAT-------GAT------CCAGTGTGATCGTCGAGTCTTTG--ACAGGCCG-----
Notoplana sp. TCAAA--------GAT------CCA-TGTGATCGTCGAGTCTTTGACACAGGCCG-----
*. . : * *: . *
Stephanostomum sp. ---TCGGGGGCTAA------ 96 Haematolechus sp. TGGTCGGGGGCTA------- 108 Fasciola sp. TG--CAGAGGCTAAGAATCC 110 S.mansoni ----CGGGG----------- 91 Echinococcus sp. ----CGAGGGCC-------- 105 M.lignano TCGGCACTGTCTGCTCCGC- 130 S.med SL1 --ACAAGTGACTAT------ 107 S.med SL2 --GCAAGTGACTAT------ 106 Stylochus sp. ----CGAGGCCTATAT---- 111 Notoplana sp. ----CAAGGCCTATTT---- 111 .. *
C.elegans
D.melanogaster
H.sapiens
S.mansoni
M.lignano 100 wormsM.lignano sorted stem/germ cellsM.lignano RNA
Sequence complexity
Norm
alize
d fre
quen
cy X
104
0 20 10040 60 800
2
4
6
8
10
12
Genome% of bases masked by TRF
M.lignano 24.8
C.elegans 6.8
D.melanogaster 4.7
H.sapiens 2.2
S.mansoni 0.3
A
**
eyes
pharynx
testes ovaries egg seminal vesicle
stylet
brain
vesiculagranulorum
femaleantrumgut
B E
Nematostella vectensis
Aurelia aurita
Hydra magnipapillata
Isodiametra pulchra
Macrostomum lignano
Schistosoma japonicum
Schistosoma mansoni
Dugesia japonica
Schmidtea mediterranea
Agropecten irradians
Lumbricus rubellus
Platynereis dumerilii
Daphnia pulex
Anopheles gambiae
Drosophila melanogaster
Caenorhabditis elegans
Caenorhabditis briggsae
Strongylocentrotus purpuratus
Ciona intestinalis
Danio rerio
Xenopus laevis
Gallus gallus
Homo sapiens
Cnidaria
Acoela
Ecdysozoa
Loph
otro
choz
oaDe
uter
osto
mia
0.1
Stenostomum sthenum
Catenula lemnaeMicrostomum lineare
Macrostomum lignano
Prorhynchus stagnalis
Geocentrophora sphyrocephalaProsthiostomum siphunculusMaritigrella crozieri
Leptoplana tremerralis
Echinoplana celerrima
Microdalyellia schmidtii
Microdalyellia fuscaMesostoma lingua
Nematoplana coelogynoporoidesMonocelis sp.
Itaspiella helgolandicaSchmidtea mediterranea
Dugesia japonicaProcotyla fluviatilis
Dendrocoelum lacteumBothrioplana semperi
Neobenedenia melleni
Gyrodactylus salaris
Taenia solium
Schistosoma japonicumSchistosoma mansoni
Clonorchis siniensis
C
D
CatenulidaMacrostomorphaLecithoepitheliataPolycladidaRhabdocoelaProseriata
TricladidaBothrioplanidaMonogeneaCestodaTrematoda
Platyhelminthes
malegonopore
adhesiveorgans
* *head
tail
DAPIEdU
TO
DE
50μm
0 100 200 300 400 500 600
05
1015
20
M. lignano K-mer model
23-mer Coverage
23-m
er fr
eque
ncy
X105
observederrorsp1= 110p2= 220p3= 330p4= 440composite
F
0 20 40 60 80 100
Cont
ig s
ize
NG(x) %
ML2 AssemblyML1 Assembly
100
100
0 1
e+06
1e+
05 1
e+04
G
A B
A B
C D
A B
The free-living flatworm, Macrostomum lignano, much like its better known planarian relative, Schmidtea mediterranea, has a nearly unlimited regenerative capacity. Following injury, this species has the ability to regenerate almost an entirely new organism. This is attributable to the presence of an abundant somatic stem cell population, the neo-blasts. These cells are also essential for the ongoing maintenance of most tissues, as their loss leads to the rapid and irreversible degeneration of the animal. This set of unique properties makes flatworms an attractive species for studying the evolution of pathways involved in self-renewal, fate specification, and regeneration. The use of Macrosto-
mum lignano, or other flatworms, as models, however, is hampered by the lack of a well-assembled and annotated genome sequence, fundamental to modern genetic and mo-lecular studies.
Here we report the genomic sequence of Macrostomum lignano and an accompanying characterization of its transcriptome. The genome structure of Macrostomum lignano is remarkably complex, with ~75% of its sequence being comprised of simple repeats and transposon sequences. This has made high quality assembly from Illumina reads alone impossible (N50=414bp). We therefore obtained 130X coverage by long sequencing reads from the PacBio platform and combined this with more than 250X Illumina coverage to create a mixed assembly with a significantly improved N50 of 64 kb.
We complemented the reference genome with an assembled and annotated transcriptome, and used both of these datasets in combination to probe gene expression patterns during regeneration, examining pathways important to stem cell function. Aditionally we found evidence of low levels of CpG methylation in Macrostomum lignano’s genome and evidence of trans-splicing in the worm’s transcriptome. Interestingly we found that flatworms lack Myc - a very conserved pluripotency factor in Bilaterians and beyond (cnidarians, poriferans). As a whole, our data will provide a crucial resource for the community for the study not only of invertebrate evolution but also of regeneration and so-matic pluripotency.
Genome and transcriptome of the regeneration-competent flatworm, Macrostomum lignano.
Wasik K.A.*1, Gurtowski J.*1, Zhou X.1,2, Ramos O.M1, Delas M.J.1,3, Battistoni G.1,3, El Demerdash O.1, Falciatori I.1,3, Vizoso D.B.4, Smith A.D.5 ,, Ladurner P.6, Scharer L.4, McCombie W.R.1, Hannon G.J.1,3 and Schatz M.1
1 Watson School of Biological Sciences, Howard Hughes Medical Institute, Cold Spring Harbor Laboratory, New York 11724, USA; 2 Molecular and Cellular Biology Graduate Program,
Stony Brook University, NY 11794; 3 Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge CB2 0RE, United Kingdom; 4 Department of
Evolutionary Biology, Zoological Institute, University of Basel, 4051 Basel, Switzerland; 5 Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089,
USA; 6 Department of Evolutionary Biology, Institute of Zoology and Center for Molecular Biosciences Innsbruck, University of Innsbruck, A-6020 Innsbruck, Austria
A. Phylogenetic analysis of 23 animal species using partial sequences of 43 genes. Fig. modified from Egger et al. (2015). B. Interference contrast image and a diagrammatic
representation of an adult Macrostomum lignano. C. Phylogenomic analysis of 27 flatworm species (21 free-living and 6 neodermatan) using >100,000 aligned amino acids.
Fig. modified from Egger et al. (2015). D. Electron micrograph of a M. lignano neoblast. Note the small rim of cytoplasm (yellow) and the lack of cytoplasmic differentiation.
er - endoplasmic reticulum; mi - mitochondria; mu - muscle; ncl - nucleolus; nu - nucleus (red). E. Immunofluorescence labeling of dividing neoblasts with EdU (red) in an
adult worm. All cell nuclei are stained with DAPI (blue). T- testes, O - ovaries, DE – developing eggs, asterisks denote eyes. F. Representation of 23-mer frequency and cover-
age in the Illumina sequencing data generated from DNA extracted from a population of adult worms. M. lignano shows unusual 4-modal 23-mer distribution. G. Comparison
of Illumina only (ML1) and Pacbio (ML2) assemblies. Contig length distribution (Log2 scale) over the M. lignano genome in the ML1 (green) and ML2 (red) assemblies. Note
that the ML1 assembly covers only about 55% of the genome.
10 20 Contig size X 10 Kbp
C1 Contig ID
miRNA count Large repetitive elementsTrancsript abundance
Tandem repeat unit size in bpTandem repeat count
GC content
Illumina coverage
0 10 20 30 40 50 60 0 10 20 30 40 50 0 10 2030
40
010
2030
40
010
20
010203040
01020
3040
0102030
40010
2030400102030400
102030
0102030
0102030
010
2030
010
2030
010
2030
0
102030
0102030
010203001020300102030010203001020300102030
0102030
0102030
010
2030
0102030
0102030
010
2030
010
2030
010
20300
1020
300
102030
01020300
1020300
102030
01020300
10
2030
010
2030
0
102030
0102030
010
2030
010
20300
1020
300
1020
30 010
2030 0
102030 0
1020 0
10 20
C1C50
10
A. Schematic representation of the experimental design: 200 worms (per replicate) underwent amputation at a level between the brain and the gonads. The heads were
allowed to regenerate, and regenerating animals were collected at different timepoints post amputation (0, 3, 6, 12, 24, 48, 72 hours). RNA-Seq libraries from each timepoint
were analyzed for differentially expressed genes. Below - a heat map of differentially expressed genes at different regeneration timepoints. Each replicate is plotted separately.
Downregulated and upregulated transcripts are labeled in green and red, respectively. Scale covers Log2 values. The samples are grouped with complete-linkage clustering
using Euclidean distance. B.Known pluripotency pathways from H. sapiens and M. musculus were adapted from the Kyoto Encyclopedia of Genes and Genomes. Factors
that had potential homologues in M. lignano are labeled.
A.Overview of the 50 largest contigs in the M. lignano genome, making up about 2.6 % of the total assembly. Different tracks denote (moving inwards): contig size X 10 Kbp;
miRNA count (1-54 mapped miRNAs); large repetitive elements (RepeatScout) (1-4476 identified repeats); transcript count (1-43 mapped transcripts); Tandem repeat unit
size in base pairs (1-500); Tandem repeat count (1-28); GC content (0-1); and Illumina coverage (4-160X). The color gradients correspond to the range of values for each
track (lower values are lighter, higher values are darker). B. Sequence complexity comparison across five organisms. Drosophila melanogaster has an abundance of very low
complexity sequence, not found in the other species. Macrostomum lignano has a sizable amount of moderately complex sequences that are not found in other species and
that do not appear to be expressed. C. Tandem Repeat Finder was run on five species to assess their tandem and low complexity sequence composition. Macrostomum
lignano had far more bases masked by Tandem Repeat Finder than the other organisms in the test set. D.The number of reciprocal blast hits against the Homo sapiens tran-
scriptome for four different species: Macrostomum lignano, Schmidtea mediterranea, Caenorhabditis elegans, and Drosophila melanogaster. Only the number of hits passing
the E-value cutoff of ≤1e-10 is shown.
A. M. lignano is lacking the Myc gene. Evolution of the of Myc and Max gene families across different repre-
sentatives of the animal phyla. Mycs and Maxs gene candidates are retrieved based on reciprocal best
BLASTp from the available transcriptomes. The distance tree was inferred using neighbor-joining based on
JTT sequence evolution model (1000 bootstrap replicates). Human USF proteins are used as an outgroup.
The Myc branch is labeled in green, the Max branch is labeled in blue. dme – D. melanogaster, hsa – H.
sapiens, lgi – L. gigantea, cte – C. teleta, hvu – H. vulgaris, aqu – A. queenslandica, cel – C. elegans, mli –
M. lignano, mfu - M. fusca, mosp – Monocelis sp, psi – P. siphunculus, ltr – L. tremelaris, ece – E. celerrima,, meli – M. lingua , msc – M. schmidtii, mili – M. lineare, nco – N. coelogynoporoides, rsp – Rhabdopleura sp., gvu – G. vulgaris, csp – Cerebratulus sp., pca – P. caudatus, sst
– S. sthenum, cle – C. lemnae, pdu – P. dumerilli, sma – S. mediterranea, scma – S. mansoni. Transcript ID is next to each phylum name. B. Trans splicing in M. lignano. Align-
ment between first 130nt of M. lignano’s putative SL RNA and SL RNAs from other flatworms. The conserved splice junction is indicated by an arrowhead. Spliced leader
sequences are labeled in blue. The potential initiator AUG (last three nucleotides of the spliced leader) is labeled in green.
Conclusions:
• We have assembled and annotated a highly repetitive genome using a mix of Pacbio and Illumina sequencing
• We have found that:
- M. lignano’s genome shows evidence of CpG methylation
- It has retained a large number of homeoboxes as compared to other flatworms
- The transcriptome shows evidence of trans-splicing
- Flatworms lost the very conserved Myc gene
• We have characterized the gene expression patterns during regeneration in M. lignano
• Wasik et al. PNAS (2015); doi: 10.1073/pnas.1516718112