SUPPLEMENTARY INFORMATION

The Genomic Basis of Parasitism in the Strongyloides Clade of Nematodes

Vicky L. Huntᵃ, Isheng J. Tsaiᵃ, Avril Coghlanᵃ, Adam J. Reidᵃ, Nancy Holroyd, Bernardo J. Foth, Alan Tracey, James A. Cotton, Eleanor J. Stanley, Helen Beasley, Hayley Bennett, Karen Brooks, Bhavana Harsha, Rei Kajitani, Arpita Kulkarni, Dorothee Harbecke, Eiji Nagayasu, Sarah Nichol, Yoshitoshi Ogura, Michael A. Quail, Nadine Randle, Dong Xia, Norbert W. Brattig, Hanns Soblik, Diogo M. Ribeiro, Alejandro Sanchez-Flores, Tetsuya Hayashi, Takehiko Itoh, Dee R. Denver, Warwick Grant, Jonathan D. Stoltzfus, James B. Lok, Haruhiko Murayama, Jonathan Wastling, Adrian Streit, Taisei Kikuchiᵇ, Mark Vineyᵇ, Matthew Berrimanᵇ

ᵃ Equal contributors
ᵇ Corresponding authors

SUPPLEMENTARY NOTE

1. Collection of parasite material for genome assembly.
2. Collection of sex-specific material for re-sequencing to investigate chromatin diminution and X chromosomal regions.
3. Determination of sex chromosome-specific sequences in S. ratti, S. stercoralis, and P. trichosuri, and chromatin diminution regions in S. papillosus.
4. Collection of parasite material for RNA-seq and proteomic analysis.
5. Library preparation and genome sequencing.
6. Optical map of S. venezuelensis.
7. Assembly and manual improvement.
8. Repeat analysis.
9. Gene finding.
10. Functional annotation.
11. Identification of gene families, orthologs and paralogs.
12. Species tree reconstruction.
13. Analysis of intron-exon structure.
14. Synteny analysis.
15. Mitochondrial genomes.
16. Gene ontology (GO).
17. RNA preparation and RNA-seq.
18. Transcriptome differential expression analysis.
19. Identification of gene clusters.
20. Astacin-like metallopeptidase and SCP/TAPS coding genes.
21. Proteome analysis.
22. S. ratti intrachromosomal homogeneity.

Nature Genetics: doi:10.1038/ng.3495

SUPPLEMENTARY TABLES

1. Properties of the (a) genome assemblies and (b) predicted gene sets of four species of Strongyloides, Parastrongyloides trichosuri and Rhabditophanes sp. and eight outgroup species.
2. Intron size and characteristics in nematodes.
3. Gene synteny between S. ratti and three species of Strongyloides and Parastrongyloides trichosuri.
4. Chromosomal regions that undergo chromatin diminution or belong to the X chromosome.
5. The use of genetic markers to identify regions of chromatin diminution in S. papillosus.
6. Diminished and non-diminished S. papillosus genes compared to S. ratti.
7. Mitochondrial genomes of Strongyloides spp., Parastrongyloides trichosuri and Rhabditophanes sp.
8. Compara gene families of the six species and eight outgroup species.
9. Protein domain combinations for astacin-like metallopeptidases and SCP/TAPS coding genes.
10. Astacin-like metallopeptidases and SCP/TAPS.
11. Novel gene families.
12. Summary of transcriptome and proteome data.
13. Results of edgeR analysis of differential gene expression in S. ratti and S. stercoralis.
14. Orthologous genes that are upregulated in parasitic or free-living females.
15. Enriched Compara gene families.
16. Enrichment of gene ontology annotation terms among differentially expressed genes of S. ratti and S. stercoralis.
17. Results of LC-MS proteome analysis of S. ratti.
18. Comparison of the proteome and the transcriptome of S. ratti.
19. Analysis of the excretory/secretory (ES) proteome of S. ratti.
20. Clusters of physically adjacent genes upregulated in the same stage of the life cycle.
21. Astacin and SCP/TAPS coding gene clusters.
22. Results of analysis of gene clusters.
23. Genomic libraries.
24. RNA sequencing data sets.

SUPPLEMENTARY FIGURES

1. Parsimony analysis of conserved intron regions.
2. The distribution of differentially upregulated genes across the S. ratti and S. stercoralis genomes.
3. Comparison of gene and repeat distribution in S. ratti and C. elegans chromosomes.
4. The gain and loss of nematode gene families.
5. The transcriptome and proteome of S. ratti.
6. Gene clustering in S. ratti and S. stercoralis.
7. The chromosome number of Rhabditophanes sp.

SUPPLEMENTARY REFERENCES

SUPPLEMENTARY NOTE

1. Collection of parasite material for genome assembly

Strongyloides ratti

The S. ratti reference genome was assembled from genomic DNA obtained from S. ratti

isofemale line ED321 Heterogonic¹, maintained in female Wistar rats (Charles River

Laboratories Inc.). This work was performed under the authority of licences issued by the

Animals (Scientific Procedures) Act 1986.

Fecal material was collected from S. ratti-infected rats. This was cultured and maintained

at 19 °C, such that after three days infective third stage larvae (iL3s) had migrated into the

water surrounding the fecal culture, from where they were harvested following procedures

previously described¹. Samples were snap frozen in liquid nitrogen and stored at -80 °C.

Genomic DNA was extracted from pools of these flash frozen iL3s, as previously

described² or using the Promega Wizard Genomic DNA Purification Kit following the

manufacturer's instructions.

Strongyloides stercoralis

S. stercoralis isofemale line PV001³,⁴ was maintained in purpose-bred, prednisone-treated

mixed-breed male dogs, aged six months to four years, according to protocols 802593 and

804883 approved by the Institutional Animal Care and Use Committee (IACUC) of the

University of Pennsylvania, Philadelphia, USA. Both IACUC protocols, as well as routine

husbandry care of the dogs, were conducted in strict accordance with the Guide for the

Care and Use of Laboratory Animals of the National Institutes of Health.

Pure cultures of S. stercoralis were made and iL3s obtained as previously described⁴;

briefly, iL3s were isolated from charcoal coprocultures incubated for seven days at 25 °C

using the Baermann technique⁵, purified by migration through low gelling temperature

agarose⁴, concentrated by centrifugation and stored at -80 °C.

Genomic DNA was extracted from S. stercoralis iL3s. Thawed worms were subjected to

three washes in M9 buffer and centrifuged at 5,000 r.p.m. for five min at 4 °C. Genomic

DNA was extracted from washed worm pellets and contaminating RNA removed using the

DNeasy Blood and Tissue Kit (Qiagen) according to the manufacturer's instructions with

two exceptions: (i) omission of unnecessary procedures for tissue mincing and (ii)

substitution of phenol:chloroform extraction for the spin column purification step called for

in the kit. DNA pellets from the phenol:chloroform extraction were dissolved in nuclease-

free water and purified using the DNA Clean and Concentrator Kit (ZYMO Research Corp.)

following the manufacturer's protocol.

Strongyloides papillosus

S. papillosus isolate LIN6 was maintained in female New Zealand White rabbits purchased

from Charles River Laboratories Inc. as previously described⁶. This work was approved by

the Regierungspraesidium Tübingen, reference numbers 35/9185.82-2 and 35/9185.82-5.

Feces from infected rabbits were cultured at 25 °C for six days. By this time the cultures

consisted largely of iL3s. The worms were collected using Baermann funnels as previously

described⁶, washed repeatedly with tap water and transferred into a 15 mL Falcon tube.

The worms were allowed to sediment resulting in an approximately 3.5 mL worm pellet

and, after as much liquid as possible had been removed, frozen in liquid nitrogen.

The worms were thawed and frozen in liquid nitrogen three times before the DNA was

isolated using the Qiagen DNeasy tissue kit following the manufacturer's instructions for

animal tissue. The optional RNase digestion step was performed. Elution from the column was

done first with 200 μL EB and then repeated with 100 μL EB. Finally, the DNA was

precipitated by adding 30 μL of 3 M sodium acetate and 850 μL of ethanol and redissolved

in 100 μL of kit EB buffer.

Strongyloides venezuelensis

S. venezuelensis HH1 isolate, which has 100% direct, homogonic development under

standard culture conditions⁷, was used. This was maintained in the parasitology laboratory

of the University of Miyazaki by serial passage in male Wistar rats purchased from Kyudo

Co. Ltd. (Kumamoto, Japan). This work was performed in accordance with the procedures

approved by the Animal Experiment Committee of the University of Miyazaki under

approval no. 2009-506-6, as specified in the Fundamental Guidelines for Proper Conduct

of Animal Experiment and Related Activities in Academic Research Institutions under the

jurisdiction of the Ministry of Education, Culture, Sports, Science and Technology, Japan,

2006.

Fecal cultures using filter paper were maintained at 27 °C for two days, as previously

described⁷. The iL3s were cleaned by three washes in PBS and stored as a pellet at

-80 °C. DNA was isolated using the QIAamp DNA Mini Kit (Qiagen) following the

manufacturer's instructions.

Collection of parasite material for S. venezuelensis optical mapping. iL3s prepared as

described above for DNA sequencing were further purified by sucrose flotation⁸ and

immediately used for DNA extraction for optical mapping using the CHEF Mammalian Genomic

DNA Plug Kit (BioRad). To do this, approximately 2,000 live iL3s were mixed with 1 mL of

pre-heated (50 °C) 0.75% (w/v) low melting point agarose in cell suspension buffer, and

transferred into agarose plug moulds. After solidification at 4 °C the plugs were incubated

in 2 mg/mL proteinase K and 1 mg/mL dithiothreitol for two days at 50 °C without agitation.

Deproteinized DNA-containing agarose plugs were washed in buffer at room temperature

for 1 h and stored in 0.5 M EDTA at 4 °C until use.

Parastrongyloides trichosuri

P. trichosuri was maintained as continuous free-living cultures⁹ as described previously¹⁰.

These cultures were initiated by pooling several grams of feces collected from five

naturally infected, wild caught possums (Trichosurus vulpecula) from the central North

Island of New Zealand. This work was carried out with approval from the La Trobe

University Animal Ethics Committee under approval AEC11-48, and the Wallaceville

Animal Ethics Committee, Wallaceville Animal Research Centre, Upper Hutt, New

Zealand.

Mixed-stage free-living males and females were washed from the surface of the agar

plate in C. elegans M9 buffer, collected by centrifugation at

200 g, and washed several times in M9 by centrifugation. The worms were then pipetted

[…] c. 20 mL M9 in a 90

mm diameter petri dish and allowed to migrate through the mesh overnight at c. 20 °C to

separate living and dead worms and to remove any remaining debris. The living worms

were collected by transferring the liquid to a centrifuge tube, which was left on ice for c. 60

min. As much of the supernatant as possible was removed and the resultant pellet snap

frozen in liquid nitrogen and stored at -80 °C until used. DNA was prepared from a 0.5 mL

packed volume of worms using a Qiagen DNeasy Blood and Tissue genomic DNA

extraction kit following the manufacturer's instructions for cells and tissues. The yield was

measured by fluorimetry in a Qubit instrument and assessed for purity and size by agarose

gel electrophoresis through a 0.8% (w/v) agarose gel in TAE buffer.

Rhabditophanes sp.

Rhabditophanes sp. KR3021, originally isolated by Ann M. Rose near Vancouver, British

Columbia, is a parthenogenetic nematode that grows in the laboratory under conditions

similar to those used for C. elegans. It was grown at 20 °C on NGM agar plates¹¹ seeded

with Escherichia coli OP50 as a food source.

Approximately 100,000 KR3021 nematodes were harvested from agar plates and washed

five times in M9 buffer. The nematodes were then subjected to five freeze/thaw cycles

(alternating between -80 °C and room temperature) to rupture cuticles for subsequent

DNA extraction. Genomic DNA was prepared using a Qiagen DNeasy tissue miniprep kit,

following the manufacturer's protocol. Genomic DNA was then analyzed by 0.8% (w/v)

agarose gel electrophoresis and NanoDrop to assess DNA quality and quantity.

2. Collection of sex-specific material for re-sequencing to investigate chromatin

diminution and X chromosomal regions

S. ratti and S. papillosus adult males and females

We separately collected parasite material that was genetically female and male in the

following way.

Females (iL3s): Feces from S. ratti-infected rats were cultured at 19 °C for six days, and

feces from S. papillosus-infected rabbits were cultured at 25 °C for six days. By this time

the cultures consisted largely of iL3s, which are all females. The iL3s were collected using

Baermann funnels as previously described⁶. The few adults in the samples were removed

manually. The iL3s were washed repeatedly with tap water and transferred into a 1.5 mL

microcentrifuge tube. The worms were allowed to sediment and, after as much liquid as

possible had been removed, frozen in liquid nitrogen.

Males: Feces from S. ratti-infected rats were cultured at 19 °C for three days, and feces

from S. papillosus-infected rabbits were cultured at 25 °C for two days. By this time the

cultures contained a large proportion of free-living adult males and females. The worms

were collected using Baermann funnels as described⁶ and washed repeatedly with tap

water. Males were isolated with a pipette and transferred into a 1.5 mL microcentrifuge

tube. The worms were allowed to sediment and, after as much liquid as possible had been

removed, frozen in liquid nitrogen.

S. ratti young larvae and sex ratio

Feces from infected rats were collected overnight. The worms were isolated using the

Baermann technique and washed multiple times with tap water. A sample of approximately

150 worms was transferred onto an NGM plate and allowed to develop to adulthood in

order to determine the sex ratio. The remaining worms were allowed to sediment. As much

liquid as possible was removed and the tubes frozen in liquid nitrogen.

S. stercoralis

Free-living males and females were isolated from charcoal coprocultures incubated at

22 °C for 48 h using the Baermann technique as previously described⁵. The worms

recovered from the funnels were sedimented by gravity, washed twice in deionized water,

resuspended in M9 buffer and placed onto NGM agar plates¹¹. Free-living males and

females were selected by manual pipetting, placed into separate aliquots of M9 buffer and

DNA was extracted immediately.

P. trichosuri adult males and females

The worms were washed from the plates with M9, concentrated using Baermann funnels

and washed multiple times with M9. Adult males and females were isolated with a pipette

and transferred into a 1.5 mL microcentrifuge tube in M9 buffer. The worms were allowed

to sediment and, after as much liquid as possible had been removed, frozen in liquid

nitrogen.

P. trichosuri young larvae and sex ratio

The worms were washed from the plates with tap water, concentrated using the Baermann

technique and washed multiple times with tap water. Young larvae were enriched by

collecting the slower sedimenting fraction. The remaining adults and later-stage larvae were

removed manually. A sample of approximately 150 worms was transferred onto NGM

plates and allowed to develop to adulthood in order to determine the sex ratio. The

remaining worms were allowed to sediment. As much liquid as possible was removed and

the tubes frozen in liquid nitrogen.

Genomic DNA from S. ratti, S. papillosus and P. trichosuri male and female samples was

prepared by lysing in TEN (20 mM Tris pH 7.5, 50 mM EDTA, 100 mM NaCl) buffer with

1% (w/v) SDS and 200 μg proteinase K at 55 °C for 4 h. The resulting lysate was treated

with RNase A and protein precipitation solution (Promega), and the DNA was precipitated

in 0.1× volume of 3 M sodium acetate and 3× volume of ethanol with glycogen as a

co-precipitant. The pellet was resuspended in EB (Qiagen) and quantified on Qubit.
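
The precipitation ratios above translate into reagent volumes by simple arithmetic. The following is a minimal sketch only: the function name and the 200 μL lysate volume in the usage line are illustrative assumptions, and both ratios are taken relative to the starting sample volume, which the protocol does not state explicitly.

```python
def precipitation_volumes(sample_ul: float) -> dict:
    """Reagent volumes (in microlitres) for ethanol precipitation of DNA.

    Ratios follow the text: 0.1 x sample volume of 3 M sodium acetate
    and 3 x sample volume of ethanol. Assumption: both ratios refer to
    the starting sample volume, before any reagent is added.
    """
    if sample_ul <= 0:
        raise ValueError("sample volume must be positive")
    acetate_ul = 0.1 * sample_ul
    ethanol_ul = 3.0 * sample_ul
    return {
        "sodium_acetate_ul": acetate_ul,
        "ethanol_ul": ethanol_ul,
        "total_ul": sample_ul + acetate_ul + ethanol_ul,
    }


if __name__ == "__main__":
    # Hypothetical 200 uL lysate, chosen only for illustration.
    print(precipitation_volumes(200.0))
```

The total volume is returned so that the tube size can be checked before the reagents are added.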

Genomic DNA was prepared from S. stercoralis free-living males and females as

described above for S. stercoralis genomic DNA preparation for genomic sequencing.

3. Determination of sex chromosome-specific sequences in S. ratti, S. stercoralis,

and P. trichosuri, and chromatin diminution regions in S. papillosus

Genetic markers

We assigned 37 S. papillosus genetic markers, known to lie either inside or outside

regions of chromatin diminution¹⁰,¹², to their unique locations in

the S. papillosus genome assembly (Supplementary Table 5). Using Gap5¹³, assembly

errors within each marker-containing scaffold were manually resolved where possible.

Manual inspection was also used to link additional contigs into marker-containing

scaffolds. This identified six scaffolds that belong to the eliminated part of the S. papillosus

X-I fusion chromosome, and four scaffolds that belong to non-eliminated parts of that

chromosome.

Re-sequencing males and females.

To obtain a fine-scale map of the eliminated regions on the S. papillosus X-I fusion

chromosome, we re-sequenced DNA from iL3s (female) and males. The re-sequencing

reads were mapped to the reference genome using SMALT v0.7.4 (Hannes Ponstingl,

pers. comm.) (using indexing options -k 13 -s 2 and mapping options -y 0.75 -x -r 0 -i

1000). Scaffolds smaller than 8 kb were discarded due to excessive noise in read-depth

estimates. For all other scaffolds, the median read depth was calculated for 10 kb windows

using the BEDTools genomecov function¹⁴. A scaffold-specific read depth was then defined as

the median value of the median read depths for its 10 kb windows. Such scaffold-specific

read depths were calculated separately for mapped sequencing reads derived from male

and female samples.
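
The scaffold-specific read depth described above (a median of per-window medians) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the analysis: it assumes the per-base depths of one scaffold are already available as a list (for example, parsed from per-position genomecov output), and it treats a trailing window shorter than 10 kb like any other window, which the text does not specify. The function name is hypothetical.

```python
from statistics import median

WINDOW_BP = 10_000        # window size, from the text
MIN_SCAFFOLD_BP = 8_000   # shorter scaffolds are discarded, from the text


def scaffold_read_depth(per_base_depth, window=WINDOW_BP,
                        min_len=MIN_SCAFFOLD_BP):
    """Return the median of per-window median depths for one scaffold.

    Returns None for scaffolds shorter than `min_len`, mirroring the
    exclusion of short scaffolds. The last, possibly partial, window is
    kept (an assumption).
    """
    n = len(per_base_depth)
    if n < min_len:
        return None
    window_medians = [
        median(per_base_depth[start:start + window])
        for start in range(0, n, window)
    ]
    return median(window_medians)


if __name__ == "__main__":
    # Toy scaffold: uniform depth 30 with a 2 kb spike (e.g. a collapsed repeat).
    depths = [30] * 12_000 + [500] * 2_000 + [30] * 11_000
    print(scaffold_read_depth(depths))
```

Taking medians at both levels keeps localized depth spikes from shifting the scaffold-level estimate.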

In a first round of analysis, a scaffold was then preliminarily classified as autosomal if,

using data from males and females independently, its scaffold-specific median read depth

was 0.75 – 1.25 times that of all scaffolds taken together.

A more accurate classification was then carried out in a second round of analysis in which

the autosomal median read depth was defined as the median read depth across all

scaffolds greater than 100 kb in length, that had been preliminarily classified as autosomal

in the first round of analysis. A scaffold was then conclusively classified as autosomal if,

using data from males and females independently, the scaffold-specific read depth was

0.75 – 1.25 times the autosomal median read depth.

However, a scaffold was classified as sex-chromosomal or as having undergone

diminution if its read depth was 0.75 – 1.25 times the autosomal median read depth based

on female-derived data but 1.5 – 2.5 times the median using male-derived data.

All scaffolds classified as neither autosomal nor sex-chromosomal / diminished

were classified as undetermined.

The classification of scaffolds based on re-sequencing data agreed with six scaffolds

identified as eliminated and four as non-eliminated based on genetic markers (see Genetic

markers above), validating the read depth approach.
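
The two-round classification described in this section can be sketched as below. This is a hedged illustration, not the code used for the analysis: the input layout (a dictionary of scaffold length, female depth and male depth), the function names and the toy numbers are assumptions. The ratio windows are copied as printed in the text, and the male-data window is applied to scaffold depth divided by the autosomal median, which is one reading of the sentence rather than a confirmed detail.

```python
from statistics import median

AUTOSOMAL_WINDOW = (0.75, 1.25)   # from the text
MALE_SEX_WINDOW = (1.5, 2.5)      # male-data window, as printed in the text
LARGE_SCAFFOLD_BP = 100_000       # round-2 baseline uses scaffolds above this


def in_window(value, window):
    low, high = window
    return low <= value <= high


def classify_scaffolds(scaffolds):
    """Classify scaffolds from female and male scaffold-specific depths.

    `scaffolds` maps name -> (length_bp, female_depth, male_depth).
    Returns name -> 'autosomal', 'sex-chromosomal/diminished' or
    'undetermined'.
    """
    # Round 1: baseline is the median depth over all scaffolds, per sex.
    f_all = median([f for _, f, _ in scaffolds.values()])
    m_all = median([m for _, _, m in scaffolds.values()])
    preliminary = [
        name for name, (_, f, m) in scaffolds.items()
        if in_window(f / f_all, AUTOSOMAL_WINDOW)
        and in_window(m / m_all, AUTOSOMAL_WINDOW)
    ]

    # Round 2: baseline is the median over large, preliminarily
    # autosomal scaffolds only.
    large = [n for n in preliminary if scaffolds[n][0] > LARGE_SCAFFOLD_BP]
    if not large:
        raise ValueError("no large autosomal scaffolds to define a baseline")
    f_auto = median([scaffolds[n][1] for n in large])
    m_auto = median([scaffolds[n][2] for n in large])

    labels = {}
    for name, (_, f, m) in scaffolds.items():
        f_ok = in_window(f / f_auto, AUTOSOMAL_WINDOW)
        m_ratio = m / m_auto
        if f_ok and in_window(m_ratio, AUTOSOMAL_WINDOW):
            labels[name] = "autosomal"
        elif f_ok and in_window(m_ratio, MALE_SEX_WINDOW):
            labels[name] = "sex-chromosomal/diminished"
        else:
            labels[name] = "undetermined"
    return labels


if __name__ == "__main__":
    # Toy depths for illustration only; not data from the study.
    toy = {
        "a1": (200_000, 40, 40),
        "a2": (150_000, 42, 38),
        "a3": (120_000, 38, 41),
        "x1": (90_000, 40, 80),
        "u1": (50_000, 10, 10),
    }
    for name, label in classify_scaffolds(toy).items():
        print(name, label)
```

Keeping the windows as module-level constants makes it easy to re-run the classification with a different male-data window if the ratio is meant to be read the other way round.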

4. Collection of parasite material for RNA-seq and proteomic analysis

S. ratti

Free-living adult females. Feces were collected from S. ratti-infected rats and cultured at

25 °C. After two days adult worms in the liquid surrounding the fecal pellets were collected

and individual male and female worms were removed with a pipette. Worms were placed

in a watchglass and washed with M9 buffer at least twice to remove contaminating fecal

material. Aliquots of approximately 500 nematodes were stored as a pellet of worms (for

proteomic analysis) or in 200 µL TRI reagent (for RNA-seq analysis) and snap frozen in

liquid nitrogen. Samples were stored at -80 °C.

Parasitic adult females. Parasitic adult females were harvested from the small intestine of

sacrificed S. ratti-infected rats at 6 days post infection (d.p.i.), and cleaned using a Percoll

gradient as previously described¹⁵. Aliquots of approximately 250 nematodes were stored

as a pellet of worms (for proteomic analysis) or in 200 µL TRI reagent (for RNA-seq

analysis) and snap frozen in liquid nitrogen. Samples were stored at -80 °C.

iL3s. Fecal material was collected from S. ratti-infected rats. This was cultured and

maintained at 19 °C, such that after three days iL3s had migrated to the water surrounding

the fecal culture, from where they were harvested following procedures previously

described¹. Samples containing c. 15,000 iL3s were stored in 200 µL TRI reagent and snap

frozen in liquid nitrogen. Samples were stored at -80 °C.

Experimental design. We used the same dissected rats (for parasitic females) and the

same fecal cultures (for free-living females) for both the RNA-seq and proteomic material,

such that any material collected was divided approximately in half and used for these two

analyses. In this way, any experimental noise resulting from collecting samples at

different times and from different host rats was minimized. Two biological

replicates were collected and processed for parasitic females and for free-living females. A

single biological sample of iL3 nematodes was used (a 2 μg aliquot from 19 μg RNA

extracted from a sample of 66,000 iL3s) for RNA-seq analysis.

S. stercoralis

Previously published transcriptome data for S. stercoralis were used⁴.

S. venezuelensis

Eggs and L1s. Feces were collected from S. venezuelensis-infected rats, and eggs were separated

using a saturated salt solution flotation method¹⁶; the eggs were washed extensively with

water and stored at -80 °C prior to RNA extraction. First stage larvae (L1s) were prepared

from eggs (collected as described above) by incubating them for 24 h in PBS at 27 °C.

Hatched L1s were sedimented by centrifugation.

iL3s were collected as described above for generating the genome sequencing material,

with the exception that they were stored in 250 μL of TRI reagent at -80 °C prior to RNA

extraction.

Lung-iL3 and induced-iL3s. In vivo tissue-migrating third stage larvae (lung-iL3) were

collected from the lungs of sacrificed S. venezuelensis-infected ICR mice 72 h after

infection with 30,000 S. venezuelensis larvae. In vitro tissue-migrating third stage larvae

(induced-iL3s) were generated by incubating iL3 nematodes (obtained from fecal cultures)

in DMEM (4.5 g/L D-glucose, L-glutamine, Life Technologies) supplemented with

antibiotics (0.25 mg/mL gentamicin, Life Technologies) at 37 °C in a 5% (v/v) carbon

dioxide atmosphere for 24 h. The larvae were collected in 2 mL tubes, mixed with 250 μL

of TRI reagent and stored at -80 °C.

Gravid parasitic females and young parasitic adult females. Gravid parasitic females were

collected from the intestines of S. venezuelensis-infected rats at 7 d.p.i. Young parasitic

adult females were collected in the same way but from rats 80 h post infection. All

collected worms were washed three times in PBS and stored at -80 °C.

S. venezuelensis RNA-seq data (Supplementary Table 24) were used as evidence for gene

prediction and for training gene finding software (see Supplementary Note 9, below).

5. Library preparation and genome sequencing

Illumina sequencing of S. stercoralis, S. papillosus, P. trichosuri and

Rhabditophanes

PCR-free 200 – 400 bp paired-end Illumina libraries were prepared from 40 – 1,000 ng

genomic DNA as previously described¹⁷ except that Agencourt AMPure XP beads were

used for sample clean up and size selection. DNA was precipitated onto the beads after

each enzymatic stage with a 20% (w/v) polyethylene glycol 6000 and 2.5 M sodium

chloride solution, and beads were not separated from the sample throughout the process

until after the adapter ligation stage. Fresh beads were then used for size selection.

Between 1,500 and 5,000 ng of genomic DNA from S. stercoralis, S. papillosus, P.

trichosuri and Rhabditophanes was used to generate 3 kb mate-pair libraries using a

modified SOLiD 5500 protocol adapted for Illumina sequencing¹⁸. Library details are in

Supplementary Table 23.

Libraries were sequenced on the Illumina Genome Analyzer IIx or HiSeq 2000 for 76 or

100 cycles using the TruSeq PE Cluster kit v4 and the TruSeq SBS Kit v5, according to

the manufacturer's recommended protocol (icom.illumina.com). Data from

the Illumina HiSeq sequencing machines were analyzed using the RTA1.8 analysis pipeline.

Illumina sequencing of S. venezuelensis

A paired-end sequencing library (500 bp) was prepared using the TruSeq DNA Sample

Prep kit (Illumina). Two mate-paired libraries (3.0 and 5.0 kb) were constructed using the

SOLiD Mate-Paired Library Construction kit (Applied Biosystems). In the final step of the

mate-paired library construction, Illumina adapters for the sequencing library were used

instead of SOLiD adapters. Each sample was purified using the Agencourt AMPure XP kit

(Beckman Coulter) and target DNA fragments for each sample were extracted from the

agarose gel. All libraries were sequenced on Illumina HiSeq 2000 sequencers using

the Illumina TruSeq PE Cluster kit v3 and TruSeq SBS kit v3 (2 × 101 cycles). The raw

sequence data were analyzed using the RTA 1.12.4.2 analysis pipeline and were used for

genome assembly after removal of adapter, low quality, and duplicate reads.

Sanger sequencing of S. ratti

We produced 436,244 Sanger sequencing reads from plasmids (p0TW12, pUC19 and

pMAQ1Sac_BstXI) containing fragments of S. ratti genomic DNA (Supplementary Table

23). The libraries were cultured and DNA extraction performed in 96- and 384-well

formats, respectively. Library end-sequencing was performed using ABI BigDye version

3.1 and standard primers, and analyzed on an ABI 3730 Capillary DNA Analyser

(Supplementary Table 23).

454 sequencing of S. ratti

Between 1,000 and 7,500 ng of S. ratti genomic DNA was used to produce paired-end (3 kb

and 8 kb) and shotgun 454 libraries (Supplementary Table 23) using standard Roche

protocols and sequenced using the 454 Life Sciences GS-20 and GS-FLX sequencers

(Roche).

6. Optical map of S. venezuelensis

High molecular weight genomic DNA (see Supplementary Note 1) was mapped using

the Argus Optical Mapping System (OpGen) by stretching and immobilizing it within

microfluidic channels before digestion with the restriction endonuclease SpeI, to yield a set

of restriction fragments ordered according to their positions within the genome. The

fragments were fluorescently stained and visualized to determine fragment sizes.

Assembling overlapping fragment patterns of single molecule restriction maps produced

an optical map of the genome, which was used to improve the genome assembly of

Illumina data (see Supplementary Note 7).
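
An optical map is an ordered list of restriction fragment sizes, so sequence scaffolds can be compared with it by digesting them in silico with the same enzyme. The sketch below illustrates that idea only; it is not the OpGen software or the procedure used here, and the function name is hypothetical. It relies on the SpeI recognition sequence ACTAGT, cut after the first base (A^CTAGT), and because the site is palindromic only one strand needs to be scanned.

```python
SPEI_SITE = "ACTAGT"   # SpeI recognition sequence
SPEI_CUT_OFFSET = 1    # SpeI cuts after the first base: A^CTAGT


def digest_fragment_sizes(sequence, site=SPEI_SITE,
                          cut_offset=SPEI_CUT_OFFSET):
    """Return ordered fragment sizes (bp) for a complete digest.

    The sequence is treated as linear; fragments are listed in their
    order along the sequence, as restriction fragments are in an
    optical map.
    """
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    boundaries = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    # Toy sequence with two SpeI sites, for illustration only.
    toy = "GG" + "ACTAGT" + "CCCC" + "ACTAGT" + "T"
    print(digest_fragment_sizes(toy))
```

The resulting size pattern for a scaffold is what would be aligned against the fragment pattern of the optical map; real comparisons also have to tolerate sizing error and missed or extra cuts, which this sketch ignores.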

7. Assembly and manual improvement

Genome assembly statistics are detailed in Supplementary Table 1.

S. ratti

The initial assembly of the S. ratti genome utilized Sanger capillary, 454 and Illumina

sequence data (Supplementary Table 23) and was screened against host and other

sequences. Capillary and 454 reads were assembled together using Newbler v2.3¹⁹.

Illumina reads were assembled with ABySS v1.3.1²⁰. The scaffolds and contigs from the

Newbler and ABySS assemblers were manually merged using Gap4²¹ from alignments

identified with nucmer²². Illumina reads were used to close gaps²³ and correct consensus

sequence errors²⁴. This resulted in S. ratti genome assembly v4.

Illumina paired-end reads were mapped to S. ratti genome assembly v4 with SMALT

v0.6.2 (H. Ponstingl and Z. Ning). Using Velvet 25, a de novo assembly was performed

using only the „bin‟ of paired-end reads that did not map to the main assembly. Contigs

were manually linked into scaffolds and 454 read-pair information and data from the „bin‟

assembly were used to correct assembly errors. REAPR26 was also used to identify further

assembly errors for manual correction. Genetic markers27 mapped to the assembly

enabled the genome assembly to be organized into three chromosomal scaffold groups,

producing S. ratti genome assembly v5.

S. stercoralis, S. papillosus, P. trichosuri, Rhabditophanes sp.

The genome assemblies for S. stercoralis, S. papillosus, P. trichosuri and Rhabditophanes sp. were assembled from short paired-end and 3 kb mate-pair Illumina sequence data. For each species, short fragment reads were first corrected and assembled using the SGA assembler²⁸ (v0.9.7). This SGA assembly was only used to calculate the distribution of k-mers for k = 41-71 using GenomeTools²⁹ v.1.3.7 and was not used subsequently. The corrected Illumina reads and the most frequently occurring k-mer length from the SGA assembler were used to generate a second assembly using Velvet²⁵. Using SSPACE³⁰, long insert mate-pair reads were used to produce larger scaffolds from the Velvet assembly. IMAGE²³ and Gapfiller³¹ were used to close gaps and extend contigs. The short fragment reads were remapped to the assemblies using SMALT (H. Ponstingl, pers. comm.), and a 'bin' assembly was generated using Velvet²⁵, as described above, and incorporated into the main assembly. Scaffolds were broken into contigs and re-scaffolded using SSPACE. iCORN²⁴ was used to increase the consensus base quality of the assembly and REAPR was used to detect misassemblies and break apart scaffolds and contigs where necessary. The resulting assemblies were termed v1.

Scaffolds were extended, linked and, where possible, errors detected by REAPR were

manually corrected using Gap5¹³. Automated gap closure was undertaken using IMAGE

and Gapfiller, and the accuracy of the consensus sequence was improved using iCORN to

produce v2 genome assemblies (Supplementary Table 1). The v2 genome assemblies for

S. stercoralis, P. trichosuri and Rhabditophanes sp. were then subjected to contamination

scans (see section Contamination scan below) and used for gene finding and subsequent

analysis.

Genetic markers¹² were mapped to the S. papillosus v2 assembly to produce v2.1, which

was scanned for contamination and used for gene finding.

S. venezuelensis

Because (i) paired-end and mate-pair Illumina sequence reads of various insert sizes were available (Supplementary Table 23), and (ii) the S. venezuelensis material used was genetically diverse, the Platanus assembler³² was used to produce genome assembly v1. During manual improvement of S. venezuelensis genome assembly v1, carried out as described above for S. stercoralis, S. papillosus, P. trichosuri and Rhabditophanes, HaploMerger³³ was run and contigs for which the depth of re-mapped paired reads was less than 55x (median coverage was 110x) were removed. This resulted in 7 Mb of haplotypic sequences being removed from the assembly with just a 0.5% decrease in read mapping. The resulting v2 S. venezuelensis assembly was used for gene finding after undergoing contamination scans.
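The depth-based filter described above can be sketched as follows. The 55x threshold and the 110x median come from the text; the contig names and depth values are invented for illustration, and this is not HaploMerger itself.

```python
MEDIAN_DEPTH = 110.0            # median coverage reported in the text
MIN_DEPTH = MEDIAN_DEPTH / 2    # 55x threshold reported in the text


def filter_by_depth(depth_by_contig, min_depth=MIN_DEPTH):
    """Split contigs into those kept (depth >= threshold) and those removed."""
    kept = {name: d for name, d in depth_by_contig.items() if d >= min_depth}
    removed = sorted(set(depth_by_contig) - set(kept))
    return kept, removed


# Hypothetical per-contig mean depths of re-mapped paired reads.
toy_depths = {"contig_a": 112.0, "contig_b": 54.0, "contig_c": 55.0, "contig_d": 48.5}
kept, removed = filter_by_depth(toy_depths)
```

Contigs at roughly half the median depth are candidates for uncollapsed haplotypes, which is why a half-median cut-off is a natural choice here.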

S. venezuelensis genome assembly v2 was further improved by optical mapping (see

above). The S. venezuelensis optical map consisted of 17 contigs, an assembled size of

83.2 Mb and approximately 90x genome coverage of optical data. Aligning sequence

scaffolds over 80 kb to the optical map generated approximately 39.56 Mb of alignment.

The optical map data were used to order and orientate sequence scaffolds, to measure

the size of sequence gaps, and independently validate the sequence assembly, producing

genome assembly v3, which was used to investigate synteny with S. ratti.

Contamination scan

For some of the species sequenced, the reads were contaminated with reads from other

species, either arising from DNA of the host species or other species that are commensal

in the host, or (less likely) from laboratory contamination. To remove contaminant scaffolds

from the assemblies, we took a multi-step approach:

Each scaffold was first split into 50 kb sections and, using BLASTX³⁴, searched against a

database of all invertebrate proteins from GenBank and against a database comprising

proteins from representative species from the major non-invertebrate taxa (from bacteria,

vertebrates, fungi, plants, etc.). For a particular section, if the E-value for its top non-

invertebrate alignment was 1e10 times lower than the E-value for its top invertebrate

alignment, the section was considered to be a contaminant. If more than half of the

sections of a scaffold were classified as contaminant, the whole scaffold was considered a

contaminant and was removed from the assembly.
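The decision rule of this first step can be sketched as below, assuming the best E-values per section have already been extracted from the BLASTX output. The function names are hypothetical, and the handling of sections with a hit in only one database is an assumption that the text does not specify.

```python
def split_sections(length, size=50_000):
    """Return (start, end) coordinates of consecutive sections of a scaffold."""
    return [(s, min(s + size, length)) for s in range(0, length, size)]


def section_is_contaminant(noninv_e, inv_e, fold=1e10):
    """True if the best non-invertebrate E-value is `fold` times lower."""
    if noninv_e is None:
        return False
    if inv_e is None:
        return True  # assumption: only a non-invertebrate hit exists
    return noninv_e < inv_e and noninv_e * fold <= inv_e


def scaffold_is_contaminant(section_hits, fold=1e10):
    """section_hits: list of (best non-invertebrate E, best invertebrate E)."""
    flags = [section_is_contaminant(n, i, fold) for n, i in section_hits]
    return sum(flags) > len(flags) / 2  # "more than half" of the sections


# Toy E-values for a three-section scaffold (illustrative only).
toy_hits = [(1e-40, 1e-5), (1e-30, 1e-25), (1e-60, None)]
```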

Next, the computational translations of predicted genes on the non-contaminant scaffolds

(from step one above) were searched, using BLASTP, against the two databases used in

step one (non-invertebrate and invertebrate proteins). For each protein, if its top BLASTP

alignment was to a non-invertebrate protein, and had an E-value that was 1e50 times lower

than that of the best invertebrate alignment, then the gene was considered a putative

contaminant. Conversely, if the top alignment was to an invertebrate protein, and its E-

value was 1e50 times lower than that of the best non-invertebrate alignment, the gene was

classified as non-contaminant. If more than half of the classified genes on a scaffold were

considered contaminants, then the scaffold was classified as a contaminant and removed

from the assembly.
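A minimal sketch of the gene-level rule in this second step follows. It assumes the best BLASTP E-value against each database is already known for every predicted protein; treating a missing hit as an infinitely large E-value is an assumption, and the function names are hypothetical.

```python
INF = float("inf")


def classify_gene(noninv_e, inv_e, fold=1e50):
    """Return 'contaminant', 'non-contaminant' or None (unclassified)."""
    n = INF if noninv_e is None else noninv_e  # assumption: no hit = infinite E
    i = INF if inv_e is None else inv_e
    if n < i and n * fold <= i:
        return "contaminant"
    if i < n and i * fold <= n:
        return "non-contaminant"
    return None


def scaffold_is_contaminant_step2(gene_hits, fold=1e50):
    """True if more than half of the classified genes are contaminants."""
    calls = [classify_gene(n, i, fold) for n, i in gene_hits]
    classified = [c for c in calls if c is not None]
    if not classified:
        return False
    return classified.count("contaminant") > len(classified) / 2


# Toy (non-invertebrate E, invertebrate E) pairs for four genes on one scaffold.
toy_genes = [(1e-120, 1e-20), (1e-100, 1e-90), (1e-10, 1e-80), (1e-150, None)]
```

Genes whose two E-values differ by less than the 1e50 factor stay unclassified and do not count towards the majority vote.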

The third step was a slightly more stringent version of the second step, designed to

remove contamination originating from other invertebrates, as well as any residual

contamination from non-invertebrates (e.g. bacteria) not removed by the first two steps. All nematode and flatworm protein sequences from GenBank were downloaded. Taking all

the predicted proteins encoded on scaffolds remaining after step two, we ran BLASTP

against a database consisting of the non-invertebrate proteins (from step one) plus the

flatworm sequences. We also ran BLASTP against the nematode sequences. For each

query gene, we looked at its top five BLASTP alignments in the flatworm/non-invertebrate

database and in the nematode database. If the top five of these ten BLAST alignments

were to flatworm/non-invertebrate, and the E-value of the worst flatworm/non-invertebrate

alignment was at least 1e5 times lower than the E-value of the best nematode alignment,

we considered it to be a contaminant gene. Conversely, if the top five of the ten

alignments were to nematode, and the E-value of the worst nematode alignment was 1e5

times lower than that of the best flatworm/non-invertebrate alignment, it was considered to

be a non-contaminant gene. If a scaffold had one or more contaminant genes, and no non-

contaminant genes, it was considered to be a contaminant scaffold and was removed.
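The third step can be sketched in the same way. The sketch assumes that E-values of the top alignments in each database are available as lists, and it requires five alignments in the winning database, which is one reading of the "top five of ten" criterion rather than something the text states explicitly. Function names are hypothetical.

```python
INF = float("inf")


def classify_gene_step3(other_evalues, nematode_evalues, fold=1e5):
    """other_evalues: hits in the flatworm/non-invertebrate database."""
    other = sorted(other_evalues)[:5]
    nema = sorted(nematode_evalues)[:5]
    best_other = other[0] if other else INF
    best_nema = nema[0] if nema else INF
    # If the worst of five hits beats the other database's best hit by the
    # fold factor, those five are necessarily the top five of the ten.
    if len(other) == 5 and other[-1] < best_nema and other[-1] * fold <= best_nema:
        return "contaminant"
    if len(nema) == 5 and nema[-1] < best_other and nema[-1] * fold <= best_other:
        return "non-contaminant"
    return None


def scaffold_is_contaminant_step3(gene_calls):
    """Contaminant if it has contaminant genes and no non-contaminant genes."""
    return "contaminant" in gene_calls and "non-contaminant" not in gene_calls
```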

Assembly completeness

To assess the completeness of the assemblies we used CEGMA v2³⁵ on each of the

assemblies (Supplementary Table 1). CEGMA reports the percentage of 248 highly

conserved eukaryotic gene families that are present as full or partial genes in the

assembly. For most eukaryotes, 100% (or nearly 100%) of CEGMA families would be expected to be

represented by a full gene in the genome. Thus, CEGMA provides a measure of the

completeness of the assembly for a species.

8. Repeat analysis

Repeats within the assemblies of the six species were identified using the combined

outputs of RepeatModeler and TransposonPSI. For each species, UCLUST³⁶ was used to

cluster repeat sequences from RepeatModeler and TransposonPSI that had ≥80% identity,

to generate consensus sequences for a non-redundant repeat library. RepeatMasker

(v.3.2.8) was then run (using the slow search option) with a custom repeat library for each

species, to calculate the distribution of each repeat and its abundance in the genome.

9. Gene finding

Training Augustus.

To predict protein-coding genes, Augustus 2.6.1³⁷ was first trained for each species based on a training set of 197-423 non-overlapping, non-homologous and manually curated genes (248, S. ratti; 385, S. stercoralis; 197, S. papillosus; 423, S. venezuelensis; 203, P. trichosuri; 233, Rhabditophanes). The initial gene predictions that were used to curate the training set were predicted by CEGMA³⁵, by exonerate³⁸ using aligned ESTs (see Creating hints for Augustus based on EST and protein alignments), and by using RATT³⁹ to project curated S. ratti gene models on to the genomes of the other species. A selection of gene models was curated in Artemis⁴⁰ using aligned RNA-seq data (for S. ratti, S. stercoralis and S. venezuelensis; Supplementary Table 24 and previously published data⁴) and BLAST³⁴ matches against the NCBI database.

Creating hints for Augustus based on RNA-seq data.

'Hints' to guide gene prediction by Augustus were generated from aligned RNA-seq data, EST data, and S. ratti predicted proteins, as previously described⁴¹. For S. ratti, S. stercoralis and S. venezuelensis, 'intron' and 'exonpart' hints were made from mapped RNA-seq reads: four S. ratti life stages, seven S. stercoralis life stages, and 13 S. venezuelensis life stages (Supplementary Table 24). RNA-seq reads were mapped to the genomes using TopHat2⁴² (parameters: -a 6 -i 10 -I 20000 --microexon-search --min-segment-intron 10 --max-segment-intron 20000). Based on the TopHat2 alignments, the bam2hints program (part of the Augustus package) was used to create the intron hints, with minimum length set to 15 bp. The mapped RNA-seq reads were also assembled into transcripts using Cufflinks⁴³, again with minimum intron length of 15 bp. A different transcript assembly was made for each sample, and the transcript assemblies were then merged using cuffmerge⁴³. The predicted exons in the resultant set of transcripts were used as the exonpart hints.

Creating hints for Augustus based on EST and protein alignments.

For S. papillosus, S. stercoralis and P. trichosuri we made 'exonpart' and 'intron' hints based on EST alignments, while for S. stercoralis, S. papillosus, P. trichosuri, and Rhabditophanes, we also made 'CDSpart' and 'intron' hints based on S. ratti protein alignments. For P. trichosuri, the EST data consisted of 3146 NEMBASE⁴⁴ clusters and 7963 ESTs from the NCBI EST database⁴⁵; and for S. stercoralis 3708 EST clusters from NEMBASE and 11,392 ESTs from NCBI. For S. papillosus there were 83 ESTs from NCBI, and 120,798 ESTs from iL3s and free-living adults produced by The Genome Institute, Washington University, St. Louis, U.S.A.¹².

For each EST or S. ratti protein we found the top BLAST³⁴ alignment in the genome using an E-value cut-off of 0.05. We then used exonerate³⁸, with --model coding2genome --bestn 1 for ESTs and --model protein2genome --bestn 1 for proteins, to make a gene prediction by aligning the EST or protein to a region comprising the BLAST hit plus 25,000 bp on either side. Introns of 15-350,000 bp predicted by exonerate were used as intron hints for Augustus. In addition, any exon features predicted using ESTs were used as exonpart hints, and CDS features predicted using proteins as CDSpart hints.

Running Augustus.

The species-specific trained versions of Augustus were run using all the hints for that species as input. Introns starting with 'AT' and ending with 'AC' were allowed (--allow_hinted_splicesites=atac). A weight of 10⁵ was given to intron and exonpart hints from RNA-seq and EST alignments, and 10³ to intron and CDSpart hints from alignments of S. ratti proteins. The minimum intron length was set to 15 bp. Augustus was also run without hints, and a combined gene set made by taking genes predicted using hints and any non-overlapping genes predicted without hints. If Augustus predicted multiple, alternatively spliced transcripts for a gene, we only kept the transcript corresponding to the longest predicted protein.
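The merging of hinted and unhinted predictions and the choice of one transcript per gene can be sketched as below. The `Gene` record, the function names and the toy coordinates are hypothetical; overlap is tested on gene spans regardless of strand, which is a simplifying assumption.

```python
from collections import namedtuple

Gene = namedtuple("Gene", "scaffold start end name")


def overlaps(a, b):
    """True if two genes share at least one base on the same scaffold."""
    return a.scaffold == b.scaffold and a.start <= b.end and b.start <= a.end


def combine(hinted, unhinted):
    """All hinted genes plus unhinted genes that overlap no hinted gene."""
    extra = [g for g in unhinted if not any(overlaps(g, h) for h in hinted)]
    return list(hinted) + extra


def longest_isoform(protein_lengths):
    """protein_lengths: {transcript_id: predicted protein length}."""
    return max(protein_lengths, key=protein_lengths.get)


hinted = [Gene("s1", 100, 500, "h1"), Gene("s1", 1000, 1500, "h2")]
unhinted = [Gene("s1", 450, 700, "u1"), Gene("s1", 600, 900, "u2"),
            Gene("s2", 100, 500, "u3")]
combined = combine(hinted, unhinted)
```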

Combining Augustus and MAKER predictions.

The Augustus gene predictions from above were supplemented with non-overlapping predictions from MAKER⁴⁶. The MAKER pipeline consisted of four steps. Firstly, repetitive elements in each genome were identified and masked using RepeatMasker by scanning scaffolds for matches to repeats from RepeatRunner⁴⁷, Repbase⁴⁸ and a species-specific repeat library generated using RepeatModeler. Secondly, ab initio gene models to be used as evidence within MAKER were generated using Augustus 2.5.5⁴⁹, GeneMark-ES 2.3a (self-trained)⁵⁰, and SNAP⁵¹ 2013-02-16. Further gene models used as MAKER input were generated using the comparative algorithms genBlastG⁵² (which used comparisons to C. elegans gene models from WormBase⁵³) and RATT³⁹ (which transferred S. ratti gene models to our other five species). Thirdly, species-specific ESTs and cDNAs from INSDC⁵⁴, and proteins from related species (see below), were aligned against the genomes using BLASTN and BLASTX³⁴, respectively, and these alignments were further refined with respect to splice sites using exonerate³⁸. Finally, the EST and protein homology alignments, comparative gene models, and ab initio gene predictions were integrated and filtered by MAKER to produce a gene set for each species, with just one transcript predicted for each gene.

This four-step MAKER pipeline was run three consecutive times. The first run was performed using the est2genome and protein2genome options, using species-specific ESTs and cDNAs, and nematode protein sequences, respectively. The nematode proteins used were UniRef90 clusters for nematodes from UniProt⁵⁵. For this first MAKER run, Augustus and SNAP were trained using CEGMA⁵⁶ gene models for KOGs, as well as nematode orthologous groups created using OrthoMCL⁵⁷ to cluster proteins from the full proteomes of 12 nematode species and several eukaryotic outgroups (Makedonka Mitreva, Washington University, St. Louis, U.S.A., personal communication). Clusters containing at least one member from each of the 12 nematodes were kept, and hidden Markov models (HMMs) were built for the clusters using HMMER⁵⁸. Gene models obtained from the first MAKER run were used to train SNAP, and MAKER was run a second time, using the same nematode proteins as in the first run.

Gene models from the second run were then used to train Augustus. Using the trained

versions of SNAP and Augustus, MAKER was run a third time, using a taxonomically

broader protein set that included proteins from metazoans from complete proteomes from

UniProt and a subset of proteins from helminths from GeneDB⁵⁹. The resulting MAKER

gene set was filtered to remove less reliable gene models, as follows. Firstly, any MAKER

gene models that were based on exonerate or BLASTX alignments, and did not overlap

any Augustus, genBlastG or RATT gene model, were discarded, as they were probably

due to spurious alignments. Secondly, MAKER gene models that encoded proteins

shorter than 30 amino acids were discarded. Thirdly, if two different MAKER gene models

overlapped in their coding sequence, the gene model with the worse MAKER score (i.e.

AED score) was discarded.
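The three filters can be sketched as follows for gene models on a single scaffold. The `Model` record and its field names are hypothetical. Overlap is tested on gene spans rather than coding sequence, and conflicts are resolved greedily from the best (lowest) AED score upwards; both are simplifying assumptions rather than the exact procedure used.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    start: int
    end: int
    protein_len: int
    aed: float             # MAKER annotation edit distance, lower is better
    alignment_only: bool   # built only from exonerate or BLASTX alignments
    supported: bool        # overlaps an Augustus, genBlastG or RATT model


def filter_models(models, min_protein_len=30):
    # Filters 1 and 2: unsupported alignment-only models, and short proteins.
    kept = [m for m in models
            if not (m.alignment_only and not m.supported)
            and m.protein_len >= min_protein_len]
    # Filter 3: among overlapping models keep the one with the better AED.
    final = []
    for m in sorted(kept, key=lambda m: m.aed):
        if not any(m.start <= k.end and k.start <= m.end for k in final):
            final.append(m)
    return sorted(final, key=lambda m: m.start)


toy_models = [
    Model("a", 1, 100, 200, 0.10, False, True),
    Model("b", 50, 150, 100, 0.40, False, True),   # overlaps "a", worse AED
    Model("c", 200, 300, 20, 0.05, False, True),   # protein too short
    Model("d", 400, 500, 50, 0.20, True, False),   # alignment-only, unsupported
    Model("e", 600, 700, 80, 0.30, True, True),
]
```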

Manual curation of gene models.

To ensure that phylogenetic analyses of the astacin gene family were based on accurate

gene models, all astacin gene models from S. ratti (184 genes), S. stercoralis (237 genes)

and S. venezuelensis (217 genes) were manually curated in Artemis⁴⁰, using RNA-seq

data and BLAST³⁴ search results against the NCBI database (Supplementary Table 24) to

refine gene structures. In addition to astacin coding genes, 54 SCP/TAPS coding genes

from different species were also manually curated (26 S. papillosus, 17 S. stercoralis, 7 S.

venezuelensis, 3 S. ratti, 1 Rhabditophanes). If a gene was manually curated, it replaced

the original Augustus/MAKER gene prediction in our final gene set.

10. Functional annotation

Assigning protein names to predicted proteins.

Unique names were assigned to each predicted protein, following UniProt's protein naming guidelines, where possible. For each predicted protein of interest an ortholog was identified that had a manually curated protein name, in UniProt⁵⁵ (taking human, zebrafish, Drosophila melanogaster, C. elegans, and Schistosoma mansoni orthologs) or GeneDB⁵⁹ (S. mansoni orthologs). One-to-one or many-to-one (e.g. many-S. ratti to one-C. elegans) orthologs were first identified based on phylogenetic trees from an in-house Ensembl Compara⁶⁰ database that included the Strongyloides spp. data, together with human, zebrafish, D. melanogaster, C. elegans and S. mansoni. The correspondence between UniProt accessions and Ensembl accessions (used in the Ensembl Compara database) was obtained from UniProt. In order of preference, an ortholog with a manually curated protein name was selected from: C. elegans, S. mansoni, human, D. melanogaster, then zebrafish. From UniProt, the recommended name of the ortholog was used; from GeneDB, the product description was used. Where no ortholog with a manually curated protein name was found, an ortholog with a non-curated protein name (i.e. from a TrEMBL entry⁵⁵) was used. The protein name was then transferred to the predicted protein of interest and the UniProt/GeneDB accession of the source (ortholog) protein was noted; the evidence code for the protein name was recorded as ECO:0000265 ('sequence orthology evidence used in automatic assertion') using the Evidence Code Ontology. If several genes in a species of interest were assigned the same protein name (for example, because of many-to-one orthology to the same C. elegans gene), they were numbered sequentially to give unique names. If a particular query protein was not assigned any protein name based on its orthologs, then a protein name was assigned based on InterPro⁶¹ domains in the protein, as recommended by UniProt. The InterPro accession(s) of the source domains were noted, and the evidence code for the protein name was recorded as ECO:0000259 ('match to InterPro signature evidence used in automatic assertion'). If a query protein was not assigned a protein name based on either orthologs or InterPro domains, it was named 'hypothetical protein'. The protein names were added to the protein FASTA file headers for each species.
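The order of precedence described above can be sketched as a small lookup. The species preference list comes from the text; the function names, the tuple layout, the reuse of the same species order for non-curated names, the numbering format and the decision to leave 'hypothetical protein' unnumbered are all assumptions made for illustration.

```python
from collections import Counter

PREFERENCE = ["C. elegans", "S. mansoni", "human", "D. melanogaster", "zebrafish"]


def choose_name(orthologs, interpro_name=None):
    """orthologs: list of (species, protein_name, is_curated) tuples."""
    for want_curated in (True, False):      # curated names first, then TrEMBL
        for species in PREFERENCE:
            for sp, name, is_curated in orthologs:
                if sp == species and is_curated == want_curated:
                    return name
    return interpro_name or "hypothetical protein"


def number_duplicates(names):
    """Append 1, 2, ... to names shared by several genes in one species."""
    totals, seen, out = Counter(names), Counter(), []
    for name in names:
        if totals[name] > 1 and name != "hypothetical protein":
            seen[name] += 1
            out.append(f"{name} {seen[name]}")
        else:
            out.append(name)
    return out
```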

11. Identification of gene families, orthologs and paralogs

To establish orthology relationships among representative nematode species, non-redundant proteomes of eight nematode outgroup species (Supplementary Table 1) were obtained from WormBase⁵³ (version WS244). These outgroup species spanned four nematode clades, as previously defined⁶²: clade I, Trichinella spiralis, Trichuris muris; clade III, Ascaris suum, Brugia malayi; clade IV, Bursaphelenchus xylophilus, Meloidogyne hapla; clade V, Necator americanus, C. elegans. An Ensembl Compara⁶⁰ database was constructed based on these species along with the computationally predicted proteomes of the six species in this study. The Ensembl Compara build pipeline required an input species tree. To construct this tree, 951 OrthoMCL⁵⁷ clusters were identified with one gene per species that, when aligned using MAFFT⁶³, produced an alignment where less than 20% of columns had gaps. We concatenated the 951 alignments and from this built a maximum likelihood tree using RAxML v8.0.24⁶⁴ with the Gamma model of amino acid substitution and 100 bootstrap replicates (-f a -m PROTGAMMAILGF -N 100). This tree contained the clade ((((S. ratti, S. stercoralis), (S. venezuelensis, S. papillosus)), P. trichosuri), Rhabditophanes). As the input tree we used a tree with the clade above for our species, with the remainder of the tree based on one previously described⁶²: ((Trichuris, Trichinella),((Ascaris, Brugia),(Caenorhabditis, Haemonchus),((Bursaphelenchus, Meloidogyne), (clade of our species)))). The complete set of Compara gene families of our six species and the eight outgroups is given in Supplementary Table 8.
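The alignment filter used when choosing clusters for the input species tree (fewer than 20% of columns containing gaps) can be sketched as below. The function names and the toy alignments are hypothetical, and '-' is assumed to be the only gap character.

```python
def gapped_column_fraction(alignment):
    """Fraction of alignment columns in which at least one sequence has a gap."""
    n_cols = len(alignment[0])
    gapped = sum(any(seq[i] == "-" for seq in alignment) for i in range(n_cols))
    return gapped / n_cols


def passes_gap_filter(alignment, max_fraction=0.2):
    """True if less than `max_fraction` of columns contain a gap."""
    return gapped_column_fraction(alignment) < max_fraction


# Toy aligned protein sequences of equal length (illustrative only).
good = ["ACDEFGHIKL", "ACDEFGHI-L", "ACDEFGHIKL"]
bad = ["AC-E", "A-DE"]
```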

The Ensembl Compara database was queried using the Ensembl Perl API, to identify

orthologs, paralogs, gene duplications, gene losses, and shared or species-specific gene

families among these nematode species. Orthologs, within-species paralogs, and the

members of each gene family were retrieved from the database using standard methods in

the Perl API. Gene losses and duplications along the species tree and gene families were

identified by the presence of 'lost taxa' and 'duplication_confidence_score' tags,

respectively, on the nodes of the Ensembl Compara gene trees. We only considered

duplication events where the duplication confidence score was higher than 0.0. Clade- and

species-specific gene families were identified by finding the taxonomy level of the root

node of each gene tree.

Compara families were identified where all members were annotated as hypothetical

proteins. The six largest gene families from the Strongyloides clade were annotated as

Strongyloides genome project families (sgpf) 1-6. A further three families annotated as

hypothetical protein were found to be upregulated in parasitic females and these were

designated sgpf-7-9. See Supplementary Table 11.

Some large gene families, such as those encoding astacins, were divided into multiple

smaller Ensembl Compara gene families. Therefore, for the phylogenetic analysis of these,

all astacin coding gene families in the Compara database were manually identified and

merged into a new multiple alignment and phylogenetic tree containing all members.

12. Species tree reconstruction

From the Ensembl Compara database 4,437 gene families were identified that contain just

one gene from each species and are present in at least 10 out of 14 species. The proteins

in each family were aligned using MAFFT v6.857⁶⁶, poorly-aligned regions were

trimmed using GBlocks v0.91b, and then the 316 trimmed alignments were concatenated.

For each alignment, the best-fitting amino acid substitution model was identified as that

minimising the Akaike Information Criterion from the set of models available in RAxML

v8.0.24⁶⁵, testing models with both pre-defined amino acid frequencies and observed

frequencies in the data, and all with the 'CAT' model of rate variation across sites. A

maximum likelihood phylogenetic tree was produced based on the concatenated

alignment, with each protein alignment an independent partition of these data, applying the

best-fitting substitution model identified above to each partition. This inference used

RAxML v8.0.24 with 10 random addition-sequence replicates and 100 bootstrap

replicates, and otherwise default heuristic search settings.
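Two parts of this procedure lend themselves to a short sketch: selecting single-copy families present in enough species, and picking the substitution model that minimises the Akaike Information Criterion, AIC = 2k - 2 ln L, for one alignment. The function names, the data layout and the toy likelihood values are hypothetical; in practice the likelihoods would come from RAxML runs.

```python
def is_single_copy(family, min_species=10):
    """family: {species: [gene ids]} listing only species that have a member."""
    return len(family) >= min_species and all(len(g) == 1 for g in family.values())


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_model(fits):
    """fits: {model name: (log-likelihood, number of free parameters)}."""
    return min(fits, key=lambda model: aic(*fits[model]))


# Toy fits for one protein alignment (values invented for illustration).
toy_fits = {"LG": (-10500.0, 27), "WAG": (-10520.0, 27), "LGF": (-10495.0, 46)}
chosen = best_model(toy_fits)
```

In the toy example the model with empirical frequencies has the highest likelihood but loses on AIC because of its extra parameters.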

13. Analysis of intron-exon structure

Common introns were identified from gene structures and full gene nucleotide alignments of the 208 single copy orthologs using SciPio⁶⁵ and GenePainter⁶⁶. The output from GenePainter was parsed into DOLLOP⁶⁷ to infer intron gains and losses on every node of the species tree using maximum parsimony. The lengths of exons for outgroup species were parsed from version WS244 GFF files from WormBase⁵³.

14. Synteny analysis

The S. ratti genome was used as a reference for analyzing conservation of synteny among

species. Scaffolds of at least 10 kb from the other five species were unambiguously

assigned into chromosomal linkage groups based on unique nucleotide alignments to S.

ratti using the nucmer²² program. Regions of pairwise similarity containing ≥3 genes in the

same order and orientation were defined for the six species using DAGchainer⁶⁸.

Synteny between S. ratti and each of the other five genome assemblies was also

examined at low resolution using PROmer, which uses translated sequence⁶⁹. Regions ≥5

kb with ≥60% amino acid identity were identified across scaffolds and contigs ≥1 Mb

(those smaller than 1 Mb were excluded from this analysis). Scaffolds and contigs from

species other than S. ratti were aligned to the S. ratti chromosomes and their positions

within the relevant chromosome color-coded to ease interpretation. Synteny was

visualized using Circos⁷⁰.

15. Mitochondrial genomes

Using genes reported for the C. elegans⁷¹ or S. stercoralis⁷² mitochondrial genomes, initial seeds were identified from nuclear assemblies of the six species based on BLAST searches. Sequence reads indicative of mitochondrial origin in all six species were identified and reassembled by iterative mapping to the initial seeds using the MITObim assembler with k-mers of 31 and 45⁷³. Curation and circularization of the mitochondrial genomes were performed manually using Gap5¹³ and Artemis⁴⁰. Gaps in the assemblies were filled either with capillary or PacBio sequence data followed by PCR amplification of the gap regions by specific primers. The annotation of each mitochondrial gene sequence was obtained from MITOS⁷⁴ and manually curated. The gene order in each assembly was confirmed by PCR using primers specific to each protein-coding gene pair. The existence of two mitochondrial molecules of Rhabditophanes sp. was also confirmed by PCR.

A maximum likelihood tree was constructed using twelve mitochondrial genes for each species, excluding Meloidogyne hapla for which too few data were available. For this we used twelve conserved proteins, specifically all protein-coding genes in Strongyloides spp.: ATP synthase subunit 6 (atp6), cytochrome oxidase subunits 1–3 (cox1, cox2, cox3), cytochrome b (cob), and NADH dehydrogenase subunits 1–6 and 4L (nad1, nad2, nad3, nad4, nad5, nad6, and nad4L). Amino acid sequences were aligned with MAFFT⁷⁵ (v7.164b) using the L-INS-i option and the alignments cleaned using GBlocks (v0.91b) before concatenation. After concatenation, phylogenetic analysis was performed using RAxML⁶⁴ v7.2.8 with the best fitting empirical model of amino acid substitution and 1,000 bootstrap resampling replicates.

16. Gene ontology (GO)

Assigning GO terms to predicted proteins.

Gene Ontology (GO) terms were transferred from human, zebrafish, C. elegans, and D.

melanogaster orthologs to the predicted proteins from each of our six species. First,

manually curated GO annotations for human, zebrafish, C. elegans and D. melanogaster

were obtained⁷⁶, and filtered to exclude annotations not based on experimental evidence

(i.e. retaining only those with evidence codes IDA/IEP/IGI/IMP/IPI), annotations with a

'NOT' qualifier, and annotations to the GO:0005515 ('protein binding') term, following the

criteria used by the Ensembl Compara project for projecting GO terms to vertebrate

orthologs⁶⁰. For each predicted protein in a particular species, all orthologs (including

one-to-one, one-to-many, and many-to-many) of the gene in human, zebrafish, C. elegans, and

D. melanogaster were identified based on phylogenetic trees in an in-house Ensembl

Compara⁶⁰ database (see Supplementary Note 10). To assign GO terms to a particular

query gene we identified its human, zebrafish, C. elegans and D. melanogaster orthologs

that had manually curated GO terms. Taking each pair of orthologs A, B from two different

species (e.g. a C. elegans ortholog and a zebrafish ortholog, but not two C. elegans

orthologs), we used a breadth-first search algorithm to find the last common ancestors of

their GO terms in the GO hierarchy. For example, if A has GO terms {A1, A2, A3} and B

has GO terms {B1, B2}, we found the last common ancestors of A1+B1, A1+B2, A2+B1,

A2+B2, A3+B1, and A3+B2. The GO terms assigned to the query gene were the union of

the last common ancestors of GO terms for all pairs of orthologs from two different

species. We removed any GO term from this set that is an ancestor (in the GO hierarchy)

of another term in the set. GO terms of the three possible types (molecular function,

cellular component and biological process) were assigned to the query protein in this way.

The UniProt⁵⁵ accession of the source (ortholog) protein was noted, and the evidence

code for the GO terms was recorded as IEA ('inferred from electronic annotation').
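The projection of last common ancestor terms can be sketched on a toy ontology as follows. The term names and the `parents` mapping are invented and do not correspond to real GO identifiers. A term is treated as its own ancestor, so identical terms in two orthologs project unchanged, and 'last common ancestors' is read as the most specific shared ancestors; both are assumptions about details the text leaves open.

```python
from collections import deque
from itertools import combinations


def ancestors(term, parents):
    """The term itself plus all its ancestors, found by breadth-first search."""
    seen, queue = {term}, deque([term])
    while queue:
        for parent in parents.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def most_specific(terms, parents):
    """Drop every term that is a proper ancestor of another term in the set."""
    terms = set(terms)
    return {t for t in terms
            if not any(t != o and t in ancestors(o, parents) for o in terms)}


def last_common_ancestors(a, b, parents):
    return most_specific(ancestors(a, parents) & ancestors(b, parents), parents)


def project_terms(terms_by_species, parents):
    """terms_by_species: {species: [set of GO terms per ortholog, ...]}."""
    result = set()
    for sp1, sp2 in combinations(terms_by_species, 2):  # different species only
        for terms_a in terms_by_species[sp1]:
            for terms_b in terms_by_species[sp2]:
                for a in terms_a:
                    for b in terms_b:
                        result |= last_common_ancestors(a, b, parents)
    return most_specific(result, parents)


# Toy ontology: child -> list of parents (hypothetical term names).
parents = {
    "dna_binding": ["nucleic_acid_binding"], "rna_binding": ["nucleic_acid_binding"],
    "nucleic_acid_binding": ["binding"], "binding": ["root"],
    "kinase": ["catalysis"], "catalysis": ["root"],
}
orthologs = {"C. elegans": [{"dna_binding", "kinase"}],
             "zebrafish": [{"rna_binding", "kinase"}]}
projected = project_terms(orthologs, parents)
```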

Our pipeline is based on Ensembl Compara's pipeline for transferring GO terms to orthologs in vertebrate species⁶⁰. In order to maximize the number of GO terms, annotations were projected from all orthologs, not just one-to-one orthologs (unlike the Ensembl Compara pipeline that is optimized for transferring GO terms directly between relatively closely related orthologs of vertebrate species). The last common ancestor terms of pairs of orthologs were transferred in order to conserve GO terms across more distantly related taxa. For each query protein GO terms were also assigned using InterProScan⁷⁷, which identifies InterPro⁶¹ domains in the protein and maps GO terms to the domains. The InterPro accession(s) of the source domains were noted, and the evidence code for the GO terms was recorded as IEA.

Analysis of enriched GO terms

GO analysis was performed using the R (v3.1.2) package topGO⁷⁸ v2.16.0 using a

Fisher's exact test with a P value cut-off of < 0.001. Genes upregulated in each stage of

the life cycle tested (compared with both other stages of the life cycle) were tested for

enrichment against all other genes in the genome.

17. RNA preparation and RNA-seq

S. ratti

RNA extraction. S. ratti parasitic, free-living adult females and iL3s in 500 μL of Trizol

(Invitrogen) were stored at -80 °C before use. An additional 500 μL of Trizol was added to

the frozen worm pellet and once thawed the contents transferred to MagNA Lyser Green

Beads (2 mL tube prefilled with 1.4 mm diameter ceramic beads, Roche Applied Science).

The sides of the original sample tube were washed with a further 500 μL of Trizol and this

was transferred to the MagNA Lyser green tube. Worms were homogenized by placing the

tube in a FastPrep (FP 120, Thermo Scientific), run at maximum speed for 3 x 20 s,

placing the tube on ice for 1 min between each run. Trizol solution containing dissolved

worms was then transferred to a new nuclease-free 1.5 mL microfuge tube and 200 μL of

chloroform:isoamyl alcohol (24:1 mix; Sigma-Aldrich) was added. The tube was inverted

and centrifuged at 13,000 g for 20 min at 4 °C in a pre-cooled centrifuge for phase

separation. The top, aqueous layer was removed and transferred to a new nuclease-free

1.5 mL microfuge tube. 1 μL of GlycoBlue (Invitrogen) was added to the sample followed

by 800 μL of isopropanol. The tube was placed at -80 °C for 2 h, then centrifuged at

13,000 g for 20 min at 4 °C in a pre-cooled centrifuge to pellet precipitated total RNA. The

supernatant was removed and the pellet was washed twice with 70% (v/v) ethanol (in

nuclease-free water). The pellet was dried briefly at room temperature before

resuspension in nuclease-free water. Total RNA was quantified using an Agilent 2100

Bioanalyser RNA Nano Chip (Agilent Technologies). The RNA samples were stored at -80

°C before use.

Messenger RNA (mRNA) isolation. mRNA was isolated from total RNA using 20 μL of

Dynabeads Oligo (dT)25 magnetic beads. The supernatant was removed in each step by

placing tubes on a high-strength magnetic rack (DynaMag-96 Side Magnet, Life

Technologies), for 2 min, so that the beads collected in a pellet on the side of the tube and

a pipette could be used to remove liquid without disturbing the beads. Beads were placed

in a 0.2 mL nuclease-free microcentrifuge tube and pre-washed twice with 100 μL 2x

binding buffer (40 mM TrisHCl, 2 M LiCl, 4 mM EDTA, pH7.5). Total RNA was made up to

a volume of 50 μL with nuclease-free water. The beads were resuspended in 50 μL of 2x

binding buffer and then mixed with the total RNA. The mix was denatured for 5 min at

65 °C then cooled to 4 °C. The mix was then removed and left for 5 min at room

temperature to facilitate binding. The supernatant was removed and kept. The beads were

washed twice with 200 μL wash buffer (10 mM TrisHCl, 0.15 M LiCl, 1 mM EDTA, pH7.5),

pipetting up and down to mix thoroughly. A 50 μL volume of elution buffer (10 mM TrisHCl,

pH7.5) was then added to the beads and the sample heated to 80 °C for 2 min, then

cooled to 25 °C. 50 μL of 2x binding buffer was added and the binding and wash steps

were repeated once more (to improve the specificity of mRNA binding). For the final

elution step, 17 μL of elution buffer was mixed with the beads and the sample heated to

80 °C for 2 min; the tube was then placed on the magnetic rack immediately and the

supernatant, containing mRNA, was removed carefully to avoid uptake of beads. Isolated

mRNA was quantified using an Agilent 2100 Bioanalyser RNA Pico Chip (Agilent

Technologies).

Acoustic mRNA fragmentation. Isolated mRNA samples were sheared to a suitable

fragment size using Covaris Adaptive Focused Acoustics technology. The mRNA was

diluted to a final volume of 120 μL, which was then added to a Crimp Cap microTUBE

(Covaris, Inc.). The following program was run on the sealed sample tube: Duty cycle:

10%; Intensity: 5; Cycles per burst: 200; Time: 60 s; Temperature: 4 °C. The sample was

transferred to a new nuclease-free 1.5 mL microfuge tube and 1 μL of GlycoBlue and

800 μL of isopropanol were added to the sample. The tube was placed at -80 °C for 2 h,

then centrifuged at 13,000 g for 20 min at 4 °C in a pre-cooled centrifuge to pellet

precipitated RNA. Supernatant containing isopropanol was removed and the pellet was

washed twice with 70% (v/v) ethanol (in nuclease-free water). The pellet was dried briefly

at room temperature before resuspension in 10 μL nuclease-free water.

Reverse transcription and multiplexed Illumina library preparation. Reverse transcription

was performed using all of the sample with 100 ng random hexamers and SuperScript III

reverse transcriptase (Invitrogen) following the manufacturer's instructions. Second strand

synthesis was performed with NEBNext mRNA Second Strand Synthesis Module (New

England Biolabs, Inc.) following the manufacturer's instructions. The reaction was cleaned

using 1.8x Agencourt AMPure XP beads (Beckman Coulter, Inc.), eluting in nuclease-free

water. NEBNext End Repair Module (New England Biolabs, Inc.) was used to ensure blunt

ended DNA with 5'-phosphate and 3'-hydroxyl and the reaction again cleaned with 1.8x

Agencourt AMPure XP beads. The NEBNext dA-Tailing Module was used to add dAMP to

the 3' end of the fragments, to prevent self ligation and promote ligation to sequencing

adaptors. Again the reaction was cleaned with 1.8x Agencourt AMPure XP beads and

resuspended in 20 μL nuclease-free water. The following was added to the sample to

ligate sequencing adaptors: 25 μL of DNA T4 ligase buffer (New England Biolabs, Inc.), 1

μL 33 μM PCR-free paired-end duplex-indexed adaptors17 from Integrated DNA

Technologies and 4 μL T4 DNA ligase. A different indexed adaptor was used for each

sample to allow for multiplex sequencing. The ligation reaction was incubated at 25 °C for

15 min. The ligation reaction was cleaned with 0.8x Agencourt AMPure XP beads, size-

selecting to remove adaptor-dimers. The finished library was quantified using an Agilent

2100 Bioanalyser High Sensitivity DNA Chip (Agilent Technologies).

Quantification and sequencing. Libraries were quantified using a Library Quantification Kit

– Illumina/Universal (KAPA Biosystems) and the StepOnePlus Real-Time PCR System

(Applied Biosystems). Libraries were sequenced on the Illumina HiSeq 2000 or MiSeq

following the manufacturer's instructions for standard clustering and sequencing protocols

for 100 bp or 150 bp paired-end reads.

S. venezuelensis

RNA extraction. Total RNA was extracted using TRI reagent according to the

manufacturer's instructions after processing the nematode material in a freeze-crushing

apparatus (SK Mill, Tokken).

Library preparation. RNA libraries were prepared with an Illumina TruSeq RNA Sample

Preparation Kit and sequenced on an Illumina HiSeq 2000 sequencer following the

manufacturer's recommended protocol to produce 101 bp paired-end reads.

S. stercoralis

We used previously published RNA-seq data for parasitic females, free-living females and

iL3s4. The libraries and other sources for these RNA-seq data are shown in

Supplementary Table 24.

18. Transcriptome differential expression analysis

At least two biological replicates were used for each stage of the life cycle analyzed, with

the exception of the iL3 stage of S. ratti where only one biological sample was available

(ERS09250). This particular sample was also sequenced with 150 bp paired-end reads. To

make the data comparable, these reads were clipped to 100 bp. Reads were mapped to

either the S. ratti or S. stercoralis genome sequence using TopHat v242 with the following

parameters: maximum intron length (-I) = 10000, expected inner distance between mate

pair reads (-r) = 150, and max-multi hits (-g) = 1 (i.e. where a read could be matched to

multiple loci, it was randomly mapped to only one). Reads with mapping quality less than

30 were removed (SAMtools79 view -q 30). The S. ratti iL3 sample comprised 11.7 million

reads, whereas the other samples had 76-142 million reads. To avoid the problem of

observing counts for many genes in parasitic female and free-living female samples that

had zero counts in iL3, the number of reads for the parasitic female and free-living female

were reduced to the number produced for the iL3 sample by randomly down-sampling

(using SAMtools view -s) each sample to c. 10 million reads. Reads mapping per gene were

determined using BEDtools80 coverage (v2.17.0). The same approach was applied to S.

stercoralis RNA-seq data although no libraries were down-sampled.
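The down-sampling step can be illustrated with a short sketch. It assumes that subsampling keeps each read pair independently with a fixed probability, which is how fraction-based subsampling broadly behaves; the function names, the seed and the toy read names are hypothetical and are not taken from the study.

```python
import random


def subsample_fraction(total_reads, target_reads):
    """Fraction of reads to keep so that roughly target_reads remain."""
    return min(1.0, target_reads / total_reads)


def downsample(read_names, target_reads, seed=0):
    """Keep each read pair with equal probability; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    fraction = subsample_fraction(len(read_names), target_reads)
    return [name for name in read_names if rng.random() < fraction]
```

For a library of 100 million reads and a target of 10 million, `subsample_fraction` returns 0.1, so the retained number is close to, but not exactly, the target.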

Differentially expressed genes in all pairwise comparisons (parasitic vs. free-living;

parasitic vs. iL3; and iL3 vs. free-living) were determined for S. ratti and S. stercoralis

using the R (version 3.1.2) package edgeR81 version 3.6.8. Genes with less than one

count per million reads were excluded. Genes were considered to be differentially

expressed if the false discovery rate (FDR) was ≤0.001 and the fold change was ≥ 2.
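The expression filter and the significance thresholds can be written out as a small sketch. The statistical testing itself was done in edgeR and is not reproduced here. The rule that a gene is retained if it reaches one count per million in at least one sample is an assumption (the text does not state how many samples must pass), and all function names are hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale raw per-gene read counts to counts per million mapped reads."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


def expressed_genes(samples, min_cpm=1.0):
    """Genes reaching min_cpm in at least one sample (assumed retention rule)."""
    kept = set()
    for counts in samples:
        kept |= {g for g, cpm in counts_per_million(counts).items() if cpm >= min_cpm}
    return kept


def is_differentially_expressed(fdr, log2_fold_change, max_fdr=0.001, min_fold_change=2.0):
    """Apply the FDR <= 0.001 and fold change >= 2 thresholds to one gene."""
    return fdr <= max_fdr and abs(log2_fold_change) >= math.log2(min_fold_change)
```

The fold change is taken on the log2 scale, so a two-fold change in either direction corresponds to an absolute log2 value of at least 1.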

A three-way comparison was made between the transcriptomes of parasitic females, free-

living females and iL3s. Genes were considered to be upregulated in one stage of the life

cycle if they were upregulated in that one stage compared with both other stages; e.g.

genes significantly upregulated in parasitic females compared to free-living females and

iL3s, were considered to be upregulated in parasitic females. These datasets were used

for all analyses, unless stated otherwise when pairwise comparison data were used.
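The three-way rule amounts to intersecting two pairwise results, as in the sketch below. The stage labels and the layout of `pairwise_up` (a mapping from a pair of stages to the genes upregulated in the first relative to the second) are hypothetical conventions chosen for illustration.

```python
STAGES = ("parasitic", "free_living", "iL3")


def upregulated_in_stage(stage, pairwise_up):
    """Genes upregulated in `stage` relative to both of the other stages.

    pairwise_up maps (stage, other_stage) to the set of genes that are
    significantly upregulated in `stage` compared with `other_stage`.
    """
    others = [s for s in STAGES if s != stage]
    return set.intersection(*(set(pairwise_up[(stage, other)]) for other in others))
```

A gene that is higher in parasitic females than in free-living females, but not higher than in iL3s, is therefore absent from the parasitic-female list even though it appears in one pairwise comparison.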

Enrichment of Compara families in expression data

The hypergeometric test was used to determine enrichment of Compara gene families

among genes identified as upregulated in particular comparisons among the three life-

cycle stages. The number of members of each Compara gene family that occurred in lists

of upregulated genes or proteins was compared with the random, null expectation using

the hypergeometric test implemented in the R function phyper with P value correction for

multiple hypothesis testing using the Benjamini-Hochberg approach as implemented in the

R function p.adjust. A P ≤ 0.01 was considered significant.
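For readers without R to hand, the enrichment calculation can be approximated in plain Python. This is a sketch of the same idea, not the code used in the study: `hypergeom_upper_tail` computes the upper-tail probability that R users would typically obtain from `phyper` with `lower.tail = FALSE`, and `benjamini_hochberg` follows the standard Benjamini-Hochberg adjustment. The function and argument names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, family_size, genome_size, n_upregulated):
    """P(X >= k), where X counts family members among n_upregulated genes drawn at random."""
    max_k = min(family_size, n_upregulated)
    total = comb(genome_size, n_upregulated)
    favourable = sum(
        comb(family_size, i) * comb(genome_size - family_size, n_upregulated - i)
        for i in range(k, max_k + 1)
    )
    return favourable / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

One P value would be computed per gene family and the whole vector adjusted together, after which families with an adjusted P ≤ 0.01 would be called enriched.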

19. Identification of gene clusters

Bespoke Python scripts were used to identify and analyze genes that were arranged in

physically adjacent clusters of three or more genes within the S. ratti and S. stercoralis

genomes. Clusters that spanned both coding strands were included and three types of

cluster were considered: (i) genes that are co-expressed i.e. upregulated in the same

stage of the life cycle; (ii) co-expressed and with at least 50% of the cluster members

belonging to the same Compara gene family, and (iii) genes that simply belong to the

same Ensembl Compara gene family, irrespective of their timing of gene expression.

The number of clusters expected by chance, assuming a random distribution of genes,

was calculated by randomly selecting n genes from the genome and calculating the

number of clusters present among them. This was calculated where n = the number of

genes upregulated in a given stage of the life cycle; this was repeated 1000 times and the

mean value calculated. Empirical P values were obtained by counting the number of times

the actual observed number of clusters was higher than the number seen in the 1000

randomizations.
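The randomization can be sketched as below. Several details are assumptions made for illustration: a cluster is treated as a run of three or more consecutive selected genes in genomic order, genes are sampled without replacement, and the empirical P value is taken as the fraction of randomizations that yield at least as many clusters as observed. The bespoke scripts used in the study may differ, and all names here are hypothetical.

```python
import random


def count_clusters(selected, gene_order, min_size=3):
    """Count runs of at least min_size adjacent selected genes.

    gene_order maps each chromosome or scaffold to its genes in genomic order.
    """
    clusters = 0
    for genes in gene_order.values():
        run = 0
        for gene in genes:
            if gene in selected:
                run += 1
            else:
                if run >= min_size:
                    clusters += 1
                run = 0
        if run >= min_size:  # close a run that reaches the end of the sequence
            clusters += 1
    return clusters


def permutation_test(selected, gene_order, n_iter=1000, seed=0):
    """Return (observed clusters, mean clusters under randomization, empirical P)."""
    rng = random.Random(seed)
    all_genes = [g for genes in gene_order.values() for g in genes]
    observed = count_clusters(selected, gene_order)
    null_counts = [
        count_clusters(set(rng.sample(all_genes, len(selected))), gene_order)
        for _ in range(n_iter)
    ]
    mean_null = sum(null_counts) / n_iter
    p_value = sum(c >= observed for c in null_counts) / n_iter
    return observed, mean_null, p_value
```

Here `selected` plays the role of the n genes upregulated in one life-cycle stage.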

The number of clusters and the number of clusters with a common gene family were

compared between the three life stages (parasitic females, free-living females and iL3s)

using Fisher's exact test with Bonferroni correction.

20. Astacin-like metallopeptidase and SCP/TAPS coding genes

Putative astacin-like metallopeptidases were identified with InterProScan61 (based on matches to InterPro entry IPR001506 "Peptidase M12A, astacin domain") and searches of the MEROPS peptidase database82 (based on matches to family M12A "Astacin"). To establish the evolutionary relationships of astacins in sequenced nematodes, the active protease domains (defined by Pfam:PF01400) longer than 152 bp (80% of the published astacin domain sequence) were gathered from all 14 nematode species. The first discovered astacin, from the noble crayfish Astacus astacus (EMBL:X95684), was included in the analysis as an outgroup.

Genes encoding SCP/TAPS were identified by running InterProScan across all gene

predictions and searching for CAP domain annotations within the output (IPR014044 "CAP

domain" and Pfam:PF00188 "CAP"). This was carried out for all 14 species used in this

study (six clade IV species and eight outgroup species).

For each gene family dataset, the amino acid sequences were aligned in MAFFT63. The

alignments were edited with TCS83 using a weighted option and the distance matrix of the

new alignment was calculated using ProtTest84. The phylogenetic tree with the best

distance matrix model was constructed by maximum likelihood using RAxML85 with 100

bootstrap replicates.

For S. ratti, we identified 139 astacin-coding genes that are Strongyloides-specific, since

they belong to a clade of the phylogenetic tree (Main Text, Figure 4) that contains only S.

ratti astacin-coding gene copies. These S. ratti-species-specific astacins were located on

chromosomes II and X, but not on chromosome I. Another 27 copies appear to be shared

across nematodes, since they belong to a clade that also contains astacins from other

nematodes. Pfam was used to identify the domain combinations found in these copies

(Supplementary Table 9). Similarly, we identified 81 Strongyloides-specific SCP/TAPS

genes, and eight that are shared across nematodes, and identified their domain

combinations using Pfam (Supplementary Table 9).

21. Proteome analysis

Protein extraction and mass spectrometry (MS) analysis.

Protein was extracted from the worms by freeze-thawing and grinding the worms in 0.1%

(w/v) Rapigest (Waters). Samples were heated at 80 °C for 10 min, reduced with 3 mM

dithiothreitol (Sigma) at 60 °C for 10 min then alkylated with 9 mM iodoacetamide (Sigma)

at room temperature for 30 min in the dark. Proteomic grade trypsin (Sigma) was added at

a protein : trypsin ratio of 50:1 and samples incubated at 37 °C overnight. Rapigest was

removed by adding trifluoroacetic acid to a final concentration of 1% (v/v) and incubating

at 37 °C for 2 h. Peptide samples were centrifuged at 12,000 g for 1 h at 4 °C to remove

precipitated Rapigest.

Peptide mixtures (1 µg in 2 µL) were analyzed by on-line nanoflow liquid chromatography

using the nanoACQUITY-nLC system (Waters MS technologies, Manchester, UK) coupled

to an LTQ-Orbitrap Velos (ThermoFisher Scientific, Bremen, Germany) mass spectrometer

equipped with the manufacturer's nanospray ion source. The analytical column

(nanoACQUITY UPLC™ BEH130 C18 15 cm x 75 µm, 1.7 µm capillary column) was

maintained at 35 °C and a flow-rate of 300 nL/min. The gradient consisted of 3-40%

acetonitrile in 0.1% (v/v) formic acid for 90 minutes then a ramp of 40-85% acetonitrile in

0.1% (v/v) formic acid for 5 min. Full scan MS spectra (m/z range 300-2000) were acquired

by the Orbitrap at a resolution of 30,000. Analysis was performed in data dependent

mode. The 20 most intense ions from MS1 scan (full MS) were selected for tandem MS by

collision induced dissociation (CID) and all product spectra were acquired in the LTQ ion

trap. There were two biological replicates available for parasitic and free-living samples,

and each of these was run three times to enable quantification.

Data analysis

Proteins were identified and relatively quantified by analyzing the .raw files using

Progenesis QI (version 2.0, Nonlinear Dynamics). Replicate LC-MS runs were aligned

using the default settings and an auto-selected run as a reference. Peaks were picked by

the software using default settings and filtered to include only peaks with a charge state of

between +2 and +6. Data were ranked and the top five peaks for each peptide identified.

Peptide intensities of replicates were normalized against all peptides identified by

Progenesis QI. Spectral data were transformed to .mgf files with Progenesis QI and

exported for peptide identification using Mascot (version 2.3.02, Matrix Science) where

tandem MS data were searched against the S. ratti predicted proteome. Mascot search

parameters were as follows: precursor mass tolerance set to 10 ppm and fragment mass

tolerance set to 0.8 Da. One missed tryptic cleavage was permitted.

Carbamidomethylation (cysteine) was set as a fixed modification and oxidation

(methionine) set as a variable modification. Mascot search results were further processed

using the machine-learning algorithm Percolator. The false discovery rate was < 1%.

Individual ion scores of greater than 13 indicated identity or extensive homology (P <

0.05).

Only proteins that contained at least two unique peptides were included in the downstream

analysis. Protein abundance (iBAQ) was calculated as the sum of all the peak intensities

(from Progenesis output) divided by the number of theoretically observable tryptic

peptides, taking into account protein length86. Protein abundance was normalized by

dividing the protein iBAQ value by the summed iBAQ values for that sample. The reported

abundance is the mean of the replicates.
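The abundance calculation can be sketched as follows. The in-silico digestion rules are assumptions that the text does not specify: cleavage after K or R except before P, no missed cleavages, and a peptide length window of 6 to 30 residues for a peptide to count as theoretically observable. All function names are hypothetical.

```python
def tryptic_peptides(sequence, min_len=6, max_len=30):
    """In-silico tryptic digest under the assumed rules; returns observable peptides."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return [p for p in peptides if min_len <= len(p) <= max_len]


def ibaq(summed_intensity, sequence):
    """Summed peak intensity divided by the number of observable tryptic peptides."""
    n_observable = len(tryptic_peptides(sequence))
    return summed_intensity / n_observable if n_observable else 0.0


def normalise(ibaq_by_protein):
    """Divide each protein's iBAQ value by the summed iBAQ of the sample."""
    total = sum(ibaq_by_protein.values())
    return {protein: value / total for protein, value in ibaq_by_protein.items()}


def mean_over_replicates(replicates):
    """Mean normalised abundance per protein; a missing protein counts as zero (assumption)."""
    proteins = set().union(*replicates)
    return {p: sum(r.get(p, 0.0) for r in replicates) / len(replicates) for p in proteins}
```

Dividing by the number of observable peptides is what corrects for protein length, since longer proteins yield more peptides and hence more summed intensity.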

Comparison of parasitic and free-living females

Two biological samples were collected for parasitic and free-living females. For each

biological sample, three technical repeats were run through the MS. Progenesis used the

summed peptide ion intensities from all six samples (two biological repeats x three

technical repeats) to calculate the relative fold change between the proteomes of parasitic

and free-living females. Significant differences were analyzed statistically by ANOVA. Q-

values are adjusted P values and were calculated by Progenesis based on an optimized

false discovery rate (FDR) approach. A protein was considered to be differentially expressed if

q < 0.05.

Data analysis for the excretory/secretory (ES) proteome

Raw spectral files, including those previously used in87 and additional data collected as

described by87 were converted to mgf files with MSConvert (ProteoWizard88). Converted

mgf files were searched against the S. ratti predicted proteome using Mascot (version

2.3.02, Matrix Science) search engine, where MS/MS files from all slices of the same gel

were merged into a single search. Mascot search parameters were as follows: precursor

mass tolerance set to 1.5 Da and fragment mass tolerance set to 0.8 Da. One missed

tryptic cleavage was permitted. Carbamidomethylation (cysteine) was set as a fixed

modification and oxidation (methionine) set as a variable modification. Mascot search

results were further processed using the machine-learning algorithm Percolator. The false

discovery rate was < 1% and individual ion scores of greater than 13 indicated identity or

extensive homology (P < 0.05). Protein quantitation was calculated using Mascot's built-in

algorithm, emPAI89. Only proteins with a minimum of two significant peptides were used for

further analysis.

22. S. ratti intrachromosomal homogeneity

Tandem and inverted repeats in the S. ratti assembly were identified using Tandem Repeats

Finder90 and Inverted Repeats Finder91, respectively. The repeat content was calculated in 10

kb non-overlapping windows using BEDTools80. To find S. ratti genes with significant

sequence alignments to yeast genes, we downloaded all Saccharomyces cerevisiae open

reading frames from the Saccharomyces Genome Database92, and found all S. ratti

predicted proteins that had BLASTP34 alignments (E-value < 10^-50) to these.

SUPPLEMENTARY TABLES

Supplementary Table 1. Properties of the (a) genome assemblies and (b) predicted gene sets of four species of Strongyloides, Parastrongyloides trichosuri and Rhabditophanes sp. and eight outgroup species. All genome statistics (a) apart from repeat content, are based on scaffolds that are at least 1000 bp in size (except M. hapla, which is based on contigs). Repeat content is based on all scaffolds. Gene statistics (b) are based on all scaffolds. The eight outgroup species are: Caenorhabditis elegans, Necator americanus, Meloidogyne hapla, Trichinella spiralis, Ascaris suum, Brugia malayi, Bursaphelenchus xylophilus and Trichuris muris. (a)

| Species | Clade | Chromosomes [b] | Assembly version | Assembled size (Mb) | Number of scaffolds | Scaffold N50 [e] (kb) | N50 [e] (number) | Scaffold N90 [e] (kb) | N90 [e] (number) | Longest scaffold (Mb) | G+C content (%) | Sequence coverage (% not gap) | Repeat content (%) [f] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S. ratti | IV | 3 | V5.0.4 | 43.1 | 115 | 11,700 | 2 | 748 | 7 | 16.8 | 21 | 99.41 | 5.9 |
| S. stercoralis | IV | 3 | V2.0.4 | 42.6 | 675 | 431 | 16 | 96 | 89 | 5.0 | 22 | 99.97 | 9.5 |
| S. papillosus | IV | 2 | V2.1.4 | 60.2 | 4,353 | 86 | 129 | 4 | 1,763 | 1.7 | 26 | 99.89 | 16.7 |
| S. venezuelensis | IV | 2 | V2.0.4 | 52.1 | 520 | 715 | 16 | 115 | 83 | 5.9 | 25 | 99.92 | 11.2 |
| S. venezuelensis | IV | 2 | V3 [d] | 55.5 | 653 | 2,127 | 5 | 81 | 82 | 15.1 | 25 | 93.80 | 10.5 |
| P. trichosuri | IV | 3 | V2.0.4 | 42.2 | 1,391 | 837 | 12 | 16 | 173 | 6.2 | 31 | 99.37 | 4.7 |
| Rhabditophanes | IV | 5 [c] | V2.0.4 | 47.2 | 380 | 537 | 22 | 81 | 101 | 7.3 | 32 | 99.69 | 5.1 |
| C. elegans | V | 6 | WS244 | 100.2 | 7 | 17,500 | 3 | 13,800 | 6 | 20.9 | 36 | 100.00 | 21.9 |
| N. americanus | V | 6 | WS244 | 244.1 | 11,712 | 213 | 283 | 29 | 1,336 | 1.9 | 34 | 85.29 | 39.2 |
| M. hapla [a] | IV | 16 | WS244 | 52.9 | 3,389 | 38 | 372 | 7 | 1,599 | 0.36 | 27 | n/a1 | 31.7 |
| T. spiralis | I | 3 | WS244 | 61.1 | 3,853 | 7,600 | 3 | 6 | 266 | 12.0 | 31 | 91.84 | 32.2 |
| A. suum | III | 12 | WS244 | 266.1 | 2,414 | 419 | 171 | 98 | 680 | 3.8 | 37 | 97.21 | 9.5 |
| B. malayi | III | 5 | WS244 | 94.0 | 9,755 | 191 | 131 | 2 | 2,410 | 5.2 | 27 | 91.75 | 25.9 |
| B. xylophilus | IV | 6 | V1.2 | 73.1 | 3,555 | 988 | 21 | 34 | 102 | 3.6 | 40 | 97.98 | 16.4 |
| T. muris | I | 3 | V2.1 | 84.5 | 1,459 | 401 | 59 | 81 | 227 | 1.8 | 45 | 99.36 | 33.0 |

(b)

| Species | CEGMA completeness (%) [g], complete | CEGMA completeness (%) [g], partial | Average CEG gene number [g], complete | Average CEG gene number [g], partial | Number of genes | Number of genes per Mb | Mean protein length (aa) | Median protein length (aa) | Number of exons | Exons combined length (Mb) | Mean number of exons per gene | Mean exon length (bp) | Median exon length (bp) | Number of introns | Mean intron length (bp) | Median intron length (bp) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S. ratti | 99.6 | 99.6 | 1.11 | 1.11 | 12,451 | 289 | 468 | 362 | 33,796 | 17.5 | 2 | 519 | 263 | 21,345 | 188 | 52 |
| S. stercoralis | 99.6 | 99.6 | 1.13 | 1.13 | 13,098 | 307 | 456 | 350 | 34,366 | 17.9 | 2 | 522 | 265 | 21,268 | 196 | 51 |
| S. papillosus | 99.2 | 99.6 | 1.14 | 1.16 | 18,457 | 307 | 404 | 314 | 40,821 | 22.4 | 2 | 549 | 304 | 22,364 | 143 | 48 |
| S. venezuelensis | 95.6 | 95.6 | 1.12 | 1.13 | 16,904 | 324 | 400 | 306 | 40,619 | 20.3 | 2 | 500 | 261 | 23,715 | 207 | 50 |
| P. trichosuri | 98.8 | 98.8 | 1.47 | 1.54 | 15,010 | 356 | 460 | 354 | 35,049 | 20.8 | 2 | 592 | 348 | 20,039 | 179 | 48 |
| Rhabditophanes | 99.6 | 99.6 | 1.17 | 1.19 | 13,496 | 286 | 438 | 326 | 37,987 | 17.8 | 2 | 467 | 276 | 24,491 | 261 | 48 |
| C. elegans | 100 | 100 | 1.06 | 1.07 | 23,629 | 204 | 396 | 328 | 145,275 | 30.1 | 5 | 209 | 146 | 169,506 | 207 | 66 |
| N. americanus | 89.5 | 94.4 | 1.04 | 1.09 | 19,153 | 78 | 268 | 191 | 122,794 | 15.5 | 5 | 126 | 112 | 121,646 | 643 | 141 |
| M. hapla | 94.8 | 96.8 | 1.09 | 1.13 | 14,420 | 273 | 348 | 250 | 88,160 | 15.1 | 4 | 171 | 144 | 73,740 | 153 | 53 |
| T. spiralis | 96.4 | 96.4 | 1.12 | 1.15 | 16,380 | 268 | 318 | 192 | 87,853 | 15.7 | 4 | 178 | 129 | 71,473 | 198 | 83 |
| A. suum | 94.0 | 99 | 1.13 | 1.15 | 18,542 | 70 | 327 | 233 | 119,166 | 18.2 | 5 | 153 | 138 | 100,624 | 1,023 | 690 |
| B. malayi | 96.8 | 97.2 | 1.08 | 1.12 | 18,001 | 149 | 329 | 217 | 138,488 | 20.0 | 5 | 145 | 133 | 120,487 | 331 | 223 |
| B. xylophilus | 97.6 | 98.4 | 1.08 | 1.09 | 21,058 | 242 | 378 | 295 | 115,135 | 25.7 | 4 | 223 | 178 | 94,077 | 266 | 75 |
| T. muris | 95.6 | 96.8 | 1.07 | 1.07 | 10,935 | 129 | 414 | 288 | 82,886 | 13.6 | 5 | 164 | 117 | 71,951 | 435 | 58 |

a. M. hapla is an unscaffolded assembly, and therefore contig statistics are reported here.

b. Chromosome numbers previously reported: S. ratti93, S. stercoralis94, S. papillosus95, S. venezuelensis7, P. trichosuri10, C. elegans96, N. americanus97,

M. hapla98, T. spiralis99, A. suum100, B. malayi101, B. xylophilus102, T. muris103.

c. See Supplementary Figure 7.

d. S. venezuelensis V2.0.4 was used throughout this study except for the synteny analysis against S. ratti, for which S. venezuelensis V3 genome

assembly was used. Gene models were not predicted on S. venezuelensis V3.

e. With scaffolds listed by size, N50 and N90 refer to the sizes above which 50 and 90% of the assembled bases are distributed, respectively. The

number of scaffolds corresponding to 50% and 90% of assembled bases are indicated by N50 (number) and N90 (number), respectively.

f. The repeat content estimates include simple repeats. For the six species from this study, the estimates were based on RepeatMasker output (see

Supplementary Note 8), while for the other species soft-masked versions of the assemblies were downloaded from WormBase53, and the percent of the

assembly that is masked was calculated.

g. Assembly completeness was estimated by CEGs (Core Eukaryotic Genes) with the CEGMA v2 software35. Complete and partial figures are shown in

left and right columns, respectively.
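The N50 and N90 statistics defined in footnote e can be computed as in the short sketch below. The function name `nx` and the toy scaffold lengths are hypothetical; the logic follows the usual definition, in which scaffolds are sorted from longest to shortest and accumulated until the chosen fraction of assembled bases is reached.

```python
def nx(lengths, fraction):
    """Return (scaffold length, number of scaffolds) at which the cumulative
    length of the size-sorted scaffolds first reaches `fraction` of the assembly."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must contain at least one scaffold")


scaffolds = [100, 80, 60, 40, 20]
n50_size, n50_number = nx(scaffolds, 0.5)
n90_size, n90_number = nx(scaffolds, 0.9)
```

For this toy assembly of 300 bases, the two longest scaffolds already hold half of the bases, so the N50 is 80 with an N50 (number) of 2, whereas four scaffolds are needed to reach 90%.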

Supplementary Table 2. Intron size and characteristics in nematodes. Genome characteristics, including intron characteristics, are given for the six genomes presented here (footnote a) and for eight outgroup species. In P. trichosuri and the four species of Strongyloides, less than 10% of the genome is intronic.

| Species | Clade | Genome assembly size (bp) | Number of genes | Proportion of genome intergenic (%) | Number of coding exons | Proportion of genome exonic (%) | Median exon length (bp) | Number of introns | Proportion of genome intronic (%) | Median intron length (bp) | Mean intron length (bp) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| S. ratti [a] | IV | 43,150,242 | 12,451 | 50.1 | 33,796 | 40.6 | 263 | 21,345 | 9.3 | 52 | 188 |
| S. stercoralis [a] | IV | 42,674,651 | 13,098 | 48.2 | 34,366 | 42.1 | 265 | 21,268 | 9.7 | 51 | 196 |
| S. papillosus [a] | IV | 60,448,214 | 18,457 | 57.6 | 40,821 | 37.1 | 304 | 22,364 | 5.3 | 48 | 143 |
| S. venezuelensis [a] | IV | 52,178,999 | 16,904 | 51.7 | 40,619 | 39.0 | 261 | 23,715 | 9.4 | 50 | 207 |
| P. trichosuri [a] | IV | 42,486,966 | 15,010 | 42.7 | 35,049 | 48.9 | 348 | 20,039 | 8.4 | 48 | 179 |
| Rhabditophanes [a] | IV | 47,267,908 | 13,496 | 48.9 | 37,987 | 37.6 | 276 | 24,491 | 13.5 | 48 | 261 |
| C. elegans | V | 100,286,401 | 23,629 | 9.8 | 145,275 | 30.0 | 146 | 169,506 | 60.2 | 66 | 207 |
| N. americanus | V | 244,075,060 | 19,153 | 77.9 | 122,794 | 6.3 | 112 | 121,646 | 15.7 | 141 | 643 |
| M. hapla | IV | 53,017,507 | 14,420 | 50.2 | 88,160 | 28.5 | 144 | 73,740 | 21.3 | 53 | 153 |
| T. spiralis | I | 63,525,422 | 16,380 | 53.1 | 87,853 | 24.6 | 129 | 71,473 | 22.2 | 83 | 198 |
| A. suum | III | 272,782,664 | 18,542 | 55.6 | 119,166 | 6.7 | 138 | 100,624 | 37.7 | 690 | 1,023 |
| B. malayi | III | 94,062,924 | 18,001 | 36.3 | 138,488 | 21.3 | 133 | 120,487 | 42.4 | 223 | 331 |
| B. xylophilus | IV | 74,561,461 | 21,058 | 32.0 | 115,135 | 34.5 | 178 | 94,077 | 33.5 | 75 | 266 |
| T. muris (v2b) | I | 84,674,602 | 10,935 | 47.0 | 82,886 | 16.1 | 117 | 71,951 | 36.9 | 58 | 435 |

a. The version of the genome assembly used for this analysis was: v2 for Rhabditophanes sp., P. trichosuri and S. stercoralis; v2.1 for S. papillosus; v2.2

for S. venezuelensis and v5 for S. ratti. The versions of the genome assemblies used for the outgroup species are shown in Supplementary Table 1.

Supplementary Table 3. Gene synteny between S. ratti and three species of Strongyloides and Parastrongyloides trichosuri. Each scaffold from each species was first assigned to a S. ratti chromosome based on pairwise unique nucleotide alignments using Nucmer22, including scaffolds of ≥10 kb. The top S. ratti nucleotide alignment did not always cover the whole of a scaffold from one of the other four species, suggesting either intra- or inter-chromosomal rearrangement. To investigate this, we used DAGchainer68 to identify syntenic blocks containing ≥3 orthologous genes in the same order and orientation, between S. ratti and the relevant Strongyloides or Parastrongyloides species. The table shows the total number of genes in syntenic blocks of two types: (i) syntenic blocks that matched the same chromosome as the Nucmer alignment ("Genes in intra-chromosomal synteny blocks"), and (ii) syntenic blocks that matched a different chromosome to the Nucmer alignment ("Genes in inter-chromosomal synteny blocks"). For S. ratti and each of the species of Strongyloides and P. trichosuri, the intra-chromosomal synteny blocks are genomic regions that have not undergone any inter-chromosomal rearrangements, only intra-chromosomal rearrangements (if any). For each species pair, the majority of genes are on intra-chromosomal synteny blocks. For example, there were 2,556 S. venezuelensis genes in intra-chromosomal syntenic blocks with S. ratti chromosome I, i.e. these genes lie on scaffolds whose top alignment is with S. ratti chromosome I. Similarly, there were 4,160 S. venezuelensis genes in intra-chromosomal blocks with S. ratti chromosome II, and 1,715 in intra-chromosomal syntenic blocks with S. ratti chromosome X. The inter-chromosomal synteny blocks are cases of inter-chromosomal rearrangement, since the syntenic block matches a different S. ratti chromosome compared to the top Nucmer alignment of the scaffold. For example, there were 303 S. venezuelensis genes in blocks with S. ratti chromosomes I (or X), but where genes lie on scaffolds whose top alignment was with S. ratti chromosome X (or I). In other words, S. venezuelensis scaffolds aligned to S. ratti chromosome I often contained syntenic blocks matching chromosome X, and vice versa for chromosomes X and I. Using a one-sided binomial test, we find that the number of genes in such "I and X" syntenic blocks (compared to genes in "II and X", or "I and II" blocks) in S. venezuelensis and in S. papillosus is greater than expected by chance. This is likely due to the X-I chromosomal fusion in the S. venezuelensis-S. papillosus lineage, followed by intra-chromosomal rearrangements in the X-I fusion chromosome.

Nature Genetics: doi:10.1038/ng.3495


| Species | Genes in intra-chromosomal synteny blocks: I | Genes in intra-chromosomal synteny blocks: II | Genes in intra-chromosomal synteny blocks: X | Genes in inter-chromosomal synteny blocks: I and X (b) | Genes in inter-chromosomal synteny blocks: Rest (c) | P value (d) |
|---|---|---|---|---|---|---|
| S. stercoralis (n=3) (a) | 3,228 | 4,841 | 2,017 | 67 | 170 | 0.96 |
| S. venezuelensis (n=2) | 2,556 | 4,160 | 1,715 | 303 | 134 | <0.001 |
| S. papillosus (n=2) | 3,069 | 4,602 | 2,005 | 33 | 10 | <0.001 |
| P. trichosuri (n=3) | 2,740 | 3,453 | 1,856 | 48 | 767 | 1 |

a. n = haploid chromosome number.

b. 'I and X' means that a syntenic block with S. ratti chromosome I lies on a scaffold that has its top nucleotide alignment to S. ratti chromosome X, or that a syntenic block with S. ratti chromosome X lies on a scaffold that has its top nucleotide alignment to S. ratti chromosome I.

c. 'II and X' and 'I and II'.

d. One-sided binomial test of whether the number of I-X inter-chromosomal syntenic genes differed from the null expectation of one third (i.e. those I and X, II and X, and I and II).
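The test in footnote d can be approximated from the counts in the table. The sketch below assumes an exact one-sided ("greater") binomial test with success probability 1/3, where a success is a gene in an 'I and X' block and the number of trials is the total number of genes in inter-chromosomal blocks. The function name and the log-space summation are illustrative choices, not the authors' code.

```python
from math import exp, lgamma, log

def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    def log_pmf(i):
        log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return log_choose + i * log(p) + (n - i) * log(1 - p)
    # Clamp at 1.0 to absorb floating-point rounding in the sum.
    return min(1.0, sum(exp(log_pmf(i)) for i in range(k, n + 1)))

# (genes in 'I and X' blocks, genes in the remaining inter-chromosomal blocks)
counts = {
    "S. stercoralis": (67, 170),
    "S. venezuelensis": (303, 134),
    "S. papillosus": (33, 10),
    "P. trichosuri": (48, 767),
}

# Null expectation (assumed from footnote d): one third of genes fall in 'I and X' blocks.
p_values = {
    species: binom_upper_tail(i_x, i_x + rest, 1 / 3)
    for species, (i_x, rest) in counts.items()
}

for species, p in p_values.items():
    print(f"{species}: P = {p:.3g}")
```

Under these assumptions, only S. venezuelensis and S. papillosus fall below 0.001, in line with the P value column.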


Supplementary Table 4. Chromosomal regions that undergo chromatin diminution or belong to the X chromosome. Analysis of sex-specific read depth to identify chromosomal regions that undergo chromatin diminution in S. papillosus, and belong to the X chromosome in S. ratti, S. stercoralis and P. trichosuri. Part (a) is for S. ratti, (b) for S. stercoralis, (c) for S. papillosus and (d) for P. trichosuri. The median coverage in females (or males, or mixed-sex L2s) is given as a fraction of the median autosomal coverage in females (males, or mixed-sex L2s). The names of scaffolds/chromosomes that are inferred to belong to diminished or X regions are in red, and those inferred to belong to non-diminished or autosomal regions are in blue. Supplementary Table 4 is an Excel file.
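The normalisation described in this caption can be sketched as follows. Only the idea of expressing each scaffold's median coverage as a fraction of the median autosomal coverage comes from the caption; the scaffold names, depth values, the list of autosomal scaffolds, and the 0.75 cutoff are hypothetical, and the sketch assumes that a region present in one copy in males shows roughly half the autosomal depth.

```python
from statistics import median

def relative_coverage(scaffold_medians, autosomal_scaffolds):
    """Express each scaffold's median depth as a fraction of the median autosomal depth."""
    baseline = median(scaffold_medians[name] for name in autosomal_scaffolds)
    return {name: depth / baseline for name, depth in scaffold_medians.items()}

# Hypothetical per-scaffold median read depths for one male sample.
male_depth = {"scaf_A": 40.0, "scaf_B": 42.0, "scaf_C": 38.0, "scaf_D": 20.5}
autosomes = ["scaf_A", "scaf_B", "scaf_C"]  # assumed to be known autosomal scaffolds

male_ratio = relative_coverage(male_depth, autosomes)

# Assumed cutoff: a ratio well below 1 in males suggests an X-linked or diminished region.
CUTOFF = 0.75
candidates = sorted(name for name, ratio in male_ratio.items() if ratio < CUTOFF)
print(candidates)
```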


Supplementary Table 5. The use of genetic markers to identify regions of chromatin diminution in S. papillosus. For each S. papillosus genetic marker (a sequence tagged site), we experimentally determined whether the marker was diminished in S. papillosus (refs. 6, 12, and unpublished data). For some markers we also determined on which S. papillosus chromosome the marker was located. The markers were subsequently computationally mapped to scaffolds in the S. papillosus assembly. Six scaffolds were identified as diminished and 19 as non-diminished. The scaffolds identified as diminished or non-diminished based on read depth analysis (Supplementary Table 4; column F) agreed with those identified based on genetic markers, providing validation of the read depth approach.

| Scaffold to which the marker was computationally mapped | Genetic marker | Chromosome carrying marker (a) | Marker diminished? (genetic analysis) | Scaffold diminished? (read depth analysis) (b) |
|---|---|---|---|---|
| SPAL_scaffold0000018 | ytp1 | II | No | No |
| SPAL_scaffold0000001 | ytp2 | II | No | No |
| | ytp46 | II | No | |
| SPAL_scaffold0000015 | ytp3 | I | No | No |
| SPAL_contig0000008 | ytp4 | Not Done | No | No |
| SPAL_scaffold0000043 | ytp5 | Not Done | No | No |
| SPAL_scaffold0000003 | ytp14 | II | No | No |
| | ytp12 | II | No | |
| SPAL_scaffold0000056 | ytp8 | Not Done | No | No |
| SPAL_scaffold0000044 | ytp9 | Not Done | No | No |
| SPAL_scaffold0000065 | ytp10 | I | No | No |
| SPAL_contig0000002 | ytp48 | Not Done | No | No |
| | ytp11 | Not Done | No | |
| SPAL_scaffold0000041 | ytp13 | Not Done | No | No |
| SPAL_scaffold0000011 | ytp15 | I | No | No |
| SPAL_contig0000011 | ytp44 | I | No | No |
| SPAL_scaffold0000005 | ytp49 | Not Done | No | No |
| SPAL_contig0000043 | ytp126 | II | No | No |
| SPAL_scaffold0000047 | ytp127 | II | No | No |
| SPAL_contig0000006 | ytp128 | I | No | No |
| SPAL_contig0000184 | ytp129 | I | No | No |
| SPAL_scaffold0000059 | ytp131 | II | No | No |
| SPAL_scaffold0000019 | ytp50 | I | Yes | Yes |
| | ytp83 | I | Yes | |
| SPAL_scaffold0000035 | ytp85 | Not Done | Yes | Yes |
| | ytp84 | I | Yes | |
| SPAL_scaffold0000009 | ytp86 | I | Yes | Yes |
| SPAL_scaffold0000028 | ytp133 | I | Yes | Yes |
| SPAL_contig0000141 | ytp134 | I | Yes | Yes |
| SPAL_scaffold0000013 | ytp135 | I | Yes | Yes |

a. S. papillosus chromosome carrying the marker, determined genetically.

b. Read depth analysis from Supplementary Table 4.


Supplementary Table 6. Diminished and non-diminished S. papillosus genes compared to S. ratti. (a) S. ratti X chromosome genes whose S. papillosus orthologs are non-diminished and lie in syntenic blocks between the species; (b) S. ratti X chromosome genes whose S. papillosus orthologs are non-diminished and do not lie in syntenic blocks between the species; and (c) S. papillosus diminished genes with one-to-one orthologs on S. ratti autosomes. The orthologs in (b) and (c) are not in regions of conserved synteny. We considered an ortholog pair to be strongly supported by the phylogenetic tree if the Strongyloides/Parastrongyloides/Rhabditophanes clade gene tree follows the species tree for the clade.

In (a), S. papillosus scaffolds SPAL_scaffold0000059 and SPAL_contig0000009 are both inferred to be non-diminished from the read depth analysis (Supplementary Table 4), but contain genes with one-to-one orthologs on S. ratti chromosome X. Dosage of these genes has therefore changed since the species diverged. Apart from the three genes on S. papillosus SPAL_scaffold0000059 that have orthologs on S. ratti chromosome X, another 12 genes have orthologs on S. ratti chromosome II, suggesting this scaffold is part of S. papillosus chromosome II. In agreement with this, S. papillosus gene SPAL_0001251300 on this scaffold is orthologous to S. ratti X-gene SRAE_X000126600, which was previously reported to be orthologous to a gene on S. papillosus chromosome II based on genetic markers (Table 1 in ref. 12, gene tli-1). S. papillosus SPAL_contig0000009 has 33 genes with orthologs on S. ratti X, and 20 with orthologs on S. ratti I, so is likely to be part of the S. papillosus X-I fusion chromosome.

(a) Non-diminished S. papillosus genes in syntenic blocks

| S. ratti gene | S. ratti gene product description | S. papillosus ortholog | Chromosomal location of S. papillosus ortholog | S. papillosus ortholog supported by tree? | C. elegans ortholog (a) |
|---|---|---|---|---|---|
| SRAE_X000126500 | Transmembrane protein 17 | SPAL_0001251200 | SPAL_scaffold0000059 | Yes | None |
| SRAE_X000126600 | Toll-interacting protein | SPAL_0001251300 | SPAL_scaffold0000059 | Yes | tli-1 |
| SRAE_X000201800 | JNK-interacting protein 1 | SPAL_0001306100 | SPAL_contig0000009 | Yes | jip-1 |
| SRAE_X000201900 | Major facilitator superfamily; Major facilitator superfamily domain, general substrate transporter; Major facilitator superfamily domain-containing protein | SPAL_0001306200 | SPAL_contig0000009 | Yes | mct-1, mct-2, mct-3 |
| SRAE_X000202000 | Ceramide glucosyltransferase | SPAL_0001306300 | SPAL_contig0000009 | Yes | cgt-1, cgt-2, cgt-3 |
| SRAE_X000202100 | EF-hand domain; EF-hand domain pair-containing protein | SPAL_0001306400 | SPAL_contig0000009 | Yes | calu-1 |
| SRAE_X000202200 | EB domain; Cysteine-rich repeat; Insulin-like growth factor binding protein, N-terminal domain-containing protein | SPAL_0001306500 | SPAL_contig0000009 | Yes | None; homolog of M03F4.6 |
| SRAE_X000202300 | Helix-loop-helix protein 13 | SPAL_0001306600 | SPAL_contig0000009 | Yes | hlh-13 |
| SRAE_X000202400 | Hypothetical protein | SPAL_0001306700 | SPAL_contig0000009 | Yes | F56D3.1 |
| SRAE_X000202500 | Frag1/DRAM/Sfk1 family-containing protein | SPAL_0001306800 | SPAL_contig0000009 | Yes | None; homolog of F11E6.6 and C33A11.2 |
| SRAE_X000202600 | Hypothetical protein | SPAL_0001306900 | SPAL_contig0000009 | Yes | None |
| SRAE_X000202700 | Hypothetical protein | SPAL_0001307000 | SPAL_contig0000009 | Yes | C27D8.3 |
| SRAE_X000202800 | Adapter molecule Crk | SPAL_0001307100 | SPAL_contig0000009 | Yes | ced-2 |
| SRAE_X000202900 | Protein NPR-31 | SPAL_0001307200 | SPAL_contig0000009 | Yes | npr-31 |
| SRAE_X000203000 | Sodium/chloride cotransporter 3 | SPAL_0001307300 | SPAL_contig0000009 | Yes | kcc-3 |
| SRAE_X000203100 | Anchor cell fusion failure-1 | SPAL_0001307400 | SPAL_contig0000009 | Yes | aff-1 |
| SRAE_X000203200 | Hypothetical protein | SPAL_0001307500 | SPAL_contig0000009 | Yes | F10E9.10 |
| SRAE_X000203300 | Hypothetical protein | SPAL_0001307600 | SPAL_contig0000009 | Yes | Y53F4B.27 |
| SRAE_X000203400 | PAN-1 domain; Apple-like domain-containing protein | SPAL_0001307700 | SPAL_contig0000009 | Yes | None |
| SRAE_X000203500 | PAN-1 domain; Apple-like domain-containing protein | SPAL_0001307800 | SPAL_contig0000009 | Yes | ZC449.1, ZC449.2 |
| SRAE_X000203600 | G protein-coupled receptor, rhodopsin-like family; Globin, structural domain; GPCR, rhodopsin-like, 7TM domain-containing protein | SPAL_0001307900 | SPAL_contig0000009 | Yes | None |
| SRAE_X000203700 | MBlk-1 Related factor-1 | SPAL_0001308000 | SPAL_contig0000009 | Yes | mbr-1 |
| SRAE_X000203800 | Hypothetical protein | SPAL_0001308100 | SPAL_contig0000009 | Yes | pgn-35 |
| SRAE_X000203900 | Hypothetical protein | SPAL_0001308200 | SPAL_contig0000009 | Yes | C05B5.4, R10E12.2 |
| SRAE_X000204000 | Calmodulin-like protein 3 | SPAL_0001308300 | SPAL_contig0000009 | Yes | cal-3 |
| SRAE_X000204100 | Ribosome control protein 1 domain-containing protein | SPAL_0001308400 | SPAL_contig0000009 | Yes | R06F6.8 |
| SRAE_X000204200 | Glycoprotein-N-acetylgalactosamine 3-beta-galactosyltransferase 1 | SPAL_0001308500 | SPAL_contig0000009 | Yes | Y38C1AB.1, Y38C1AB.5 |
| SRAE_X000204300 | Protein of unknown function DUF273 family-containing protein | SPAL_0001308600 | SPAL_contig0000009 | Yes | F32H2.8, F32H2.11 |
| SRAE_X000204400 | Hypothetical protein | SPAL_0001308700 | SPAL_contig0000009 | Yes | None |
| SRAE_X000204500 | Dehydrogenase/reductase SDR family member on chromosome X | SPAL_0001308800 | SPAL_contig0000009 | Yes | dhs-22 |
| SRAE_X000204600 | Zinc finger, C2H2 domain; Zinc finger C2H2-type/integrase DNA-binding domain; Zinc finger, C2H2-like domain-containing protein | SPAL_0001308900 | SPAL_contig0000009 | Yes | egrh-1, egrh-2, egrh-3 |
| SRAE_X000204700 | InaD-like protein | SPAL_0001309000 | SPAL_contig0000009 | Yes | C52A11.3, mpz-1 |
| SRAE_X000204800 | Domain of unknown function DB domain-containing protein | SPAL_0001309100 | SPAL_contig0000009 | Yes | dao-2 |
| SRAE_X000204900 | Protein BBS-1 | SPAL_0001309200 | SPAL_contig0000009 | Yes | bbs-1 |
| SRAE_X000205100 | Hypothetical protein | SPAL_0001309400 | SPAL_contig0000009 | Yes | None |


(b) Non-diminished S. papillosus genes that are not in syntenic blocks

| S. ratti gene | S. ratti gene product description | S. papillosus ortholog | Chromosomal location of S. papillosus ortholog | S. papillosus ortholog supported by tree? | C. elegans ortholog |
|---|---|---|---|---|---|
| SRAE_X000010800 | Hypothetical protein | SPAL_0001723700 | SPAL_contig0000115 | Yes | tag-343 |
| SRAE_X000089300 | UDP-glucuronosyl/UDP-glucosyltransferase family-containing protein | SPAL_0000670300 | SPAL_scaffold0000051 | Yes | 52 orthologs |
| SRAE_X000128200 | Proteasome-associated protein ECM29 homolog | SPAL_0000004200 | SPAL_scaffold0000041 | Yes | H04D03.3, D2045.2 |
| SRAE_X000137800 | Saccharopine dehydrogenase-like oxidoreductase | SPAL_0001187000 | SPAL_scaffold0000038 | Yes | Y50D4B.2, F22F7.1, F22F7.2 |
| SRAE_X000138000 | Structural maintenance of chromosomes protein 4 | SPAL_0000541100 | SPAL_contig0000059 | Yes | smc-4, dpy-27 |
| SRAE_X000140500 | Rh30-like protein | SPAL_0001739800 | SPAL_contig0000005 | Yes | rhr-2 |
| SRAE_X000143800 | Astacin-like metalloendopeptidase | SPAL_0001595700 | SPAL_scaffold0000008 | Yes | None |
| SRAE_X000153300 | Solute carrier family 40 member 1 | SPAL_0000322900 | SPAL_scaffold0000002 | Yes | fpn-1.1 |
| SRAE_X000153450 | Proteinase inhibitor I2, Kunitz metazoa domain; EB domain; Cysteine-rich repeat; Lustrin, cysteine-rich repeated domain-containing protein | SPAL_0001251000 | SPAL_scaffold0000059 | Yes | F30H5.3 |
| SRAE_X000156500 | CUB domain-containing protein | SPAL_0000807300 | SPAL_contig0000156 | Yes | F16B12.1, F38B2.3 |
| SRAE_X000157900 | DNA polymerase kappa | SPAL_0001621000 | SPAL_scaffold0000003 | Yes | polk-1 |
| SRAE_X000163600 | AMP-dependent synthetase/ligase domain-containing protein | SPAL_0001542300 | SPAL_contig0000051 | Yes | acs-17 |
| SRAE_X000174600 | Histidine protein methyltransferase 1 homolog | SPAL_0001553800 | SPAL_scaffold0000033 | Yes | K01A11.2 |
| SRAE_X000179700 | Hypothetical protein | SPAL_0000892500 | SPAL_scaffold0000022 | Yes | None |
| SRAE_X000225100 | Hypothetical protein | SPAL_0001171900 | SPAL_contig0000081 | Yes | None |
| SRAE_X000232600 | Reverse transcriptase domain; Aspartic peptidase domain-containing protein | SPAL_0001459700 | SPAL_contig0000330 | No | None |
| SRAE_X000232700 | Integrase, catalytic core domain; Ribonuclease H-like domain-containing protein | SPAL_0000043200 | SPAL_contig0000744 | No | None |
| SRAE_X000246550 | GPCR, rhodopsin-like, 7TM domain; 7TM GPCR, serpentine receptor class x (Srx) family-containing protein | SPAL_0000870150 | SPAL_contig0000230 | Yes | None |
| SRAE_X000246600 | Putative pyrroline-5-carboxylate reductase | SPAL_0000189200 | SPAL_contig0000192 | Yes | F55G1.9 |
| SRAE_X000248900 | Peptidase S9, prolyl oligopeptidase, catalytic domain; Six-bladed beta-propeller, TolB-like domain-containing protein | SPAL_0000608400 | SPAL_contig0000196 | Yes | dpf-4 |
| SRAE_X000251200 | Cyclin-C | SPAL_0000800100 | SPAL_contig0000234 | Yes | cic-1 |


(c) Diminished S. papillosus genes

| S. ratti gene | S. ratti gene product description | S. papillosus ortholog | Chromosomal location of S. papillosus ortholog | S. papillosus ortholog supported by tree? | C. elegans ortholog |
|---|---|---|---|---|---|
| SRAE_1000075800 | Cholinesterase family; Carboxylesterase, type B domain-containing protein | SPAL_0001463700 | SPAL_contig0000003 | Yes | None; homolog of ace-2 |
| SRAE_2000474500 | Sulfotransferase family-containing protein | SPAL_0001568900 | SPAL_contig0000040 | Yes | F01D5.10, F25E5.2 |
| SRAE_1000170800 | Hypothetical protein | SPAL_0001165000 | SPAL_contig0000162 | No | None |
| SRAE_2000526800 | Zinc finger, RING-type domain; Zinc finger, C6HC-type domain; Zinc finger, RING/FYVE/PHD-type domain-containing protein | SPAL_0000391100 | SPAL_contig0000197 | No | F26F12.3 |
| SRAE_1000300300 | 7TM GPCR, serpentine receptor class x (Srx) family-containing protein | SPAL_0001477800 | SPAL_contig0000470 | No | None |
| SRAE_1000291600 | Hypothetical protein | SPAL_0000673600 | SPAL_scaffold0000028 | Yes | None |
| SRAE_1000110300 | Protein of unknown function DUF1647 family-containing protein | SPAL_0000078900 | SPAL_scaffold0000064 | Yes | F32B4.1 |
| SRAE_2000139700 | Histone H4 family; Histone core domain; Histone-fold domain-containing protein | SPAL_0000079000 | SPAL_scaffold0000064 | No | 16 orthologs |

a. If there was no C. elegans ortholog of the S. ratti gene, but there was a C. elegans gene in the gene family, these are listed here as homologs of the S. ratti gene.


Supplementary Table 7. Mitochondrial genomes of Strongyloides spp., Parastrongyloides trichosuri and Rhabditophanes sp. The mitochondrial genomes of seven outgroup species are also shown (the mitochondrial genome sequence of Meloidogyne hapla is not available).

| Species | Clade | Accession number | Length (bp) | Coding strand | Genes encoding proteins | Genes encoding rRNA | Genes encoding tRNA |
|---|---|---|---|---|---|---|---|
| Strongyloides stercoralis | IV | LC050212 | 13751 | + | 12 | 2 | 22 |
| Strongyloides ratti | IV | LC050211 | 16915 | + | 12 | 2 | 22 |
| Strongyloides venezuelensis | IV | LC050213 | 15567 | + | 12 | 2 | 22 |
| Strongyloides papillosus | IV | LC050210 | 14109 | + | 12 | 2 | 23 |
| Parastrongyloides trichosuri | IV | LC050209 | 13699 | + | 12 | 2 | 22 |
| Rhabditophanes sp. KR 3021 | IV | LC050214 (Molecule A) | 9299 | + | 5 | 0 | 6 |
| | | LC050215 (Molecule B) | 9219 | + | 7 | 2 | 16 |
| Bursaphelenchus xylophilus | IV | NC_023208 | 14778 | + | 12 | 2 | 22 |
| Caenorhabditis elegans | V | NC_001328 | 13794 | + | 12 | 2 | 22 |
| Necator americanus | V | NC_003416 | 13605 | + | 12 | 2 | 22 |
| Ascaris suum | III | NC_001327 | 14284 | + | 12 | 2 | 22 |
| Brugia malayi | III | NC_004298 | 13657 | + | 12 | 2 | 22 |
| Trichinella spiralis | I | NC_002681 | 16706 | +/- | 13 | 2 | 22 |
| Trichuris muris | I | LC050561 | 14105 | +/- | 13 | 2 | 22 |


Supplementary Table 8. Compara gene families of the six species and eight outgroup species. Part (a) lists the gene identifiers for all members of each Compara gene family, and part (b) describes the distribution of each family among species. Some large and diverse gene families, such as the astacin-coding genes, are divided into multiple subfamilies by Compara. For each family, the most common InterPro (ref. 61) hit and product description among its members are shown, with numbers to indicate their frequency. In (b), columns G-T, inclusive, show the number of members of each family in each of the species listed; column B shows the total number of members of each family for all 14 species, and column C for the six species whose genomes are presented here. Supplementary Table 8 is an Excel file.


Supplementary Table 9. Protein domain combinations for S. ratti astacin-like metallopeptidases and SCP/TAPS coding genes. S. ratti (a) astacin-like metallopeptidases and (b) SCP/TAPS were grouped into sub-families based on the combination of protein domains they possess, as determined by Pfam, using an E-value cutoff of 1.0 and excluding Pfam-B domains. Genes were classified as either Strongyloides-specific, or shared across nematodes, based on phylogenetic analysis (see Supplementary Note 20). Genes in the astacin gene family expansion in the Strongyloides genus were located on chromosomes II and X, with none on chromosome I. Although our full data set of SCP/TAPS includes 89 genes (Figure 4, Main text), we did not identify Pfam domains in all of these SCP/TAPS genes.

(a)

| Classification | Location | Domain combinations | Number |
|---|---|---|---|
| Strongyloides specific | located on chromosome X | Astacin | 23 |
| | | Astacin - CUB | 1 |
| | | Astacin - hEGF | 8 |
| | located on chromosome II | Astacin | 68 |
| | | Astacin - CUB | 18 |
| | | Astacin - hEGF | 6 |
| | located on unplaced scaffolds | Astacin | 12 |
| | | Astacin - CUB | 3 |
| Shared across nematodes | | Astacin | 13 |
| | | Astacin - ShK | 5 |
| | | Astacin - ShK - ShK | 2 |
| | | Astacin - ShK - ShK - ShK | 1 |
| | | Astacin - CUB | 2 |
| | | Astacin - CUB - TSP_1 | 3 |
| | | Astacin - CUB - Fxa - CUB - Fxa - CUB - CUB | 1 |

(b)

| Classification | Location | Domain combinations | Number |
|---|---|---|---|
| Strongyloides specific | located on chromosome X | CAP | 2 |
| | located on chromosome I | CAP | 8 |
| | located on chromosome II | CAP | 44 |
| | | CAP - CAP - CAP (a) | 1 |
| | | PT - CAP | 1 |
| | | NDUFB10 - IP_trans - IP_trans - CAP | 1 |
| | located on unplaced scaffolds | CAP | 6 |
| Shared across nematodes | | CAP | 7 |

a. Although a single gene was predicted with three CAP domains, manual inspection revealed that the prediction is an incorrect fusion of gene models. Proteins with three CAP domains are therefore unlikely.
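Grouping genes by the combination of Pfam domains they carry, as in parts (a) and (b), amounts to counting identical ordered domain lists. The sketch below illustrates that step only; the gene identifiers and domain hits are hypothetical, and the filtering by E-value and the exclusion of Pfam-B domains are assumed to have been applied upstream.

```python
from collections import Counter

# Hypothetical genes with their ordered Pfam domain hits (already filtered).
domain_hits = {
    "gene_a": ["Astacin"],
    "gene_b": ["Astacin", "CUB"],
    "gene_c": ["Astacin"],
    "gene_d": ["Astacin", "ShK", "ShK"],
}

def architecture(domains):
    """Join an ordered list of domain names into a label such as 'Astacin - CUB'."""
    return " - ".join(domains)

# Count how many genes share each domain combination.
combination_counts = Counter(architecture(d) for d in domain_hits.values())

for combo, number in combination_counts.most_common():
    print(f"{combo}\t{number}")
```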


Supplementary Table 10. Astacin-like metallopeptidases and SCP/TAPS. The (a) astacin-like metallopeptidase and (b) SCP/TAPS coding genes found in four Strongyloides spp., P. trichosuri and Rhabditophanes sp., based on the identification of Pfam domains. Where transcriptome data were available (i.e. for S. ratti and S. stercoralis), genes upregulated in (i) parasitic females are shown in red, (ii) iL3s in green and (iii) free-living adult females in blue (edgeR, False Discovery Rate (FDR) <0.001, fold-change >2; Supplementary Table 13). This color highlighting refers to genes upregulated in one life cycle stage (e.g. parasitic females) compared to both other stages (e.g. free-living females and iL3). In (a) and (b), row 2 shows the total number of these genes for each species, which is summarised in (c). Supplementary Table 10 is an Excel file.


Supplementary Table 11. Novel gene families. (a) Nine novel gene families in Strongyloides spp. and Parastrongyloides and (b) the number of genes present in these gene families that were identified in the transcriptomes of S. ratti and S. stercoralis and the proteome of S. ratti. The presence of signal peptides and transmembrane helices was determined by InterProScan. G3DSA numbers refer to the Gene3D database. N-linked glycosylation sites were predicted genome-wide using ProSite. The distribution of the number of glycosylation sites or cysteine residues per gene in each sgpf gene family was compared to the distributions of these values for all protein sequences predicted from the S. ratti genome using the Wilcoxon rank-sum test (P <0.01).

(a)

| Family | Members | SRAE/SSTP/SVE/SPAL/PTRK (a) | Mean length (aa) | Members with signal peptide | Members with TM (b) helix | Modal number of TM (b) helices | Significant difference in predicted N-linked glycosylation sites | Significant difference in cysteine content | Protein domains |
|---|---|---|---|---|---|---|---|---|---|
| sgpf-1 (c) | 203 | 22/14/39/53/75 | 677 | 105 | 164 | 1 | Yes (high) | No | - |
| sgpf-2 | 101 | 11/21/24/30/15 | 323 | 1 | 101 | 7 | No | Yes (high) | - |
| sgpf-3 | 92 | 11/9/22/32/18 | 564 | 0 | 0 | 0 | No | Yes (high) | G3DSA:3.80.10.10 (59 genes) |
| sgpf-4 | 87 | 12/10/24/40/1 | 538 | 0 | 0 | 0 | No | Yes (high) | G3DSA:3.80.10.10 (10 genes) |
| sgpf-5 | 78 | 13/11/20/34/0 | 445 | 54 | 33 | 1 | Yes (high) | No | G3DSA:3.90.190.10 (7 genes) |
| sgpf-6 | 69 | 0/0/17/52/0 | 514 | 15 | 56 | 1 | Yes (high) | No | - |
| sgpf-7 | 17 | 17/0/0/0/0 | 178 | 12 | 14 | 1 | Yes (high) | No | G3DSA:1.20.5.170 (1 gene) |
| sgpf-8 | 21 | 21/0/0/0/0 | 161 | 21 | 4 | 1 | No | Yes (low) | - |
| sgpf-9 | 8 | 8/0/0/0/0 | 121 | 8 | 2 | 1 | No | Yes (low) | - |


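(b)

| Family | Genome: S. ratti | Genome: S. stercoralis | S. ratti transcriptome: Parasitic | S. ratti transcriptome: iL3 | S. ratti transcriptome: Free-living | S. stercoralis transcriptome: Parasitic | S. stercoralis transcriptome: iL3 | S. stercoralis transcriptome: Free-living | S. ratti somatic proteome: Parasitic | S. ratti somatic proteome: Free-living | S. ratti ES proteome: Parasitic | S. ratti ES proteome: Free-living |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sgpf-1 (c) | 22 | 14 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 4 |
| sgpf-2 | 11 | 21 | 0 | 2 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 |
| sgpf-3 | 11 | 9 | 1 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| sgpf-4 | 12 | 10 | 1 | 0 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| sgpf-5 | 13 | 11 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| sgpf-6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sgpf-7 | 17 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sgpf-8 | 21 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sgpf-9 | 8 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

a. SRAE: S. ratti; SSTP: S. stercoralis; SVE: S. venezuelensis; SPAL: S. papillosus; PTRK: Parastrongyloides trichosuri

b. TM: Trans-membrane

c. Strongyloides genome project family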


Supplementary Table 12. Summary of transcriptome and proteome data. Differentially expressed (DE) S. ratti and S. stercoralis genes were identified from pairwise comparisons of the transcriptomes of parasitic females (P), free-living females (FL) and infective third stage larvae (iL3s). In a three-way comparison, genes were identified that were upregulated in one stage of the life cycle compared with both other stages. Astacin-like metallopeptidase ('astacin') and SCP/TAPS coding genes were the most commonly upregulated gene families in the transcriptome of the parasitic females, and these genes made up a large proportion of the total astacin and SCP/TAPS coding genes in the genomes (S. ratti has 184 astacins and 89 SCP/TAPS; S. stercoralis has 237 astacins and 113 SCP/TAPS). The S. ratti proteomes of parasitic and free-living females were compared to identify differentially expressed proteins. Only proteins with more than one peptide were analyzed for differential expression, but the number with just one peptide is also included below to highlight the presence of further proteins in the sample. For all comparisons, percentages above 50% are shown in bold.

| Comparison | Number of DE genes | Number of astacins that are DE | % DE genes that are astacins | DE astacins as a % of all astacins | Number of SCP/TAPS that are DE | % of DE genes that are SCP/TAPS | DE SCP/TAPS as a % of all SCP/TAPS |
|---|---|---|---|---|---|---|---|
| Transcriptome: Pairwise comparison | | | | | | | |
| S. ratti | | | | | | | |
| P (vs FL) | 909 | 106 | 11.66 | 57.61 | 63 | 6.93 | 70.79 |
| FL (vs P) | 1470 | 10 | 0.68 | 5.43 | 5 | 0.34 | 5.62 |
| iL3 (vs FL) | 2943 | 47 | 1.60 | 25.54 | 22 | 0.75 | 24.72 |
| FL (vs iL3) | 3076 | 9 | 0.29 | 4.89 | 2 | 0.07 | 2.25 |
| P (vs iL3) | 3418 | 90 | 2.63 | 48.91 | 17 | 0.50 | 19.10 |
| iL3 (vs P) | 3296 | 60 | 1.82 | 32.61 | 19 | 0.58 | 21.35 |
| S. stercoralis | | | | | | | |
| P (vs FL) | 1188 | 146 | 12.29 | 61.60 | 64 | 5.39 | 56.64 |
| FL (vs P) | 1109 | 13 | 1.17 | 5.49 | 8 | 0.72 | 7.08 |
| iL3 (vs FL) | 3598 | 59 | 1.64 | 24.89 | 46 | 1.64 | 40.71 |
| FL (vs iL3) | 3623 | 29 | 0.80 | 12.24 | 12 | 0.80 | 10.62 |
| P (vs iL3) | 4081 | 142 | 3.48 | 59.92 | 60 | 1.47 | 53.10 |
| iL3 (vs P) | 3767 | 32 | 0.85 | 13.50 | 34 | 0.90 | 30.09 |
| Transcriptome: 3-way comparison | | | | | | | |
| S. ratti | | | | | | | |
| P | 717 | 92 | 12.83 | 50.00 | 60 | 8.37 | 67.42 |
| iL3 | 2646 | 21 | 0.79 | 11.41 | 17 | 0.64 | 19.10 |
| FL | 503 | 5 | 0.99 | 2.72 | 0 | 0.00 | 0.00 |
| S. stercoralis | | | | | | | |
| P | 808 | 139 | 17.20 | 58.65 | 59 | 7.30 | 52.21 |
| iL3 | 3097 | 29 | 0.94 | 12.24 | 33 | 1.07 | 29.20 |
| FL | 354 | 1 | 0.28 | 0.42 | 1 | 0.28 | 0.88 |
| Proteome: Pairwise comparison, excluding proteins with <2 peptides | | | | | | | |
| P | 569 | 4 | 0.70 | 2.17 | 0 | 0 | 0 |
| FL | 409 | 0 | 0 | 0 | 2 | 0.49 | 2.25 |
| Proteome: Pairwise comparison, single peptides only | | | | | | | |
| P | 247 | 7 | 2.83 | 3.80 | 3 | 1.21 | 3.37 |
| FL | 238 | 1 | 0.42 | 0.54 | 2 | 0.84 | 2.25 |
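The two percentage columns for each gene family follow directly from the counts. As a worked check, the sketch below recomputes them for the S. ratti "P (vs FL)" row, using the genome totals quoted in the caption (184 astacins and 89 SCP/TAPS). The function name is illustrative.

```python
def de_family_summary(n_de_genes, n_de_in_family, n_family_in_genome):
    """Return (% of DE genes in the family, % of the family that is DE), to 2 decimals."""
    pct_of_de = 100 * n_de_in_family / n_de_genes
    pct_of_family = 100 * n_de_in_family / n_family_in_genome
    return round(pct_of_de, 2), round(pct_of_family, 2)

# S. ratti, parasitic vs free-living females: 909 DE genes, 106 astacins, 63 SCP/TAPS.
astacin = de_family_summary(909, 106, 184)
scp_taps = de_family_summary(909, 63, 89)
print(astacin, scp_taps)
```

These reproduce the tabulated values 11.66 and 57.61 for astacins and 6.93 and 70.79 for SCP/TAPS.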


Supplementary Table 13. Results of edgeR analysis of differential gene expression in S. ratti and S. stercoralis. Differential gene expression was analyzed with the R package edgeR version 3.6.7. Results are shown for the pairwise comparisons of parasitic female, free-living female and iL3 stages of the S. ratti and S. stercoralis life cycles based on RNA-seq data. The pseudocounts that are reported are calculated by edgeR. Genes are considered to be upregulated in a stage of the life cycle if FDR < 0.001 and FC > 2. Raw counts and pseudocounts are shown for replicates 1 and 2 of each life cycle stage as available. Column A shows the genes expressed significantly more in the respective life cycle stage; column B is the gene name; column O is the protein annotation (but column M for 'S. ratti iL3 vs. FL' and 'S. ratti Para vs. iL3'). FC, fold-change; CPM, counts per million; FDR, false discovery rate. Supplementary Table 13 is an Excel file.
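The upregulation rule stated above (FDR < 0.001 and FC > 2) can be applied to an exported results table as in the sketch below. The record layout and gene names are hypothetical. Fold change is assumed to be on a linear scale; edgeR itself reports log2 fold change, so FC > 2 corresponds to logFC > 1.

```python
from dataclasses import dataclass

@dataclass
class DEResult:
    gene: str
    fold_change: float  # stage A relative to stage B, linear scale
    fdr: float

def upregulated(results, min_fc=2.0, max_fdr=0.001):
    """Keep genes passing the thresholds quoted in the caption (FC > 2, FDR < 0.001)."""
    return [r.gene for r in results if r.fold_change > min_fc and r.fdr < max_fdr]

# Hypothetical rows, for illustration only.
rows = [
    DEResult("gene_1", 8.0, 1e-6),
    DEResult("gene_2", 1.5, 1e-8),  # fails the fold-change threshold
    DEResult("gene_3", 4.0, 0.01),  # fails the FDR threshold
]
selected = upregulated(rows)
print(selected)
```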


Supplementary Table 14. Orthologous genes that are upregulated in parasitic or free-living females. For S. ratti and S. stercoralis, the number of (a) upregulated orthologs and (b) upregulated orthologs of key gene families, in the same life cycle stage of these two species.

(a)

| Stage | Number of upregulated genes | Number of upregulated genes with an ortholog in both species (%) (a) | Number of upregulated genes with a one-to-one ortholog (%) | Number of upregulated genes with many-to-many orthologs (%) (b) | Number of upregulated genes with a one-to-many ortholog (%) (c) | Number of upregulated genes with an ortholog also upregulated in the same life cycle stage of both species (%) | Number of upregulated genes with a one-to-one ortholog upregulated in the same life cycle stage of both species | Number of upregulated genes with many-to-many orthologs upregulated in the same life cycle stage of both species (b) | Number of upregulated genes with a one-to-many ortholog upregulated in the same life cycle stage of both species (c, d) |
|---|---|---|---|---|---|---|---|---|---|
| S. ratti | | | | | | | | | |
| Parasitic | 909 | 739 (81) | 500 (55) | 141 (16) | 98 (11) | 423 (47) | 204 | 136 | 83 |
| Free-living | 1470 | 1379 (94) | 1307 (89) | 32 (2) | 40 (3) | 517 (35) | 488 | 18 | 11 |
| S. stercoralis | | | | | | | | | |
| Parasitic | 1188 | 945 (80) | 589 (50) | 168 (14) | 188 (16) | 457 (39) | 204 | 143 | 110 |
| Free-living | 1109 | 1023 (92) | 953 (86) | 42 (4) | 28 (3) | 522 (47) | 488 | 22 | 12 |

a. Note, throughout, all % calculations are compared to column "Number of upregulated genes".

b. Many-to-many orthologs are defined as having multiple orthologous genes present in both species.

c. One-to-many orthologs are defined as having a single gene in one species and multiple orthologous genes in the other species.

d. Each gene is counted only once.


(b) N

um

be

r o

f u

pre

gu

late

d g

en

es

wit

h

an

o

rth

olo

g

als

o

up

reg

ula

ted

in

th

e

sa

me

li

fe

cy

cle

sta

ge

of

bo

th s

pe

cie

sa

Nu

mb

er

of

the

se

b

tha

t a

re

as

tac

in c

od

ing

ge

ne

s (

% o

f a

ll

up

reg

ula

ted

a

sta

cin

c

od

ing

ge

ne

s)

Nu

mb

er

of

the

se

b

tha

t a

re

SC

P/T

AP

S c

od

ing

ge

ne

s (

% o

f

all

u

pre

gu

late

d

SC

P/T

AP

S

co

din

g g

en

es

)

Nu

mb

er

of

the

se

b

tha

t a

re

pro

lyl

en

do

pe

pti

da

se

c

od

ing

ge

ne

s

(%

of

all

p

roly

l

en

do

pe

pti

da

se

co

din

g g

en

es

)

Nu

mb

er

of

the

se

b

tha

t a

re

tra

ns

thy

reti

n-l

ike

c

od

ing

ge

ne

s (%

o

f a

ll tr

an

sth

yre

tin

-

lik

e c

od

ing

ge

ne

s)

Nu

mb

er

of

the

se

b

tha

t a

re

try

ps

in

inh

ibit

or-

lik

e

co

din

g

ge

ne

s

(%

of

all

tr

yp

sin

inh

ibit

or

lik

e c

od

ing

ge

ne

s)

S. ratti

Parasitic 423 101 (95.3) 53 (84.1) 15 (88.2) 9 (81.8) 10 (83.3)

Free-living 517 5 (50) 2 (40) 1 (100) 2 (50) 1 (100)

S. stercoralis

Parasitic 457 132 (90.4) 60 (93.8) 10 (90.9) 7 (35) 8 (80)

Free-living 522 7 (53.8) 2 (25) 1 (50) 2 (100) 1 (50)

a. From part (a) of the Table, above.

b. Of the column “Number of upregulated genes with an ortholog also upregulated in the same life cycle stage of both species”.


Supplementary Table 15. Enriched Compara gene families. Compara gene families enriched among genes upregulated in parasitic females, free-living females, and iL3s in S. ratti (transcriptome and proteome) and S. stercoralis (transcriptome). For each set of genes or proteins found to be upregulated the hypergeometric test was used to determine whether each family with members in this set was enriched. P values were corrected for multiple testing using the Benjamini-Hochberg method.

Columns: Compara family; Compara family description; members upregulated; number per species; hypergeometric test P value; Benjamini-Hochberg corrected P value.

S. ratti transcriptome up in free-living females

22971 Hypothetical protein 50 5 6 4.16E-07 1.25E-4

15459 Glycoside hydrolase, catalytic domain and Glycoside hydrolase, superfamily domain-containing protein 18

3 3 5.20E-05 7.86E-3

S. ratti transcriptome up in parasitic females

12005 Astacin-like metalloendopeptidase 31 44 3.05E-29 9.55E-27

29264 CAP domain-containing protein 22 25 5.52E-25 8.64E-23

23892 CAP domain-containing protein 19 20 3.21E-23 3.35E-21

17558 Astacin-like metalloendopeptidase 19 22 2.22E-21 1.74E-19

73567 (sgpf7) Hypothetical protein 16 17 1.62E-19 1.01E-17

9648 CAP domain-containing protein 16 18 1.38E-18 7.19E-17

18981 Trypsin Inhibitor-like, cysteine rich domain-containing protein 7 10 1.94E-07 8.68E-06

24788 Homeobox domain and Homeodomain-like-containing protein 7 11 5.07E-07 1.99E-05

9888 Astacin-like metalloendopeptidase 5 5 5.91E-07 2.05E-05

14690 Astacin-like metalloendopeptidase 7 12 1.16E-06 3.62E-05

14602 Prolyl endopeptidase 6 9 2.42E-06 6.87E-05

1377 Acetylcholinesterase 7 16 1.37E-05 3.57E-04

3190 Astacin-like metalloendopeptidase 7 18 3.44E-05 8.29E-04

22738 CAP domain-containing protein 4 5 4.98E-05 1.11E-03


14778 Astacin-like metalloendopeptidase 5 9 6.14E-05 1.28E-03

5599 Astacin-like metalloendopeptidase 6 15 0.000107096 2.10E-03

5781 Aspartic peptidase family and Aspartic peptidase domain-containing protein 5 10 0.000117011 2.15E-03

188058 Hypothetical protein 3 3 0.000184083 3.03E-03

193598 Hypothetical protein 3 3 0.000184083 3.03E-03

152890 Hypothetical protein 4 7 0.000317931 4.98E-03

S. ratti transcriptome up in iL3s

S. ratti proteome up in free-living females

S. ratti proteome up in parasitic females

7965 Protein lethal(2)essential for life 7 17 1.06E-05 0.004741645

S. stercoralis transcriptome up in free-living females

22971 Hypothetical protein 7 9 4.05E-10 1.13E-07

16959 Epidermal growth factor-like domain and C-type lectin domain and von Willebrand factor, type A domain and MD domain and C-type lectin-like domain and C-type lectin fold domain-containing protein 4 7 1.90E-05 2.64E-03

35766 Reverse transcriptase domain and Integrase, catalytic core domain and Ribonuclease H-like domain-containing protein 4 8 3.71E-05 3.44E-03

21774 Hypothetical protein 3 4 8.26E-05 4.59E-03

45668 Bloom syndrome protein 3 4 8.26E-05 4.59E-03

35386 Hypothetical protein 3 5 0.000202238 9.37E-03

S. stercoralis transcriptome up in parasitic females

14690 Astacin-like metalloendopeptidase 41 48 1.07E-43 3.58E-41

1377 Acetylcholinesterase 27 30 1.88E-30 3.15E-28

8106 Astacin-like metalloendopeptidase 34 62 7.97E-26 8.90E-24

29264 CAP domain-containing protein 18 19 1.34E-21 1.12E-19

64867 Hypothetical protein 17 19 1.95E-19 1.31E-17


9648 CAP domain-containing protein 16 18 2.99E-18 1.67E-16

68213 Hypothetical protein 14 14 6.38E-18 3.05E-16

17558 Astacin-like metalloendopeptidase 16 20 8.45E-17 3.54E-15

5599 Astacin-like metalloendopeptidase 16 24 1.02E-14 3.81E-13

144731 Hypothetical protein 9 9 9.09E-12 3.04E-10

18981 Trypsin Inhibitor-like, cysteine rich domain-containing protein 9 11 4.48E-10 1.25E-08

57757 Hypothetical protein 10 14 4.30E-10 1.25E-08

157174 Hypothetical protein 7 7 2.61E-09 6.23E-08

23892 CAP domain-containing protein 7 7 2.61E-09 6.23E-08

102788 Chromo domain/shadow and Chromo domain-like and Chromo domain-containing protein 9 13 5.22E-09 1.09E-07

14602 Prolyl endopeptidase 9 13 5.22E-09 1.09E-07

140709 Transthyretin-like family-containing protein 8 10 6.22E-09 1.23E-07

14778 Astacin-like metalloendopeptidase 8 11 2.16E-08 4.02E-07

131656 Hypothetical protein 6 6 4.40E-08 7.03E-07

160918 Hypothetical protein 6 6 4.40E-08 7.03E-07

162878 Hypothetical protein 6 6 4.40E-08 7.03E-07

22738 CAP domain-containing protein 8 12 6.14E-08 9.36E-07

72111 Hypothetical protein 6 8 1.11E-06 1.62E-05

12005 Astacin-like metalloendopeptidase 8 18 3.94E-06 5.50E-05

9888 Astacin-like metalloendopeptidase 5 6 4.24E-06 5.68E-05

4663 Phosphate-regulating neutral endopeptidase 6 10 7.51E-06 9.68E-05

27350 Astacin-like metalloendopeptidase 4 4 1.25E-05 1.56E-04

5781 Aspartic peptidase family and Aspartic peptidase domain-containing protein 5 9 7.65E-05 9.16E-04

3190 Astacin-like metalloendopeptidase 8 28 0.000164025 1.83E-03

8247 CAP domain-containing protein 8 28 0.000164025 1.83E-03

154276 Hypothetical protein 3 3 0.000211048 1.91E-03

158006 Transthyretin-like family-containing protein 3 3 0.000211048 1.91E-03

183313 Hypothetical protein 3 3 0.000211048 1.91E-03

183333 Hypothetical protein 3 3 0.000211048 1.91E-03


192153 Hypothetical protein 3 3 0.000211048 1.91E-03

194198 Hypothetical protein 3 3 0.000211048 1.91E-03

195283 Hypothetical protein 3 3 0.000211048 1.91E-03

112667 Hypothetical protein 4 8 0.000722761 6.21E-03

53213 Protein angel 4 8 0.000722761 6.21E-03

32246 M-phase inducer phosphatase family and Rhodanese-like domain-containing protein 3 4 0.000806591 6.59E-03

69551 Proteasomal ubiquitin receptor ADRM1 3 4 0.000806591 6.59E-03

S. stercoralis transcriptome up in iL3s

7102 Histone-lysine N-methyltransferase SETMAR 37 55 1.70E-11 4.02E-08

12314 Gamma-aminobutyric acid A receptor/Glycine receptor alpha family and Neurotransmitter-gated ion-channel transmembrane domain and Neurotransmitter-gated ion-channel family and Neurotransmitter-gated ion-channel ligand-binding domain-containing protein 10 10 6.91E-07 8.16E-04

7537 Gamma-aminobutyric acid A receptor/Glycine receptor alpha family and Neurotransmitter-gated ion-channel transmembrane domain and Neurotransmitter-gated ion-channel family and Neurotransmitter-gated ion-channel ligand-binding domain-containing protein 11 12 1.56E-06 1.23E-03

12518 Prolactin-releasing peptide receptor 9 9 2.86E-06 1.69E-03
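The enrichment procedure named in the caption of Supplementary Table 15 (a hypergeometric test per family, followed by Benjamini-Hochberg correction) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, and the background set and the size of each upregulated gene set are not given in the table, so any values passed in are the caller's assumptions.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the family.

    k: family members in the upregulated set; K: family size;
    n: size of the upregulated set; N: size of the background (assumed).
    """
    total = comb(N, n)
    upper = min(K, n)
    # Exact integer arithmetic; math.comb returns 0 for impossible draws.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, one P value would be computed per family with at least one member in the upregulated set, and the whole list would then be passed to `benjamini_hochberg`.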


Supplementary Table 16. Enrichment of gene ontology annotation terms among differentially expressed genes of S. ratti and S. stercoralis. Enrichment of molecular function (MF), biological process (BP) and cellular component (CC) gene ontology (GO) terms was calculated using the R package TopGO version 2.16.0, for genes upregulated in parasitic females, free-living females and iL3s, compared to all other genes in the genome, for S. ratti and S. stercoralis. Significant GO terms are listed (Fisher's exact test, P < 0.01, columns G and N); GO terms common to the same stage of the life cycle for S. ratti and S. stercoralis are highlighted in pink. Columns D and K are the number of genes annotated to a specific GO term genome-wide; columns E and L are the number of genes annotated to the same GO term but within the input data of interest; columns F and M are the number of genes in the input data of interest that are expected to be annotated to a GO term by chance. Supplementary Table 16 is an Excel file.
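The "expected by chance" columns (F and M) can be related to the other columns with a short sketch. It assumes the usual definition, in which the expected count is the genome-wide annotated count scaled by the fraction of the genome that is in the input set; the function names and example numbers are hypothetical and are not taken from the Excel file.

```python
def expected_count(annotated_genome_wide, input_size, genome_size):
    """Genes from the input set expected to carry a GO term by chance."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return annotated_genome_wide * input_size / genome_size

def fold_enrichment(observed, expected):
    """Ratio of observed to expected annotated genes in the input set."""
    return observed / expected if expected else float("inf")

# Hypothetical example: 200 of 10,000 genes carry the term; the input set has 500 genes.
print(expected_count(200, 500, 10_000))  # 10.0
```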


Supplementary Table 17. Results of LC-MS proteome analysis of S. ratti. (a) Proteome data for the parasitic female and free-living female stages of the S. ratti life cycle, where all peptides were used for identification of proteins, but only peptides unique to a single protein were used for quantification. (b) Proteins identified by the presence of only a single peptide were not used to calculate protein abundance and not included in further analyses. In (a) and (b) proteome abundance (iBAQ) was calculated from peak intensities identified by Progenesis. Relative protein abundance, measured by fold-change, between the two stages was considered significant if q < 0.05. The q value is an adjusted P value based on the false discovery rate. Column A shows the genes expressed significantly more in the respective life cycle stage; column B is the gene name; column H is the protein annotation; columns J-O and R-W are the raw data for biological replicates of free-living and parasitic samples (FL1, FL2, P1, P2) and technical triplicate analyses of each sample (01, 02, 03), with the respective means shown in columns P and X. (c) Enrichment of gene ontology (GO) terms analyzed by the R (version 3.1.2) package TopGO version 2.16.0, are shown for Molecular Function, Biological Processes and Cellular Components. GO enrichment is shown for (i) the pairwise comparison of proteins differentially expressed in parasitic females and free-living females, and (ii) differentially expressed proteins of parasitic or free-living females compared with all other predicted proteins in the S. ratti genome. Supplementary Table 17 is an Excel file.


Supplementary Table 18. Comparison of the proteome and the transcriptome of S. ratti. (a) The somatic and excretory/secretory (ES) proteome, and (b) the relationship of the genes and proteins differentially expressed between the parasitic and free-living females.

(a)

Columns: life cycle stage; somatic proteome [a]; upregulated somatic proteome; ES proteome [b]; combined proteome [c]; somatic and ES proteome overlap.

Parasitic 857 569 582 978 173

Free-living 697 409 569 882 96

a. The non-ES proteome.

b. Data from ref. 87, including additional data collected at the same time.

c. The combined upregulated somatic and ES proteome.


(b)

Columns: life cycle stage; transcriptome: number of differentially expressed genes [a]; proteome: number of differentially expressed proteins [a]; overlap between the differentially expressed transcriptome and the differentially expressed somatic proteome [b]; overlap between the differentially expressed transcriptome and the ES proteome.

Parasitic 909 569 53 (5.8%) [c] 119 (13.1%) [c]

Free-living 1470 409 151 (10.3%) [c] 151 (10.3%) [c]

a. From Supplementary Table 12, which is the somatic proteome only.

b. From Supplementary Figure 5.

c. This overlap expressed as a percentage of the differentially expressed transcriptome.
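The counts in parts (a) and (b) of Supplementary Table 18 can be cross-checked with simple arithmetic. The sketch below is an observation about the tabulated numbers, not a description of the authors' pipeline: the combined proteome in (a) equals the upregulated somatic proteome plus the ES proteome minus their overlap, and the percentages in (b) equal the overlap divided by the number of differentially expressed genes. The function names are hypothetical.

```python
def combined_proteome(upregulated_somatic, es, overlap):
    """Size of the union of two sets, given their sizes and their overlap."""
    return upregulated_somatic + es - overlap

def overlap_percent(overlap, n_transcriptome):
    """Overlap as a percentage of the differentially expressed transcriptome."""
    return round(100 * overlap / n_transcriptome, 1)

# Values copied from Supplementary Table 18 (a).
print(combined_proteome(569, 582, 173))  # parasitic: 978
print(combined_proteome(409, 569, 96))   # free-living: 882

# Values copied from Supplementary Table 18 (b).
print(overlap_percent(53, 909))    # 5.8
print(overlap_percent(119, 909))   # 13.1
print(overlap_percent(151, 1470))  # 10.3
```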


Supplementary Table 19. Analysis of the excretory/secretory (ES) proteome of S. ratti. (a) Proteome data of the parasitic female and free-living female ES. Proteins were only included if they had an FDR < 0.01 and at least two unique peptides. Protein abundance was calculated using the Exponentially Modified Protein Abundance Index (emPAI) (column F). (b) The proteins that are common to the parasitic and free-living female ES proteomes. Supplementary Table 19 is an Excel file.
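For readers unfamiliar with emPAI, the sketch below shows the index as it is commonly defined: ten raised to the ratio of observed to observable peptides, minus one. How observable peptides were counted for this dataset is not stated in the caption, so both inputs are assumptions supplied by the caller, and the function name is hypothetical.

```python
def empai(n_observed, n_observable):
    """emPAI = 10 ** (observed peptides / observable peptides) - 1."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    return 10 ** (n_observed / n_observable) - 1

# A protein with no observed peptides scores 0; one with every
# observable peptide detected scores 9.
print(empai(0, 10))   # 0.0
print(empai(10, 10))  # 9.0
```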


Supplementary Table 20. Clusters of physically adjacent genes upregulated in the same stage of the life cycle. For (a) S. ratti, (b) S. stercoralis, (c) S. ratti, excluding genes from the most common gene families, i.e. astacin-like metallopeptidase and SCP/TAPS coding genes, and (d) S. stercoralis, excluding genes from the most common gene families, i.e. astacin-like metallopeptidase and SCP/TAPS coding genes. Percentages are shown in parentheses. (e) Fisher's exact test P values for pairwise comparisons of the different stages of the life cycle (i.e. parasitic females, free-living females and iL3s), testing (i) (grey cells) whether genes upregulated in parasitic females were more likely to occur in clusters than genes upregulated in the other life cycle stages, and (ii) (white cells) whether parasitic female clusters were more likely to belong to the same Compara gene family than clusters of the other life cycle stages; for (ii) we compared the number of clusters with a common Compara gene family (free-living females were excluded from this analysis because of the low number of clusters in this stage of the life cycle). The P value for comparisons of all clusters (i.e. the data in tables (a) and (b)) is shown in each cell, and the P value for clusters excluding astacin and SCP/TAPS coding genes (i.e. the data in tables (c) and (d)) is shown in parentheses. Significant values after Bonferroni correction are in bold. s.d., standard deviation.

(a) S. ratti

Columns: life cycle stage; genes upregulated [a]; genes in clusters [b]; number of clusters; mean expected number of clusters [c] ± s.d. (P value); clusters with common Compara family; range of cluster sizes.

Parasitic females 717 222 (30.96%) 46 2.17 ± 1.50 (0.001) 39 (88.4%) 3-19

Free-living females 503 18 (3.58%) 6 0.81 ± 0.91 (0.001) 2 (33.3%) 3

iL3s 2646 686 (25.93%) 191 92.45 ± 8.42 (0.001) 19 (9.95%) 3-9

(b) S. stercoralis

Columns: life cycle stage; genes upregulated [a]; genes in clusters [b]; number of clusters; mean expected number of clusters [c] ± s.d. (P value); clusters with common Compara family; range of cluster sizes.

Parasitic females 808 273 (33.79%) 59 2.67 ±1.57 (0.001) 43 (72.9%) 3-16

Free-living females 354 6 (1.69%) 2 0.21 ± 0.47 (0.001) 0 3

iL3s 3097 1053 (34%) 275 122.96 ± 9.51 (0.001) 22 (8%) 3-14


(c) S. ratti

Columns: life cycle stage; genes upregulated [a]; genes in clusters [b]; number of clusters; mean expected number of clusters [c] ± s.d. (P value); clusters with common Compara gene families [d], excluding astacin-like and SCP/TAPS; range of cluster sizes.

Parasitic females 565 109 (19.29%) 26 1.10 ± 1.02 (0.001) 22 (84.6%) 3-14

Free-living females 498 18 (3.61%) 6 0.74 ± 0.83 (0.001) 2 (33.3%) 3

iL3s 2608 677 (25.96%) 190 89.02 ± 8.39 (0.001) 18 (9.47%) 3-7

(d) S. stercoralis

Columns: life cycle stage; genes upregulated [a]; genes in clusters [b]; number of clusters; mean expected number of clusters [c] ± s.d. (P value); clusters with common Compara gene families [d], excluding astacin-like and SCP/TAPS; range of cluster sizes.

Parasitic females 610 158 (25.90%) 36 1.12 ± 1.05 (0.001) 23 (63.89%) 3-16

Free-living females 352 6 (1.7%) 2 0.22 ± 0.46 (0.001) 0 (0%) 3

iL3s 3035 1006 (33.15%) 266 115.92 ± 8.97 (0.001) 18 (6.77%) 3-12

(e) P values

Columns: Parasitic females (S. ratti; S. stercoralis); Free-living females (S. ratti; S. stercoralis); iL3s (S. ratti; S. stercoralis). Cells are separated by semicolons.

Parasitic females: -; -; < 2.2e-16 (1.738e-15); < 2.2e-16 (1.114e-14)

Free-living females: < 2.2e-16 (< 2.2e-16); < 2.2e-16 (< 2.2e-16); -; -

iL3s: 0.007897 (1.059e-12); 0.9335 (0.00042); < 2.2e-16 (< 2.2e-16); < 2.2e-16 (< 2.2e-16)

a. Genes upregulated (based on RNA-seq data) in specific life cycle stage compared to both other life cycle stages.

b. Clusters defined as a minimum of three genes that are physically adjacent.

c. The number of clusters expected when the same number of genes in column “Genes upregulated” were randomly selected from the genome. This value is the mean of 1000 such randomizations.

d. Clusters where 50% or more genes were from the same Compara gene family (see Supplementary Table 8 for Compara gene families).
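Footnotes b and c describe two steps: finding runs of at least three physically adjacent upregulated genes, and estimating the expected number of such runs by random selection. A minimal sketch of both follows. It is an illustration under stated assumptions, not the authors' implementation: the genome is treated as a single ordered list of genes (chromosome and scaffold boundaries are ignored), and the function names, the seed and the example sizes are hypothetical.

```python
import random
from statistics import mean, stdev

def find_clusters(flags, min_size=3):
    """Return (start, end) index pairs for runs of >= min_size consecutive True flags."""
    clusters, start = [], None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_size:
                clusters.append((start, i - 1))
            start = None
    return clusters

def expected_clusters(n_genes, n_upregulated, n_randomizations=1000, seed=0):
    """Mean and s.d. of the cluster count when genes are picked at random."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_randomizations):
        chosen = set(rng.sample(range(n_genes), n_upregulated))
        flags = [i in chosen for i in range(n_genes)]
        counts.append(len(find_clusters(flags)))
    return mean(counts), stdev(counts), counts

# Toy example: genes 0-2 and 7-10 form clusters; genes 4-5 are too few.
toy = [True, True, True, False, True, True, False, True, True, True, True]
print(find_clusters(toy))  # [(0, 2), (7, 10)]
```

A real analysis would run `find_clusters` separately on each chromosome or scaffold so that runs cannot span assembly boundaries.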


Supplementary Table 21. Astacin and SCP/TAPS coding gene clusters. Clusters of all genes across the genome encoding (a) astacin-like metallopeptidases and (b) SCP/TAPS, in the genomes of four Strongyloides spp., Parastrongyloides trichosuri, Rhabditophanes sp. (all in bold) and eight outgroup species.

(a)

Columns: species; astacin [a] genes; % of astacin [a] genes in clusters [b]; number of clusters; number of clusters expected by chance (s.d.) [c].

S. ratti 184 45% 12 0.02 (0.14)

S. stercoralis 237 56% 19 0.06 (0.24)

S. papillosus 302 8% 5 0.12 (0.32)

S. venezuelensis 217 19% 11 0.02 (0.14)

P. trichosuri 387 17% 17 0.18 (0.38)

Rhabditophanes sp. 36 0% 0 0

A. suum 25 16% 1 0.00 (0.00)

B. malayi 19 0% 0 0.00 (0.00)

B. xylophilus 26 0% 0 0.00 (0.00)

C. elegans 40 0% 0 0.00 (0.00)

M. hapla 31 0% 0 0.00 (0.00)

N. americanus 82 16% 3 0.00 (0.00)

T. muris 14 0% 0 0.00 (0.00)

T. spiralis 16 0% 0 0.00 (0.00)

(b)

Columns: species; SCP/TAPS genes; % of SCP/TAPS genes in clusters [b]; number of clusters; number of clusters expected by chance (s.d.) [c].

S. ratti 89 55.6% 8 0.00 (0.00)

S. stercoralis 113 47.4% 11 0.00 (0.00)

S. papillosus 205 36.9% 15 0.02 (0.14)

S. venezuelensis 159 51.3% 16 0.00 (0.00)

P. trichosuri 51 11.5% 2 0.00 (0.00)

Rhabditophanes sp. 12 0.0% 0 0.00 (0.00)

A. suum 21 0.0% 0 0.00 (0.00)

B. malayi 8 37.5% 1 0.00 (0.00)

B. xylophilus 25 24.0% 1 0.00 (0.00)

C. elegans 36 8.3% 1 0.00 (0.00)

M. hapla 21 0.0% 0 0.00 (0.00)

N. americanus 138 6.5% 2 0.02 (0.14)

T. muris 28 0.0% 0 0.00 (0.00)

T. spiralis 15 0.0% 0 0.00 (0.00)

a. Astacin-like metallopeptidase protein-coding genes.

b. Clusters defined as a minimum of three genes that are physically adjacent.

c. The number of clusters expected when the same number of genes found across the genome for that gene family were randomly selected from the genome. This value is the mean (and s.d.) of 50 such randomizations.

s.d. – standard deviation


Supplementary Table 22. Results of analysis of gene clusters. Clusters co-expressed in the transcriptome of parasitic females, iL3s and free-living females of S. ratti and S. stercoralis. The members of a cluster were considered to share a common gene family where ≥50% of genes belonged to the same Compara gene family. Using this definition, some clusters are counted twice, e.g. where a cluster of four genes comprises two genes of one gene family and two genes of another gene family. Column C shows the total number of genes in clusters with a given common gene family, including genes that are not members of that gene family. Column D shows the range of the number of genes in a cluster and a cluster is required to have at least three genes. Supplementary Table 22 is an Excel file.
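The ≥50% rule in the caption of Supplementary Table 22, including the case in which one cluster is counted under two families, can be made concrete with a short sketch. The function name and the family labels "A", "B" and "C" are hypothetical; genes without a family assignment are represented by `None`, which is an assumption about how such genes would be encoded.

```python
from collections import Counter

def common_families(cluster_families, threshold=0.5):
    """Return families that account for at least `threshold` of a cluster's genes.

    cluster_families: one family ID per gene in the cluster (None if unassigned).
    """
    counts = Counter(f for f in cluster_families if f is not None)
    n = len(cluster_families)
    return sorted(f for f, c in counts.items() if c / n >= threshold)

# A four-gene cluster split two and two is counted under both families.
print(common_families(["A", "A", "B", "B"]))  # ['A', 'B']
print(common_families(["A", "A", "B"]))       # ['A']
print(common_families(["A", "B", "C"]))       # []
```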


Supplementary Table 23. Genomic libraries. (a) Illumina sequence data for genome assembly (Supplementary Notes 5 and 7); (b) Shotgun (SG) and paired-end (PE) 454 sequence data for S. ratti genome assembly (Supplementary Notes 5 and 7); (c) Capillary sequence data for S. ratti genome assembly (Supplementary Notes 5 and 7; data accessible from the Trace Archive using the sequencing library identifiers, SEQ_LIB_ID); (d) Illumina paired-end sequence data for sex-specific re-sequencing (Supplementary Notes 2, 3 and 5).

(a)

Columns: organism; mean insert size (bp); read length (bp); total yield (kb); ENA accession number; library type [a].

S. ratti 159 75 26,179,199 ERS018108 PCR-free

S. ratti 432 100 2,716,143 ERS193620 PCR-free

S. stercoralis 460 100 31,882,527 ERS055226 PCR-free

S. stercoralis 1799 100 23,799,465 ERS067019 Long-range mate-pair

S. papillosus 426 100 36,091,856 ERS055231 PCR-free

S. papillosus 2615 100 18,628,881 ERS067023 Long-range mate-pair

S. venezuelensis 195.3 100 11,438,945 DRX007648 Paired-end

S. venezuelensis 453 100 41,789,472 DRX007649 Paired-end

S. venezuelensis 3000 100 42,418,439 DRX007650 Long-range mate-pair

S. venezuelensis 5000 100 31,188,873 DRX007651 Long-range mate-pair

P. trichosuri 430 100 24,419,815 ERS056619 PCR-free

P. trichosuri 2308 100 23,165,070 ERS067639 Long-range mate-pair

Rhabditophanes sp. 327 100 28,411,136 ERS193619 PCR-free

a. PCR-free and Long-range mate-pair libraries are detailed in Supplementary Note 5.

(b)

Columns: library type; mean insert size (kb); total yield (Mb); ENA accession number.

PE 3 58.2 ERS499877

PE 3 349.3 ERS499877

PE 8 306.3 ERS499877

SG n/a 500.4 ERS499877


(c)

Vector Insert range (kb) Number of reads Trace archive SEQ_LIB_ID

pUC19 2-3 93,704 114582

pUC19 3-5 79,817 114583

pOTW12 2-3 238 116656

pOTW12 3-4 92,996 116657

pOTW12 4-5 99,567 116658

pMAQ1Sac_BstXI 7-10 69,922 116661

(d)

Columns: organism; stage and sex; mean insert size (bp); total yield (kb); ENA accession number.

S. ratti Infective L3s (female) 493 5,610,205 ERS364225

S. ratti Adult males 453 7,928,968 ERS364226

S. stercoralis Free-living females 376 8,363,162 ERS370813

S. stercoralis Free-living males 388 9,013,076 ERS370814

S. papillosus Infective L3s (female) 390 6,317,067 ERS364223

S. papillosus Adult males 354 9,297,269 ERS364224

P. trichosuri Adult females 427 8,750,020 ERS370812

P. trichosuri Adult males 419 7,959,037 ERS370815

S. ratti Mixed L1 and L2 418 7,123,859 ERS420118

S. stercoralis Infective L3 434 13,942,009 ERS420119

P. trichosuri Mixed L1 and L2 437 9,484,599 ERS420120


Supplementary Table 24. RNA sequencing data sets. For (a) S. ratti, (b) S. venezuelensis and (c) S. stercoralis. S. stercoralis RNA-seq data were previously published (ref. 3). For each sample, we indicate whether it was used for differential expression (DE) analysis and/or for creating hints for Augustus gene finding.

(a)

Columns: life cycle stage; number of reads; ENA experiment accession number; ENA sample accession number; ENA study accession number; insert size (bp); purpose.

Parasitic females 138,125,504 ERX272493, ERX272499 ERS209991 ERP002187 392 DE; Augustus

Parasitic females 81,540,240 ERX272494, ERX272500 ERS209992 ERP002187 392 DE; Augustus

Parasitic females 123,033,982 ERX272495, ERX272501 ERS209993 ERP002187 391 DE; Augustus

Free-living females 89,926,696 ERX272496, ERX272502 ERS209994 ERP002187 402 DE; Augustus

Free-living females 142,331,288 ERX272497, ERX272503 ERS209995 ERP002187 408 DE; Augustus

Free-living females 78,396,490 ERX272498, ERX272504 ERS209996 ERP002187 397 DE; Augustus

Free-living adults 144,969,604 ERX200443 ERS091917 ERP001672 304 Augustus

iL3s 11,719,402 ERX200444 ERS092590 ERP001672 411 DE; Augustus


(b)

Columns: life cycle stage; number of reads; DRA experiment accession number; DRA sample accession number; DRA study accession number; insert size (bp); purpose.

Eggs 75,638,760 DRX026342 SAMD00024931 PRJDB3457 178 Augustus

Eggs 59,672,954 DRX026505 SAMD00024931 PRJDB3457 160 Augustus

First stage larvae 96,440,828 DRX026493 SAMD00024932 PRJDB3457 171 Augustus

First stage larvae 63,409,282 DRX026494 SAMD00024932 PRJDB3457 165 Augustus

iL3s 98,931,638 DRX026495 SAMD00024935 PRJDB3457 173 Augustus

iL3s 46,562,246 DRX026496 SAMD00024935 PRJDB3457 161 Augustus

Lung-iL3 90,538,632 DRX026497 SAMD00024933 PRJDB3457 180 Augustus

Lung-iL3 36,745,178 DRX026498 SAMD00024933 PRJDB3457 162 Augustus

Young parasitic females 86,606,902 DRX026499 SAMD00024934 PRJDB3457 163 Augustus

Young parasitic females 47,368,324 DRX026500 SAMD00024934 PRJDB3457 167 Augustus

Parasitic females 107,758,646 DRX026501 SAMD00024930 PRJDB3457 167 Augustus

Parasitic females 44,404,224 DRX026502 SAMD00024930 PRJDB3457 163 Augustus

Induced-iL3s 1 day 93,904,264 DRX026503 SAMD00024936 PRJDB3457 177 Augustus

Induced-iL3s 5 day 76,859,740 DRX026504 SAMD00024937 PRJDB3457 179 Augustus


(c)

Columns: life cycle stage; number of reads; ENA experiment accession number; ENA sample accession number; ENA study accession number; insert size (bp); purpose.

Young gravid free-living females with 1-6 eggs per gonadal arm 127,414,598 ERX122884 ERS152080 ERP001556 170 ± 50 DE; Augustus

Young gravid free-living females with 2-6 eggs per gonadal arm 124,169,962 ERX122876 ERS152073 ERP001556 170 ± 50 DE

Young gravid free-living females with 2-10 eggs per gonadal arm 119,384,936 ERX122888 ERS152084 ERP001556 170 ± 50 DE

Tissue migrating iL3s 68,856,031 ERX122882 ERS152078 ERP001556 170 ± 50 Augustus

Young iL3s 100,763,662 ERX122883 ERS152079 ERP001556 170 ± 50 DE; Augustus

iL3s 82,411,800 ERX122871 ERS152067 ERP001556 170 ± 50 DE

iL3s 81,413,132 ERX122880 ERS152076 ERP001556 170 ± 50 DE

Gravid parasitic females 87,340,720 ERX122870 ERS152066 ERP001556 170 ± 50 DE; Augustus

Gravid parasitic females 86,752,130 ERX122878 ERS152074 ERP001556 170 ± 50 DE

Gravid parasitic females 92,282,446 ERX122876 ERS152072 ERP001556 170 ± 50 DE

Post free-living first stage larvae 75,330,900 ERX122869 ERS152065 ERP001556 170 ± 50 Augustus

Post parasitic first stage larvae 70,262,364 ERX122881 ERS152077 ERP001556 170 ± 50 Augustus

Post parasitic indirectly developing third stage larvae 48,613,492 ERX122887 ERS152083 ERP001556 170 ± 50 Augustus


SUPPLEMENTARY FIGURES


Supplementary Figure 1. Parsimony analysis of conserved intron regions.

Intron history in nematodes. At each node, the number of common intron gains (+) and losses (-), derived from the DOLLOP programme of the PHYLIP package (ref. 104), is shown. A boxplot of exon length is shown for eight pan-phylum outgroup species (blue) and the six sequenced species of the present study (red). Boxes represent the interquartile range (IQR) and whiskers represent the 75th percentile plus 1.5 times the IQR.
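The whisker definition in the caption of Supplementary Figure 1 (75th percentile plus 1.5 times the IQR) can be written out as a small sketch. Quantile conventions differ between software packages, and the one the authors used is not stated; the "inclusive" method of Python's `statistics.quantiles` is an assumption here, and the function name and example values are hypothetical.

```python
from statistics import quantiles

def upper_whisker_limit(values):
    """Return Q3 + 1.5 * IQR for a sequence of exon lengths."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)

# Toy data 1..9: Q1 = 3, Q3 = 7, IQR = 4, so the limit is 7 + 6 = 13.
print(upper_whisker_limit([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 13.0
```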


Supplementary Figure 2. The distribution of differentially upregulated genes across the S. ratti and S. stercoralis genomes.

For (a) S. ratti and (b) S. stercoralis, the distribution (in 100 kb bins) across the genome of all predicted genes in the genome, and of those upregulated in parasitic females (red), iL3s (green) and free-living females (blue) (with the distribution of the upregulated genes shown as a proportion of all genes upregulated in each of these relevant stages). Scaffolds and contigs that were not assigned to a chromosome are not shown. Scaffolds and contigs were excluded if they were smaller than 150 kb. After these exclusions, n values were as follows. S. ratti: genome, n=11596; parasitic females, n=659; free-living females, n=476; iL3, n=2552. S. stercoralis: genome, n=8991; parasitic females, n=560; free-living females, n=241; iL3, n=2192. Tick marks below the x-axis represent scaffold boundaries. For S. ratti, chromosomes I and II were each arranged as a single scaffold; the 10 scaffolds which made up chromosome X were arranged in descending size order. The average gene densities on the S. ratti chromosomes were 307.6 genes per Mb on chromosome I, 320.1 genes per Mb on chromosome II, and 223.1 genes per Mb on chromosome X scaffolds (taking all 10 scaffolds of chromosome X).

[Figure panels: (a) S. ratti and (b) S. stercoralis, each with tracks for iL3, parasitic female, genome and free-living female along chromosome I, chromosome II and chromosome X; y-axis, proportion of genes.]
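The binning described in the caption of Supplementary Figure 2 (gene counts in 100 kb bins, expressed as a proportion of all genes in the set) might be sketched as follows. Assigning each gene to a bin by its start coordinate is an assumption, since the caption does not say how genes spanning a bin boundary were handled; the function name and coordinates are hypothetical.

```python
from collections import Counter

def binned_proportions(gene_starts, bin_size=100_000):
    """Map bin index to the proportion of the supplied genes starting in that bin."""
    counts = Counter(start // bin_size for start in gene_starts)
    total = len(gene_starts)
    return {b: c / total for b, c in sorted(counts.items())}

# Four hypothetical genes on one chromosome: two in the first bin,
# one in each of the next two.
print(binned_proportions([10, 50_000, 150_000, 250_000]))
# {0: 0.5, 1: 0.25, 2: 0.25}
```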


Supplementary Figure 3. Comparison of gene and repeat distribution in S. ratti and C. elegans chromosomes.

For (a) S. ratti and (b) C. elegans, the proportion of base pairs attributed to different features in 100 kb windows was plotted along each chromosome on a scale of 0 to 1. We show genes (light blue); genes with BLAST hits with ≥ 60% amino acid identity to Saccharomyces cerevisiae (SGD v64-2-1) genes (dark blue), scaled to the maximum value for the respective chromosome; inverted repeats (magenta) and tandem repeats (red), also scaled to the maximum value for the respective chromosome. Inverted repeats were identified using Inverted Repeats Finder v3.07 (ref. 91) with options 2, 3, 5, 80, 10, 40, 500000, 10000, -d, -h, -t4 74, -t5 493, -t7 10000. Tandem repeats were identified using Tandem Repeats Finder (ref. 90) with options 2, 1000, 1000, 80, 10, 25, 1000. S. ratti chromosome X was excluded from the analysis because it is incompletely assembled and therefore its structure cannot be analyzed.

[Figure: (a) S. ratti chromosomes I and II; (b) C. elegans chromosomes I, II, III, IV, V and X. Each panel plots feature density against chromosomal position (Mb). Legend: genes, yeast similarities, inverted repeats, tandem repeats.]

For S. ratti and C. elegans, gene density is approximately constant across the chromosomes. However, genes with significant similarity to those of yeast are more skewed to chromosome centers in C. elegans than in S. ratti. Most strikingly, tandem repeats were found evenly throughout S. ratti chromosomes, but in C. elegans were skewed to the chromosome arms. Inverted repeats showed a similar pattern to tandem repeats, although in S. ratti their coverage appeared less even than that of tandem repeats, with slightly increased coverage towards telomeres.

Supplementary Figure 4. The gain and loss of nematode gene families.

[Figure: phylogeny with tips Parastrongyloides trichosuri, Brugia malayi, Rhabditophanes sp. KR3021, Strongyloides venezuelensis, Ascaris suum, Strongyloides papillosus, Strongyloides ratti, Trichuris muris, Trichinella spiralis, Necator americanus, Meloidogyne hapla, Caenorhabditis elegans, Bursaphelenchus xylophilus and Strongyloides stercoralis (listed in extraction order, not tree order). Branch labels give the number of gene families originating on each branch; all node support values shown are 100; scale bar, 0.3. Histograms of duplications and losses, for genes and for families, are drawn on an axis from 0 to 5,000.]

A phylogeny of the six species and eight outgroup species, annotated with the number of gene families appearing along each branch of the phylogeny (+values on each branch) and histograms showing the number of duplications (blue) and losses (red) for individual genes (dark blue or red) and for families (light blue or red) as estimated using Ensembl Compara. Values on nodes are the number of bootstrap replicate trees (out of 100) showing the split induced by the node. Phylogeny is based on the same analysis and alignment as that shown in main text Figure 1.

Supplementary Figure 5. The transcriptome and proteome of S. ratti.

[Figure: panels (a) to (i). Venn diagrams compare the parasitic and free-living female proteomes, and the transcriptome and proteome of parasitic females and of free-living females. Venn diagram counts, in extraction order: 569, 288, 409; 856, 516, 53; 811, 718, 98; 1319, 151, 258; 1251, 219, 428. Scatter plots show iBAQ against RPKM, and proteome log2 fold change against transcriptome log2 fold change, for free-living and parasitic females. The bar chart in (i) shows the No. of genes/proteins in the parasitic proteome, free-living proteome, parasitic transcriptome and free-living transcriptome for the following families: Acetylcholinesterase; Aspartic peptidase; Astacin-like metallopeptidase; Calmodulin-like protein 3; Carboxylesterase, type B domain-containing protein; Collagen alpha-5(IV) chain; Domain of unknown function DB domain-containing protein; Domain of unknown function DUF148 domain-containing protein; G protein-coupled receptor, rhodopsin-like family; Galectin; Glycoside hydrolase; Histidine phosphatase superfamily, clade-2-containing protein; Lipase, class 3 family-containing protein; Nematode cuticle collagen, N-terminal domain; Prolyl endopeptidase; Protein lethal(2)essential for life; Protein-tyrosine phosphatase; SCP/TAPS; ShKT domain-containing protein; Transcription factor HNF-4 homolog; Transthyretin-like family-containing protein; Trypsin Inhibitor-like; UDP-glucuronosyl/UDP-glucosyltransferase.]

Summary of the S. ratti proteome data. Venn diagrams represent the number of genes / proteins upregulated, and their overlap, between (a) the parasitic and free-living female proteomes; (b) the parasitic female transcriptome and proteome; (c) as (b) but excluding proteins identified by only a single peptide; (d) the free-living female transcriptome and proteome; (e) as (d) but excluding proteins identified by only a single peptide. A total of 1,266 proteins were identified with > 1 peptide. A further 675 proteins were identified with only a single peptide. The protein abundance (iBAQ) and transcript abundance (RPKM) were compared for (f) those proteins whose gene was identified in the transcriptome, which were positively correlated (Pearson product-moment correlation coefficient: parasitic females, r = 0.48, t = 19.4629, df = 1260, P < 2.2e-16; free-living females, r = 0.52, t = 21.8487, df = 1260, P < 2.2e-16), shown for parasitic (red squares, red line of best fit), and free-living females (blue diamonds, blue line of best fit), and (g) proteins whose gene was also upregulated in the same life cycle stage in the transcriptome, which were positively correlated (Pearson product-moment correlation coefficient: parasitic females, r = 0.70, t = 7.0058, df = 51, P = 5.33e-09; free-living females, r = 0.26, t = 3.2957, df = 149, P = 0.001227). Both the iBAQ and RPKM measurements account for protein or gene length. (h) The log2 fold change of transcripts and of proteins upregulated in parasitic females and in free-living females, which were positively correlated (Pearson product-moment correlation coefficient: parasitic females r = 0.57, t = 4.9742, df = 51, P = 7.822e-06; free-living females r = 0.58, t = 8.69, df = 149, P = 5.995e-15). The negative fold change values are genes / proteins present at a greater level in free-living females; positive values are those present at a greater level in parasitic females. 
(i) The most common protein and protein-coding gene families upregulated in the proteome and transcriptome of parasitic and free-living females are shown. Only genes found to be significantly upregulated in pairwise comparisons of parasitic and free-living females are included, as determined by edgeR (transcriptome; FDR < 0.001, fold change > 2) and ANOVA (proteome; q < 0.05). Protein and protein-coding gene families with fewer than 5 proteins / genes in at least one category are not shown. FDR, false discovery rate.
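The t values reported with each Pearson correlation in this legend follow from r and the degrees of freedom through the standard relation t = r √(df / (1 − r²)). The sketch below is illustrative only: the helper names pearson_r and t_statistic are hypothetical, and it does not reproduce the statistical software used for the figure.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, df: int) -> float:
    """t statistic for testing r against zero, with df = n - 2 degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))


# Rounded values from panel (g), parasitic females: r = 0.70, df = 51
t_parasitic = t_statistic(0.70, 51)
```

With the rounded r = 0.70 and df = 51 this gives a t of about 7.0, close to the reported 7.0058; the small difference is expected because the legend reports r to two decimal places.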

Supplementary Figure 6. Gene clustering in S. ratti and S. stercoralis.

(a) The number of clusters increases with the number of genes in a given data set, both as observed for S. ratti and S. stercoralis (black triangles) and when the same number of genes is selected randomly from the genome, based on 100 randomizations (mean ± s.d.) (grey circles). Data points are based on the data shown in Supplementary Table 20, where each data point represents all clusters for a stage of the life cycle (parasitic females, free-living females and iL3), either including or excluding data on astacin-like metallopeptidases and SCP/TAPS. (b) Intergenic distances for genes across the whole S. ratti genome (‘Genome’) (n=12,338), genes in clusters upregulated in the parasitic female stage (‘Clusters’) (n=169), and genes in clusters upregulated in the parasitic stage that comprise ≥ 50% of genes from the same gene family (‘Gene family clusters’) (n=157) were not significantly different (ANOVA, F value = 1.801, P = 0.65). Boxplots represent median (horizontal black line) and interquartile range (box), and the range of data points excluding extreme outliers (whiskers). Genes unassigned to a chromosome were excluded from the analysis. s.d., standard deviation.
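The randomization in panel (a) can be sketched as follows. The cluster definition used here is an assumption made for illustration only: genes are indexed by their order along a single linear genome, and a cluster is a run of at least min_size selected genes that are adjacent in that order. The function names and the min_size default are hypothetical and do not restate the criteria used to generate Supplementary Table 20.

```python
import random
import statistics
from typing import Iterable, Tuple


def count_clusters(selected_positions: Iterable[int], min_size: int = 3) -> int:
    """Count runs of >= min_size selected genes that are adjacent in gene order."""
    clusters = 0
    run = 0
    previous = None
    for pos in sorted(set(selected_positions)):
        if previous is not None and pos == previous + 1:
            run += 1  # extends the current run of adjacent genes
        else:
            if run >= min_size:
                clusters += 1  # close the previous run
            run = 1
        previous = pos
    if run >= min_size:
        clusters += 1  # close the final run
    return clusters


def random_cluster_counts(n_genes_total: int, n_selected: int,
                          n_randomizations: int = 100, min_size: int = 3,
                          seed: int = 0) -> Tuple[float, float]:
    """Mean and s.d. of cluster counts when genes are drawn at random."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    counts = []
    for _ in range(n_randomizations):
        sample = rng.sample(range(n_genes_total), n_selected)
        counts.append(count_clusters(sample, min_size))
    return statistics.mean(counts), statistics.stdev(counts)


# Toy example: two runs of three or more adjacent genes
observed = count_clusters([1, 2, 3, 10, 11, 20, 21, 22, 23])
expected_mean, expected_sd = random_cluster_counts(1000, 100)
```

An observed count can then be compared with the random expectation for the same number of genes, which is the comparison between black triangles and grey circles in the panel.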

[Figure: (a) No. of clusters (0 to 300) plotted against No. of genes (0 to 4,000); (b) boxplots of intergenic distance (bp; 0 to 4,000) for Genome, Clusters and Gene family clusters.]

Supplementary Figure 7. The chromosome number of Rhabditophanes sp.

Dissected Rhabditophanes sp. (KR3021) gonad. Distal is to the left, proximal to the right. The left inset shows condensed chromosomes in an oocyte, the right inset condensed chromosomes in an early embryo. The gonads were dissected and stained with DAPI as described (ref. 105). Condensed chromosomes in oocytes and early embryos of seven different worms were counted using 3D reconstructions from confocal optical sections. Shown is a projection; the size bar is 50 µm. The chromosome number is 5 (meiotic bivalents) in oocytes and 10 in embryos. Although the 10 embryonic chromosomes cannot be counted in this particular overview projection, it illustrates that there are more than in the oocyte. KR3021 had been described as parthenogenetic (ref. 106) or as gonochoristic (ref. 107). We did not observe any males. Females maintained individually from early larval stages successfully reproduced; sperm were not observed in the females. Together these observations are consistent with Rhabditophanes sp. (KR3021) reproducing by meiotic parthenogenesis (as in ref. 106) with n=5 chromosomes.

URLS

RepeatModeler, http://www.repeatmasker.org/RepeatModeler.html/; TransposonPSI, http://transposonpsi.sourceforge.net/; RepeatMasker, http://www.repeatmasker.org/; UniProt's protein naming guidelines, http://www.uniprot.org/docs/nameprot/; Evidence Code Ontology, http://www.evidenceontology.org/; Gene3D database, http://gene3d.biochem.ucl.ac.uk/Gene3D/; Trace Archive, http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?&cmd=retrieve&val=CENTER_PROJECT%20%3D%20%27RATTI%27%20and%20CENTER_NAME%20%3D%20%22SC%22/; SMALT, www.sanger.ac.uk/resources/software/smalt/; Roche protocols, www.454.com/.

SUPPLEMENTARY REFERENCES

1. Viney, M.E. Developmental switching in the parasitic nematode Strongyloides ratti. Proc. R. Soc. Lond. Ser. B 263, 201-8 (1996).

2. Viney, M.E., Matthews, B.E. & Walliker, D. Mating in the nematode parasite Strongyloides ratti: proof of genetic exchange. Proc. R. Soc. Lond. Ser. B 254, 213-9 (1993).

3. Stoltzfus, J.D., Massey, H.C., Jr., Nolan, T.J., Griffith, S.D. & Lok, J.B. Strongyloides stercoralis age-1: a potential regulator of infective larval development in a parasitic nematode. PLoS One 7, e38587 (2012).

4. Stoltzfus, J.D., Minot, S., Berriman, M., Nolan, T.J. & Lok, J.B. RNAseq analysis of the parasitic nematode Strongyloides stercoralis reveals divergent regulation of canonical dauer pathways. PLoS Negl. Trop. Dis. 6, e1854 (2012).

5. Lok, J.B. Strongyloides stercoralis: a model for translational research on parasitic nematode biology. WormBook, 1-18 (2007).

6. Eberhardt, A.G., Mayer, W.E. & Streit, A. The free-living generation of the nematode Strongyloides papillosus undergoes sexual reproduction. Int. J. Parasitol. 37, 989-1000 (2007).

7. Hino, A. et al. Karyotype and reproduction mode of the rodent parasite Strongyloides venezuelensis. Parasitology 141, 1736-45 (2014).

8. Barriere, A. & Felix, M.A. Isolation of C. elegans and related nematodes. WormBook, 1-19 (2014).

9. Grant, W.N. et al. Parastrongyloides trichosuri, a nematode parasite of mammals that is uniquely suited to genetic analysis. Int. J. Parasitol. 36, 453-66 (2006).

10. Kulkarni, A., Dyka, A., Nemetschke, L., Grant, W.N. & Streit, A. Parastrongyloides trichosuri suggests that XX/XO sex determination is ancestral in Strongyloididae (Nematoda). Parasitology 140, 1822-30 (2013).

11. Stiernagle, T. Maintenance of C. elegans. WormBook, 1-11 (2006).

12. Nemetschke, L., Eberhardt, A.G., Hertzberg, H. & Streit, A. Genetics, chromatin diminution, and sex chromosome evolution in the parasitic nematode genus Strongyloides. Curr. Biol. 20, 1687-96 (2010).

13. Bonfield, J.K. & Whitwham, A. Gap5 - editing the billion fragment sequence assembly. Bioinformatics 26, 1699-703 (2010).

14. Quinlan, A.R. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr. Protoc. Bioinformatics 47, 11.12.1-11.12.34 (2014).

15. Thompson, F.J., Barker, G.L., Nolan, T., Gems, D. & Viney, M.E. Transcript profiles of long- and short-lived adults implicate protein synthesis in evolved differences in ageing in the nematode Strongyloides ratti. Mech. Ageing Dev. 130, 167-72 (2009).

16. Baek, B.K., Islam, M.K. & Kim, J.H. Development of an in vitro culture method for harvesting the free-living infective larvae of Strongyloides venezuelensis. Korean J. Parasitol. 36, 15-22 (1998).

17. Kozarewa, I. et al. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat. Methods 6, 291-5 (2009).

18. Park, N., Shirley, L., Gu, Y., Keane, T.M., Swerdlow, H. & Quail, M. An improved approach to mate-paired library preparation for Illumina sequencing. Methods in Next-Generation Sequencing 1, 10 (2013).

19. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-80 (2005).

20. Simpson, J.T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117-23 (2009).

21. Bonfield, J.K., Smith, K. & Staden, R. A new DNA sequence assembly program. Nucleic Acids Res. 23, 4992-9 (1995).

22. Delcher, A.L., Phillippy, A., Carlton, J. & Salzberg, S.L. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30, 2478-83 (2002).

23. Tsai, I.J., Otto, T.D. & Berriman, M. Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol. 11, R41 (2010).

24. Otto, T.D., Sanders, M., Berriman, M. & Newbold, C. Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology. Bioinformatics 26, 1704-7 (2010).

25. Zerbino, D.R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821-9 (2008).

26. Hunt, M. et al. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 14, R47 (2013).

27. Nemetschke, L., Eberhardt, A.G., Viney, M.E. & Streit, A. A genetic map of the animal-parasitic nematode Strongyloides ratti. Mol. Biochem. Parasitol. 169, 124-7 (2010).

28. Simpson, J.T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549-56 (2012).

29. Gremme, G., Steinbiss, S. & Kurtz, S. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans. Comput. Biol. Bioinform. 10, 645-56 (2013).

30. Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578-9 (2011).

31. Boetzer, M. & Pirovano, W. Toward almost closed genomes with GapFiller. Genome Biol. 13, R56 (2012).

32. Kajitani, R. et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 24, 1384-95 (2014).

33. Huang, S. et al. HaploMerger: reconstructing allelic relationships for polymorphic diploid genome assemblies. Genome Res. 22, 1581-8 (2012).

34. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-402 (1997).

35. Parra, G., Bradnam, K., Ning, Z., Keane, T. & Korf, I. Assessing the gene space in draft genomes. Nucleic Acids Res. 37, 289-97 (2009).

36. Edgar, R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460-1 (2010).

37. Stanke, M., Schoffmann, O., Morgenstern, B. & Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006).

38. Slater, G.S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).

39. Otto, T.D., Dillon, G.P., Degrave, W.S. & Berriman, M. RATT: Rapid Annotation Transfer Tool. Nucleic Acids Res. 39, e57 (2011).

40. Carver, T., Harris, S.R., Berriman, M., Parkhill, J. & McQuillan, J.A. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 28, 464-9 (2012).

41. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562-78 (2012).

42. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).

43. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511-5 (2010).

44. Parkinson, J., Whitton, C., Schmid, R., Thomson, M. & Blaxter, M. NEMBASE: a resource for parasitic nematode ESTs. Nucleic Acids Res. 32, D427-30 (2004).

45. Benson, D.A. et al. GenBank. Nucleic Acids Res. 43, D30-5 (2015).

46. Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).

47. Smith, C.D. et al. Improved repeat identification and masking in Dipterans. Gene 389, 1-9 (2007).

48. Kohany, O., Gentles, A.J., Hankus, L. & Jurka, J. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics 7, 474 (2006).

49. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435-9 (2006).

50. Ter-Hovhannisyan, V., Lomsadze, A., Chernoff, Y.O. & Borodovsky, M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 18, 1979-90 (2008).

51. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).

52. She, R. et al. genBlastG: using BLAST searches to build homologous gene models. Bioinformatics 27, 2141-3 (2011).

53. Harris, T.W. et al. WormBase 2014: new views of curated biology. Nucleic Acids Res. 42, D789-93 (2014).

54. Nakamura, Y., Cochrane, G., Karsch-Mizrachi, I. & International Nucleotide Sequence Database Collaboration. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 41, D21-4 (2013).

55. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204-12 (2015).

56. Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061-7 (2007).

57. Fischer, S. et al. Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Curr. Protoc. Bioinformatics Chapter 6, Unit 6.12.1-19 (2011).

58. Finn, R.D., Clements, J. & Eddy, S.R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29-37 (2011).

59. Logan-Klumpler, F.J. et al. GeneDB - an annotation database for pathogens. Nucleic Acids Res. 40, D98-108 (2012).

60. Vilella, A.J. et al. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 19, 327-35 (2009).

61. Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43, D213-21 (2015).

62. Blaxter, M.L. et al. A molecular evolutionary framework for the phylum Nematoda. Nature 392, 71-5 (1998).

63. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059-66 (2002).

64. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312-3 (2014).

65. Keller, O., Odronitz, F., Stanke, M., Kollmar, M. & Waack, S. Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species. BMC Bioinformatics 9, 278 (2008).

66. Hammesfahr, B., Odronitz, F., Muhlhausen, S., Waack, S. & Kollmar, M. GenePainter: a fast tool for aligning gene structures of eukaryotic protein families, visualizing the alignments and mapping gene structures onto protein structures. BMC Bioinformatics 14, 77 (2013).

67. Felsenstein, J. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5, 164-166 (1989).

68. Haas, B.J., Delcher, A.L., Wortman, J.R. & Salzberg, S.L. DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20, 3643-6 (2004).

69. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).

70. Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639-45 (2009).

71. Okimoto, R., Macfarlane, J.L., Clary, D.O. & Wolstenholme, D.R. The mitochondrial genomes of two nematodes, Caenorhabditis elegans and Ascaris suum. Genetics 130, 471-98 (1992).

72. Hu, M., Chilton, N.B. & Gasser, R.B. The mitochondrial genome of Strongyloides stercoralis (Nematoda) - idiosyncratic gene order and evolutionary implications. Int. J. Parasitol. 33, 1393-408 (2003).

73. Hahn, C., Bachmann, L. & Chevreux, B. Reconstructing mitochondrial genomes directly from genomic next-generation sequencing reads - a baiting and iterative mapping approach. Nucleic Acids Res. 41, e129 (2013).

74. Bernt, M. et al. MITOS: improved de novo metazoan mitochondrial genome annotation. Mol. Phylogenet. Evol. 69, 313-9 (2013).

75. Katoh, K. & Standley, D.M. MAFFT: iterative refinement and additional methods. Methods Mol. Biol. 1079, 131-46 (2014).

76. Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res. 43, D1049-56 (2015).

77. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236-40 (2014).

78. Alexa, A., Rahnenfuhrer, J. & Lengauer, T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22, 1600-7 (2006).

79. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-9 (2009).

80. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-2 (2010).

81. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-40 (2010).

82. Rawlings, N.D., Barrett, A.J. & Bateman, A. MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 40, D343-50 (2012).

83. Chang, J.M., Di Tommaso, P. & Notredame, C. TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol. Biol. Evol. 31, 1625-37 (2014).

84. Abascal, F., Zardoya, R. & Posada, D. ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21, 2104-5 (2005).

85. Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-90 (2006).

86. Schwanhausser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337-42 (2011).

87. Soblik, H. et al. Life cycle stage-resolved proteomic analysis of the excretome/secretome from Strongyloides ratti - identification of stage-specific proteases. Mol. Cell. Proteomics 10, M111.010157 (2011).

88. Chambers, M.C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30, 918-20 (2012).

89. Ishihama, Y. et al. Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein. Mol. Cell. Proteomics 4, 1265-72 (2005).

90. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573-80 (1999).

91. Warburton, P.E., Giordano, J., Cheung, F., Gelfand, Y. & Benson, G. Inverted repeat structure of the human genome: the X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Res. 14, 1861-9 (2004).

92. Cherry, J.M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700-5 (2012).

93. Bolla, R.I. & Roberts, L.S. Gametogenesis and chromosomal complement in Strongyloides ratti (Nematoda: Rhabdiasoidea). J. Parasitol. 54, 849-55 (1968).

94. Hammond, M.P. & Robinson, R.D. Chromosome complement, gametogenesis, and development of Strongyloides stercoralis. J. Parasitol. 80, 689-95 (1994).

95. Albertson, D.G., Nwaorgu, O.C. & Sulston, J.E. Chromatin diminution and a chromosomal mechanism of sexual differentiation in Strongyloides papillosus. Chromosoma 75, 75-87 (1979).

96. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012-8 (1998).

97. Blaxter, M. Genes and genomes of Necator americanus and related hookworms. Int. J. Parasitol. 30, 347-55 (2000).

98. Liu, Q.L. & Williamson, V.M. Host-Specific Pathogenicity and Genome Differences between Inbred Strains of Meloidogyne hapla. J. Nematol. 38, 158-64 (2006).

99. Hammond, M.P. & Bianco, A.E. Genes and genomes of parasitic nematodes. Parasitol. Today 8, 299-305 (1992).

100. Goldstein, P. & Moens, P.B. Karyotype analysis of Ascaris lumbricoides var. suum. Male and female pachytene nuclei by 3-D reconstruction from electron microscopy of serial sections. Chromosoma 58, 101-11 (1976).

101. Sakaguchi, Y., Tada, I., Ash, L.R. & Aoki, Y. Karyotypes of Brugia pahangi and Brugia malayi (Nematoda: Filarioidea). J. Parasitol. 69, 1090-3 (1983).

102. Kikuchi, T. et al. Genomic insights into the origin of parasitism in the emerging plant pathogen Bursaphelenchus xylophilus. PLoS Pathog. 7, e1002219 (2011).

103. Spakulova, M., Kralova, I. & Cutillas, C. Studies on the karyotype and gametogenesis in Trichuris muris. J. Helminthol. 68, 67-72 (1994).

104. Felsenstein, J. Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 266, 418-27 (1996).

105. Kulkarni, A., Holz, A., Rodelsperger, C., Harbecke, D. & Streit, A. Differential chromatin amplification and chromosome complements in the germline of Strongyloididae (Nematoda). Chromosoma, Advanced online publication, doi:10.1007/s00412-015-0532-y (2015).

106. Felix, M.A. et al. Evolution of vulva development in the Cephalobina (Nematoda). Dev. Biol. 221, 68-86 (2000).

107. Dorris, M., Viney, M.E. & Blaxter, M.L. Molecular phylogenetic analysis of the genus Strongyloides and related nematodes. Int. J. Parasitol. 32, 1507-17 (2002).
