Genomics of the capybara, two emblematic Colombian species

María José Gómez-Hughes¹, Santiago Herrera-Álvarez¹,², Andrew J. Crawford¹

¹Department of Biological Sciences, Universidad de los Andes, Bogotá, 111711, Colombia.

²Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA.

Abstract

Capybaras, which are native to South America, are not only the largest rodents in the world, but they

also have a number of other characteristics that make them unique. They are semi-aquatic, grazing

mammals and live in large groups where females engage in communal breeding. Males communally

defend the territory through scent-marking with a specialized gland called the morillo and with two

anal glands. Here we present the first genome assembly and annotation for the lesser capybara,

Hydrochoerus isthmius, as well as the first transcriptome assembly for the capybara, H. hydrochaeris,

both of which are comparable in completeness with previously published rodent genomes, and

compare them with the previously published genome assembly for the capybara. We found evidence

of reductions in the effective population size of both species, as well as large regions of genomic

rearrangement relative to the guinea pig. Our phylogenetic analysis is consistent with previous phylogenies

reported for the suborder Hystricomorpha, but at the species level there is evidence that the capybara

is paraphyletic. We hope that this study contributes to conservation efforts for these

species, as well as to a better understanding of all the characteristics that make them unique.

Resumen

Los chigüiros, nativos de América del Sur, no solamente son los roedores más grandes del mundo sino

también tienen otras características que los hacen únicos. Son especies de mamíferos semiacuáticas

que pastean y viven en grandes grupos en los que las hembras crían comunitariamente a sus crías y los

machos defienden sus territorios mediante marcajes con el morillo, una glándula especializada, y dos

glándulas anales. Aquí presentamos el primer ensamblaje genómico del chigüiro menor,

Hydrochoerus isthmius, y el primer transcriptoma del chigüiro, H. hydrochaeris, como también

comparaciones con el genoma del chigüiro publicado anteriormente. Encontramos evidencia de

reducciones poblacionales de ambas especies, como también rearreglos genómicos en comparación

con el conejillo de indias. Nuestro análisis filogenético es consistente con análisis publicados

previamente para el suborden Hystricomorpha, pero hay evidencia para la parafilia del chigüiro.

Esperamos que este estudio contribuya a esfuerzos de conservación en estas especies, como también a

un mejor entendimiento de esas características que los hacen únicos.

Keywords: Hydrochoerus sp., chigüiro, population genomics, conservation genomics, 10X

genomics, genome assembly.

Ethics Statement:

Tissue samples of the lesser capybara and capybara were obtained under research and collecting

permit No. 1177 issued to the Universidad de los Andes by the Autoridad Nacional de Licencias

Ambientales (ANLA; National Authority of Environmental Permits). Anesthesia and euthanasia

protocols were approved by the Universidad de los Andes’ Comité Institucional de Cuidado

y Uso de Animales de Laboratorio (CICUAL; approval number C.FUA_14-023).

Introduction

The Order Rodentia is the most diverse group of mammals in the world in terms of species and

ecological diversity as well as morphological variation (Samuels, 2009; Fabre et al., 2012). Rodents

comprise almost 40% of all mammal species (Burgin et al., 2018) and inhabit almost all terrestrial

biomes (Hafner & Hafner, 1988). There are currently five recognized suborders of rodents:

Sciuromorpha (dormice, mountain beavers, marmots, squirrels and squirrel-like rodents),

Castorimorpha (beavers, kangaroo rats and pocket gophers), Myomorpha (hamsters, jerboas, mice,

rats and mouse-like rodents), Anomaluromorpha (scaly-tailed squirrels and springhares), and

Hystricomorpha (chinchillas, guinea pigs, gundis, porcupines and others), which together comprise 33

families (Wilson & Reeder, 2005).

Among these groups, in the suborder Hystricomorpha and family Caviidae (Wilson & Reeder,

2005), are the capybaras (genus: Hydrochoerus). Capybaras are known for being the largest extant

rodents (Figure 1A; Moreira et al., 2013), their semi-aquatic habits (Macdonald, 1981), and for being

social animals with communal breeding and communal defense (Macdonald, 1981). Capybaras are

also known for distinctive feeding and scent-marking behaviors. Regarding feeding, capybaras

graze on both aquatic and terrestrial herbaceous vegetation, which passes through their

digestive tract more than once, either via regurgitation or coprophagy (Lord, 1994). Regarding scent marking,

capybaras possess two types of scent glands: the morillo, a protuberance that males carry

on top of their snouts and whose size can be predictive of dominance (Rosenfield et al., 2019), and two

anal glands. Scent marking is the most frequently observed social interaction in capybaras (Emilio & Macdonald, 1994).

There are currently two described species of capybaras: the capybara, H. hydrochaeris, and the

lesser capybara, H. isthmius (Mones, 1991). The capybara inhabits eastern Colombia, eastern Venezuela, the

Guianas, Ecuador, Peru, northeastern Argentina and Uruguay, whereas the lesser capybara inhabits Panama,

western Colombia and western Venezuela (Figure 1B; Reid, 2016; Delgado & Emmons, 2016). However, some

dispute exists as to whether there are two species of capybara or only one. Some authors still refer to

the lesser capybara as a subspecies (see Correa & Jorgenson, 2009 and Carrascal, Linares & Chacón,

2011), although it is classified as its own species in the database of mammalian taxonomy (Wilson & Reeder,

2005) and by the International Union for Conservation of Nature and Natural Resources (IUCN;

Delgado & Emmons, 2016). The position of the capybaras within the Hystricomorpha phylogeny is also disputed

(see Upham & Patterson, 2015; Álvarez, Arévalo, & Verzi, 2017; Rowe & Honeycutt, 2002).

Figure 1. (A) Relative size of the two species of capybara as compared to a 1.75 m tall human. The lesser

capybara (Hydrochoerus isthmius) is shown in purple on the left while the capybara (Hydrochoerus

hydrochaeris) is shown in blue on the right. (B) Geographic ranges of the capybara (blue) and the lesser

capybara (purple) according to the IUCN (Reid, 2016; Delgado & Emmons, 2016). (C) A picture showing a

male capybara with its morillo indicated by the red arrow. All images were taken from Wikimedia Commons,

where they were labeled for noncommercial reuse with modification.

Currently, the capybara is listed as a species of Least Concern by the IUCN (Reid, 2016), but

some concerns have arisen over the years and substantial population declines have been noted in this

species (Corriale & Herrera, 2014). The lesser capybara is listed as Data Deficient due to lack of

baseline studies on the status of these populations (Delgado & Emmons, 2016). This species has been

neglected in studies of conservation despite being harvested for meat, leather and fat (Pinheiro &

Moreira, 2013) and despite the threats to its native habitat (Aldana-Domínguez, Vieira-Muñoz &

Bejarano, 2013).

In this paper we present the first genome assembly for the lesser capybara and the first

transcriptome assembly for the capybara, as well as new analyses for the previously published

capybara genome assembly by Herrera-Álvarez et al. (2018). We compare demographic changes over

time of both species, compare synteny between the two capybara species and between each species

and the guinea pig (Cavia porcellus), evaluate distinctiveness of the capybaras as independent

monophyletic groups, assess the position of Hydrochoerus in the Hystricomorpha tree, and analyze

differentially expressed genes among tissues from the capybara.

Materials and Methods

Genome assembly of the capybara

Tissue from a wild-caught but captive-raised capybara (H. hydrochaeris), reportedly from

Bolivia, was donated by the San Diego Zoo’s Frozen Zoo to the 200 Mammalian Genomes Project

led by the Broad Institute, which then sequenced and assembled a draft genome using the sequencing

and assembly method of DISCOVAR de novo (Weisenfeld et al., 2014). This assembly was then

‘upgraded’ using Chicago libraries provided by Dovetail Genomics (Putnam et al., 2016) and financed

by Colciencias. Details on the final assembly can be found in Herrera-Álvarez et al. (2018).

Genome assembly of the lesser capybara

Tissue samples of the lesser capybara were collected from one juvenile H. isthmius from San

Juan del Carare, Santander, Colombia, on 22 June 2017. The specimen was subsequently accessioned into the

mammals collection of the Museo Historia Natural ANDES, Universidad de los Andes, Bogotá,

Colombia (field number AJC 7100, voucher number ANDES-M 2300). This sample was sequenced

using 10X Genomics linked-read technology on two lanes of an Illumina HiSeq X10. The resulting reads

were run through longranger v2.2.2 (10X Genomics) to process the barcodes and to estimate genome size

and heterozygosity.

These reads were assembled with Supernova v2.0.1 (Weisenfeld et al., 2017), an assembler

created by 10X Genomics that builds progressively larger contigs and applies its own trimming step

to create phased scaffolds from the reads. We used the pseudohap output style of mkoutput so that

the resulting assembly represents only one haplotype.

To enhance this assembly we used the following three tools. 1) Tigmint v1.1.2 was used to

produce an assembly that is both more contiguous and more correct by comparing the alignment of

linked reads to the draft assembly in order to correct possible mis-assemblies (Jackman et al., 2018). 2)

ARCS v1.0.6 was used to add a further scaffolding step, using the information

in the linked reads to join the scaffolds most likely to be adjacent and thereby

create a more contiguous assembly (Yeo et al., 2017). 3) Sealer v2.0.2 was used to identify

intra-scaffold gaps in the draft assembly, search for flanking sequences, and then fill the gaps by

realigning the raw reads to the assembly (Paulino et al., 2015). Sealer navigates de Bruijn graphs via

Bloom filters built for each k value; we chose k values of 64, 80, 96, 112, and 128 because this

range could help us close gaps in areas of low coverage with the lower values and in

areas of high repetition with the higher values (Paulino et al., 2015).

Between each of the aforementioned steps, and at the end, we ran QUAST v5.0.2 (Gurevich

et al., 2013) to measure improvement in the quality metrics of the assembly. These metrics included (1)

contig sizes: number of contigs, length of the largest contig, and contig and scaffold Nx (the length in base

pairs such that all contigs of at least that length together add up to x% of the length of the whole

assembly, e.g., N50), and (2) comparison to the domestic guinea pig reference genome assembly

(RefSeq accession number: GCF_000151735.1; release: Cavpor 3.0) in terms of GC content (%),

number of mismatches per 100 kilobases (Kb), and number of indels per 100 Kb (Gurevich et al.,

2013; Table 1).
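
The Nx statistic described above can be illustrated with a minimal sketch. The function name and the toy contig lengths below are hypothetical and serve only to show the definition; this is not QUAST's implementation.

```python
def nx(lengths, x=50):
    """Return Nx: the largest length L such that sequences of length >= L
    together cover at least x% of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest sequence downwards, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Hypothetical contig lengths (total = 300 bp), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(toy_contigs)        # 80 + 70 = 150 reaches 50% of 300
n90 = nx(toy_contigs, 90)    # 80 + 70 + 50 + 40 + 30 = 270 reaches 90%
print(n50, n90)
```

In this toy case the two longest contigs already cover half of the assembly, so N50 is the length of the second contig.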

Finally, to assess genome completeness, we ran BUSCO v3.0.2 using the Vertebrate dataset

(Waterhouse et al., 2018) and compared the percentage of BUSCO genes recovered against the

genome assemblies of other rodent species published in Ensembl (Table 2).

Transcriptome sequencing, assembly, and functional annotation

Transcriptomic data were obtained from eleven tissues representing two H. hydrochaeris

individuals of either sex (Table 3). Tissues were preserved in Nucleic Acid Preservation (NAP;

Camacho-Sánchez et al., 2013) buffer to avoid RNA degradation. RNA was extracted with standard

TRI Reagent® Solution (Ambion Inc., Austin, Texas, USA) and then cleaned using the RNeasy Plus

Mini Kit (Qiagen, Hilden, Germany) and diluted to a final volume of 30 μl with nuclease-free water.

Quantity of extracted RNA for library construction was measured with the Qubit® RNA HS Assay

kit. Complementary DNA (cDNA) libraries were constructed with the Illumina TruSeq v. 2 kit using

half reactions. Quality of cDNA libraries was assessed using an Agilent 2100 Bioanalyzer and the Agilent

High Sensitivity DNA kit. The 11 libraries were barcoded and run together in paired-end mode on one

lane of an Illumina HiSeq 2000.

For the transcriptome assembly, we used all reads from the 11 sequenced libraries. We first

trimmed the reads with Trimmomatic v0.39 (Bolger, Lohse & Usadel, 2014), filtered them with the

FASTX-Toolkit v0.0.14, normalized based on the median coverage, and trimmed unreliable k-mers using the

khmer v1 digital normalization algorithm (Crusoe et al., 2015; Brown et al., 2012). We then used

Trinity to assemble the transcriptome from the remaining reads (Grabherr et al., 2011). To evaluate

the transcriptome assembly quality based on its completeness we used Trinity scripts and BUSCO

v3.0.2 (Waterhouse et al., 2018).

To functionally annotate the transcriptome, we extracted the longest open reading frames and

predicted the most likely coding regions with TransDecoder v3.0.0 (Haas & Papanicolaou, 2015).

Then we used Trinotate to functionally annotate the predicted polypeptides and to create a database

for navigating these data (following Bryant et al., 2017). We used BLAST v2.9.0 to search for

homology hits against the UniProt Swiss-Prot database (UniProt Consortium, 2018), and identified the

Pfam protein family to which each transcript belonged (El-Gebali et al., 2018) using profile hidden

Markov models with HMMER v3.2.1 (Mistry et al., 2013). We predicted signal peptides using a deep

neural network approach with SignalP v5.0 (Armenteros et al., 2019), predicted transmembrane

protein domains using hidden Markov models with TMHMM v2.0 (Krogh et al., 2001), and assigned

inferred proteins to eggNOG functional categories (Huerta-Cepas et al., 2015) and to gene ontology

categories (GO; Ashburner et al., 2000) using BLAST v2.9.0.

Genomic repeat masking

The results of repeat masking and annotation of the capybara were taken from Herrera-

Álvarez et al. (2018), and a similar approach was taken for the lesser capybara. To repeat mask the

lesser capybara assembly, we used RepeatMasker v4.0.9_p2 (Smit, Hubley & Green, 2015) specifying

“rodentia” as the species to guide the masking using repeat evidence from other rodents. For the type

of masking, we chose a soft-masked approach, which let us see which subsequences were

repeats without them interfering with downstream analyses.
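
As an illustration of what soft masking means in practice, the sketch below assumes the usual convention that repeats are written in lowercase while the rest of the sequence stays in uppercase. The function name and the toy sequence are hypothetical.

```python
def soft_masked_fraction(seq):
    """Fraction of bases written in lowercase, i.e. soft-masked as repeats."""
    bases = [c for c in seq if c.isalpha()]
    if not bases:
        return 0.0
    return sum(c.islower() for c in bases) / len(bases)


# Hypothetical soft-masked fragment: the lowercase run marks a repeat.
toy = "ACGTacacacacGGCTA"
fraction = soft_masked_fraction(toy)

# Because the bases are still present, the unmasked sequence can be
# recovered simply by converting back to uppercase.
unmasked = toy.upper()
print(round(fraction, 3), unmasked)
```

This is why soft masking keeps the repeat content visible: nothing is replaced by N, and tools that respect case can skip the lowercase regions.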

Genome annotation and gene content

For annotating the lesser capybara genome, we selected three high-quality, annotated genome

assemblies from representative rodent species. We used the MAKER v2.31 pipeline (Holt & Yandell,

2011) based on proteomes from the guinea pig (Cavia porcellus, Cavpor 3.0), the house mouse (Mus

musculus; GRCm38.p6), and the common rat (Rattus norvegicus; Rnor6.0) to guide the annotation.

We downloaded the proteomes from the Ensembl release 97 and clustered them with CD-HIT v4.6.1

into a single non-redundant file (Li, Jaroszewski, & Godzik, 2001). The capybara genome annotation

was taken from Herrera-Álvarez et al. (2018). We used the Swiss-Prot reviewed database (UniProt

Consortium, 2018) to add functional information to the annotations of both species based on homology hits found with

BLASTP v2.9.0. Additionally, we used InterProScan v5.36-75 (Jones et al., 2014) to classify the

annotated genes into Pfam protein families (El-Gebali et al., 2018).

Microsatellites are a class of short tandem repeat (STR) motifs that are frequently used as

Mendelian markers in population genetic and kinship studies (Jarne & Lagoda, 1996). Here we define

STRs as six or more repeats of a dinucleotide unit, or five or more repeats of a unit ranging from tri- to

dodecanucleotide. To annotate microsatellites for both species, we used the MIcroSAtellite

identification tool (MISA-web; Beier et al., 2017).
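
The STR definition above can be sketched with a simple regular-expression search. This is a simplified stand-in written for illustration, not MISA's algorithm: the function names and the toy sequence are hypothetical, and units that are themselves repeats of a shorter motif (for example "AA") are skipped here as an assumption.

```python
import re


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter motif."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_strs(seq, min_unit=2, max_unit=12):
    """Return (start, unit, repeat_count) for every perfect STR in seq.

    Thresholds follow the text: >= 6 repeats for dinucleotide units,
    >= 5 repeats for units of 3 to 12 nucleotides.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_unit, max_unit + 1):
        min_repeats = 6 if k == 2 else 5
        # One unit of length k followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):
                hits.append((m.start(), unit, len(m.group(0)) // k))
    return sorted(hits)


# Hypothetical sequence: (AC)x7 and (CAG)x5 qualify, (AT)x4 is too short.
toy = "TTG" + "AC" * 7 + "GGT" + "CAG" * 5 + "T" + "AT" * 4
hits = find_strs(toy)
print(hits)
```

Start positions are 0-based offsets into the sequence.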

Mitochondrial genome

We built a separate BLAST nucleotide database from each of the capybara and lesser capybara genome assemblies

and used the guinea pig mitochondrial genome (Accession number: NC_000884) as a query against

each database with BLASTN v2.9.0. We then annotated the best-matching scaffold to obtain the

mitochondrial genome sequence of each species using the MITOS WebServer pipeline (Bernt et al.,

2013). To visualize the mitogenomes we used the CGView Server (Grant & Stothard, 2008).

Demography

We used the pairwise sequentially Markovian coalescent (PSMC; Li & Durbin, 2011) to infer

how the effective population size (Ne) may have changed over recent history. Briefly, this algorithm

reconstructs the distribution of times to most recent common ancestor (TMRCA) along chromosomes

by examining the density of heterozygous sites (Li & Durbin, 2011). To do this, we first indexed the

assemblies, to make alignment faster and less computationally expensive, and mapped the raw reads

back to them with bwa v0.7.4 (Li & Durbin, 2009). We then sorted each alignment and

converted it from a BAM to a VCF file with SAMtools v1.8 (Li et al., 2009), called the SNPs and

indels with bcftools v1.8 (Li, 2011), and transformed this file to a fastq file with vcftools v4.2.0

(Danecek et al., 2011). We then used this file to estimate the parameters of the PSMC model, with 100

bootstrap replicates and a time-interval pattern (the -p option) of “4+25*2+4+6”, using psmc v0.6.5 (Li & Durbin,

2011).
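
The string “4+25*2+4+6” follows the syntax of the -p option of psmc, in which each term groups consecutive atomic time intervals that share one population-size parameter, and "n*m" means n parameters each spanning m intervals. The minimal sketch below only unpacks that pattern; the function name is hypothetical and it is not part of psmc itself.

```python
def parse_psmc_pattern(pattern):
    """Expand a PSMC -p pattern into the list of interval-group widths.

    Each list entry is one free population-size parameter; its value is the
    number of atomic time intervals that share that parameter.
    """
    groups = []
    for term in pattern.split("+"):
        if "*" in term:
            n_params, width = (int(v) for v in term.split("*"))
            groups.extend([width] * n_params)
        else:
            groups.append(int(term))
    return groups


groups = parse_psmc_pattern("4+25*2+4+6")
n_atomic_intervals = sum(groups)   # 4 + 25*2 + 4 + 6
n_free_parameters = len(groups)    # 1 + 25 + 1 + 1
print(n_atomic_intervals, n_free_parameters)
```

Under this reading, the pattern used here splits time into 64 atomic intervals described by 28 free parameters.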

Synteny between the capybaras and the common guinea pig

To identify genomic regions containing large rearrangements in the capybaras relative to the

guinea pig, we performed global pairwise alignments between the two capybara species, between the

capybara and the guinea pig, and between the lesser capybara and the guinea pig using bwa v0.7.4 (Li

& Durbin, 2009). To visualize these alignments, we used Circos v0.69-8 (Krzywinski et al., 2009)

to draw, in 100 Kb windows, links between the aligned sequences of two half circles,

each representing one of the assemblies.

Genetic diversity within Hydrochoerus

To estimate genomic divergence between northern and southern H. hydrochaeris relative to

H. isthmius, we ran a phylogenomic analysis of protein-coding sequences derived from genomic and

transcriptomic analyses, with the guinea pig as outgroup. To minimize possible problems with

paralogy, we analyzed only those genes included in the BUSCO Vertebrate orthologs dataset

(Waterhouse et al., 2018) which were obtained using BUSCO v3.0.2 from either the reference

genome assembly (H. isthmius, southern H. hydrochaeris from Bolivia) or the transcriptome de novo

assembly (northern H. hydrochaeris from the Colombian Llanos) or the guinea pig transcriptome

(Cavia porcellus; genome version: Cavpor3.0; accession number: GCA_000151735.1). All BUSCO

genes that were found complete in all four datasets were aligned independently with MAFFT v7.309

using a BLOSUM62 matrix (Katoh & Standley, 2013). Alignments were then trimmed with trimAl

v1.4 (Capella-Gutiérrez, Silla-Martínez & Gabaldón, 2009). To infer a species tree, we concatenated

the resulting alignments with FASconCAT-G v1.04 (Kück & Meusemann, 2010) and used IQ-TREE

v1.6.10 to select the best-fit model of substitution, for all genes, based on a corrected Akaike

information criterion (AICc) and to implement a partitioned likelihood analysis (Nguyen et al., 2014;

Chernomor, von Haeseler & Minh, 2016). Statistical support for relationships was estimated using

1000 non-parametric bootstraps for sites within partitions and 1000 likelihood ratio tests. The

resulting tree was visualized in iTOL (Letunic & Bork, 2019).
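
The concatenation step that feeds a partitioned analysis can be sketched as follows. This is a minimal illustration under the assumption that every gene alignment contains the same taxa; the function name, gene names, taxon labels, and sequences are all hypothetical, and FASconCAT-G itself offers far more than this.

```python
def concatenate(alignments, taxa):
    """Join per-gene alignments into one supermatrix and record, for each
    gene, its 1-based start and end columns (the partition boundaries)."""
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in alignments.items():
        if set(aln) != set(taxa):
            raise ValueError(f"{gene}: taxa do not match the expected set")
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        length = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(aln[taxon])
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical trimmed protein alignments for two genes and two taxa.
toy_alignments = {
    "geneA": {"sp1": "MK-L", "sp2": "MKAL"},
    "geneB": {"sp1": "GGT", "sp2": "GAT"},
}
supermatrix, partitions = concatenate(toy_alignments, ["sp1", "sp2"])
print(supermatrix, partitions)
```

Keeping the partition boundaries is what allows a separate substitution model to be fitted to each gene within the concatenated matrix.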

Phylogenomic analyses

To verify the position of Hydrochoerus in the Hystricomorpha phylogeny, we downloaded

from Ensembl all available proteomes of Hystricomorpha and included the mouse as an outgroup, for a

total of nine species (Table 2), and used OrthoFinder v2.3.3 to search for orthologs (Emms & Kelly,

2019). Then we extracted all single-copy orthologs that were found in all nine species and performed

a pre-alignment quality filter with PREQUAL v1.02 to identify and filter non-homologous sequences

(Whelan, Irisarri & Burki, 2018). We performed a multiple sequence alignment with MAFFT v7.309

assuming a BLOSUM62 matrix for each orthologue independently (Katoh & Standley, 2013), and

then trimmed poorly aligned regions with trimAl v1.4 to keep only the most

reliable regions (Capella-Gutiérrez, Silla-Martínez & Gabaldón, 2009). To infer phylogenetic

relationships, we used two approaches. First, we concatenated all the alignments with FASconCAT-G

v1.04 (Kück & Meusemann, 2010) and constructed a maximum likelihood tree with RAxML v8.2.12,

using a GAMMA model of rate heterogeneity with an estimated alpha parameter and 100 bootstraps

for statistical support (Stamatakis, 2014). Second, we estimated a species tree via a Bayesian approach

using MCMCTree implemented in PAML v4.9 (Yang, 2007). We discarded the first 2000 generations

of the Markov chain as burn-in and then sampled 20,000 trees, one every 20 iterations. The timetree

was calibrated by constraining the root to a temporal interval of 68 - 78 million years ago,

corresponding to the TMRCA of the Hystricomorpha group and the mouse (Hedges, Dudley &

Kumar, 2006). The sample of posterior trees was used to generate a Hessian matrix using CODEML

in PAML v4.9 and assuming a WAG+GAMMA model. The Hessian matrix was used to run

MCMCTree again to obtain a Bayesian consensus tree, which was visualized using the R package

MCMCTreeR (Puttick, 2019). As a check on the MCMCTree results, we used a second species-tree

approach that takes into account the individual history of each gene. We used RAxML v8.2.12 to

infer a maximum likelihood tree for each gene independently, also with a GAMMA model and 100

bootstraps per gene. The resulting gene trees were used as input to infer a species tree using NJst (Liu

& Yu, 2011).
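
As a small worked example of the sampling scheme, the sketch below computes the implied chain length, assuming that the 20,000 samples were drawn one every 20 iterations after the burn-in had been discarded (which is how the description reads). The function name is hypothetical.

```python
def chain_length(burnin, n_samples, sample_every):
    """Total number of MCMC iterations: burn-in plus the sampled portion."""
    return burnin + n_samples * sample_every


total_iterations = chain_length(burnin=2000, n_samples=20_000, sample_every=20)
burnin_share = 2000 / total_iterations
print(total_iterations, round(100 * burnin_share, 2))
```

Under that assumption the chain ran for 402,000 iterations, of which the burn-in was roughly 0.5%.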

Results

Sequencing and genome assembly of the lesser capybara

The sequencing of the lesser capybara, Hydrochoerus isthmius, resulted in a total of 1.751

billion reads, each of length 150 bp. From these reads it was inferred that

the lesser capybara genome has a size of 2.7 Gb and a heterozygosity

of 0.24% (Table 1). The Supernova assembly (step 1 of the assembly

process) had a total size of 2.5 Gb, counting only scaffolds ≥ 10,000 bp,

and a scaffold N50 of 694 Kb (Table 4). QUAST assembly statistics and their improvement

throughout the successive steps of the assembly process (see Materials and Methods) are

reported in Table 4. The final lesser capybara genome draft contained 18,502 contigs with a contig

N50 of 232 Kb plus 7,702 scaffolds with an N50 of 787 Kb, and a GC content of 40.01%.

As a measure of genome completeness we used the fraction of genes recovered in our

assemblies out of a total of 3,023 BUSCO genes in the vertebrate data set. For the lesser capybara

we recovered 2,563 genes that were assembled completely (84%) and 227 fragmented genes (7.5%),

with 233 genes missing (7.7%), making the lesser capybara assembly comparable to other rodent

genome assemblies in Ensembl (Figure 2).
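
The BUSCO percentages follow directly from the gene counts reported above. The short sketch below recomputes them; the function name is hypothetical and the counts are the ones given in the text.

```python
def busco_percentages(complete, fragmented, missing):
    """Convert BUSCO gene counts into percentages of the full gene set."""
    total = complete + fragmented + missing
    counts = {"complete": complete, "fragmented": fragmented, "missing": missing}
    return total, {name: round(100 * n / total, 1) for name, n in counts.items()}


total, percentages = busco_percentages(complete=2563, fragmented=227, missing=233)
print(total, percentages)
```

The three counts sum to the 3,023 genes of the data set, and the complete fraction comes out at 84.8% to one decimal place, which the text reports as 84%.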

Figure 2. Percentage of Vertebrate BUSCO genes recovered in published rodent genome assemblies (plus

rabbit), as a measure of assembly completeness. The genome assemblies of H. hydrochaeris (Herrera-Álvarez

et al., 2018) and H. isthmius (this study) are indicated in bold inside the rectangle.

Transcriptome sequencing, assembly, and functional annotation

A total of 882.24 million reads with a length of 150 bp were sequenced from the 11

RNA-seq libraries made from Hydrochoerus hydrochaeris. After the quality filtering, normalization, and

trimming steps (see Materials and Methods) a total of 140 million reads were kept and subsequently

used for transcriptome assembly. The resulting transcriptome had a total of 994,100 transcripts

belonging to 768,228 genes. The transcriptome had an estimated GC content of 46.58%, an N50 of 704

bp, and an average length of 574.17 bp, taking into account only the longest isoform per gene.

Among the eggNOG functional categories, the most common were translation, ribosomal structure and

biogenesis (14.9%) followed by amino acid transport and metabolism (9.1%), and energy production

and conversion (8%; Figure 3A). The Pfam protein families most represented were immunoglobulin

V-set domain, immunoglobulin domain and zinc finger C2H2 type with 18.2%, 8.5% and 5.4%,

respectively (Figure 3B). The three most represented GO terms were cellular nitrogen compound

metabolic process (5.1%), DNA metabolic process (3.7%) and biosynthetic process (3.2%; Figure

3C). Among the 768,228 genes, we predicted 273,824 coding regions and of these 5.3% (14,553 of

273,824) were predicted to have signal peptides.

Figure 3. Capybara (Hydrochoerus hydrochaeris) transcriptome functional annotation. (A) Functional

categories from eggNOG mapping. (B) Protein families from Pfam. (C) The 25 most represented

Gene Ontology categories.

Genomic repeat masking

Lesser capybara: The repeats identified by RepeatMasker occupied 27.72% of the total

assembly. These repeats belonged mostly to the LINEs repeat class (51.9%; long interspersed

elements), followed by LTRs (17.2%; long terminal repeats) and SINEs (15.4%; short interspersed

elements) (Figure 4A-B). Almost half of the repetitive elements (47.08%) were LINEs from the

subclass LINE-1, which is consistent with what is reported for humans (45.55%; Lander et al., 2001),

mice (48.71%; Waterston et al., 2002) and other rodents (Figure 4C; Smit, Hubley & Green, 2015).

Figure 4. Frequency of classes (A) and subclasses (B) of repetitive elements within the lesser capybara

(Hydrochoerus isthmius) genome assembly. (A) SINEs: short interspersed elements, LINEs: long interspersed

elements, LTR: long terminal repeats, others: satellites, simple repeats, small RNA, and low complexity repeats.

(B) ALU/B1, B2 - B4, and MIRs: subclasses of SINEs; LINE1, LINE2, and L3/CR1: subclasses of LINEs;

ERVL, ERVL-MaLRs, ERV class I, and ERV class II: subclasses of LTR elements; hAT-Charlie, and TCMar-

Tigger: subclasses of DNA elements. (C) Frequency of subclasses of repetitive elements on different species of

rodents. Data for all but the two capybara species are reported in Smit, Hubley & Green (2015).

Genomic annotation and gene content

We annotated a total of 26,080 genes, 82% of which had an annotation edit distance (AED) < 0.5, indicating high

quality of the annotations. The higher number of genes annotated in the lesser capybara compared to

the capybara can be explained by the less fragmented assembly of the latter (scaffold N50: 787 Kb for the lesser capybara

and 12.2 Mb for the capybara). More than half of the annotations were involved in cellular processes

(31.2%) and metabolic processes (20.1%), followed by biological regulation (15.3%) and localization

(11.1%; Figure 5).

Figure 5. Genes predicted in the lesser capybara (Hydrochoerus isthmius) genome annotation that are involved

in: (A) Biological processes, (B) cellular components, (C) molecular functions, and (D) Protein classes from

Pfam.

Microsatellites - A total of 509,265 and 718,560 microsatellites were found in the capybara

and lesser capybara, respectively. In both assemblies, microsatellites generally became less common as the

repeat unit became longer, but for some odd unit lengths (n = 3, 7,

11) units of length n+1 had a higher count (Table 5; Figure 6).

Figure 6. Counts of microsatellites found in each genome assembly. The y-axis represents the repeat unit length
in base pairs, while the color and size of each circle represent the total count of microsatellites with that unit
size found in each of the genome assemblies.


Mitochondrial genome

Capybara - The mitogenome assembly of the capybara, H. hydrochaeris, consisted of a

scaffold with three gaps. It contains two ribosomal RNA genes (12S and 16S), 21 transfer RNA

genes, and 13 protein coding genes (CDS). The assembly seemed to suggest a duplication of tRNA-W

and the deletion of tRNA-R.

Lesser capybara - The mitogenome of H. isthmius assembled here had a length of 16,525

base pairs with a GC content of 39.37%, and contained two ribosomal RNA genes (12S and 16S), 22

transfer RNA genes, and 13 protein coding genes (CDS). No major rearrangements or gains/losses

were found relative to the mammalian mitogenomes previously reported.

See Table 6 for the size and position of genes within the mitochondrial genomes of each species and
Figure 7 for a visual comparison of both mitochondrial genomes with that of the guinea pig.

Figure 7. Mitochondrial genomes of the (A) capybara (Hydrochoerus hydrochaeris), (B) lesser capybara (H.
isthmius), and (C) guinea pig (Cavia porcellus), annotated with the MITOS web server. The mitogenome sequence
was found in a single scaffold in each capybara assembly using Blast v2.9.0, based on similarity to the
guinea pig (Cavia porcellus) mitochondrial genome. Image made with the CGView Server.

Demography

We fit a pairwise sequentially Markovian coalescent (PSMC) model to the genome assembly
of each capybara species to evaluate possible changes in effective population size in the recent past
(Figure 8). The capybara’s PSMC model suggested a relatively steady population size, fluctuating
mildly between Ne = 10,000 and 25,000. For the lesser capybara, on the other hand, a sudden
population expansion started ~500,000 years ago, peaked around 200,000 years ago, and crashed back
down to roughly Ne = 20,000 some 100,000 years ago (Figure 8).
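PSMC reports time and population size in scaled units, so values in years and in Ne such as those quoted above depend on an assumed mutation rate and generation time. The sketch below shows the standard rescaling (N0 = θ0 / (4μs), time = 2·N0·t_k generations, Ne = N0·λ_k, with s the bin size in base pairs). All numeric inputs are hypothetical placeholders chosen for round numbers; they are not the parameters used in this study, and the function name is illustrative.

```python
def scale_psmc(theta0, t_k, lambda_k, mu, generation_years, bin_size=100):
    """Convert scaled PSMC output to years and effective population sizes.

    theta0           -- scaled mutation rate per bin reported by PSMC
    t_k, lambda_k    -- scaled time points and relative population sizes
    mu               -- assumed mutation rate per site per generation
    generation_years -- assumed generation time in years
    bin_size         -- bases per bin in the PSMC input (100 is the usual default)
    """
    n0 = theta0 / (4 * mu * bin_size)          # reference population size
    years = [2 * n0 * t * generation_years for t in t_k]
    ne = [n0 * lam for lam in lambda_k]
    return years, ne


# Hypothetical inputs for illustration only (not values from this study).
years, ne = scale_psmc(
    theta0=0.1,
    t_k=[0.0, 0.5, 2.0],
    lambda_k=[1.0, 2.5, 1.5],
    mu=2.5e-8,
    generation_years=3,
)
for y, n in zip(years, ne):
    print(f"{y:,.0f} years ago: Ne = {n:,.0f}")
```

Because both axes scale with 1/μ, a different assumed mutation rate shifts the whole curve in time and in Ne without changing its shape.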


Figure 8. Pairwise sequentially Markovian coalescent (PSMC) analysis of the capybara and lesser capybara
genome assemblies (in blue and red, respectively). On the x-axis, time runs from the present (left) towards
the past (right); the y-axis represents effective population size (Ne) in units of 10⁴.

Synteny between the capybaras and the common guinea pig

We aligned the two capybara whole-genome assemblies against each other, and each of them
independently against the guinea pig genome assembly (accession GCA_000151735.1), to search for
large genomic changes. Between the capybaras and the guinea pig we found one major unmatched
region, where the guinea pig may have gained or rearranged a region, or the capybaras may have
lost or rearranged it (red arrows in Figure 9A-B). Between the two capybara assemblies there were
no major rearrangements, although the lower contiguity of the lesser capybara assembly is noticeable
(Figure 9C). Because of the low contiguity of the assemblies used in this analysis, it is not possible to
determine whether the mismatches detected are due to one or multiple rearrangements, nor in which
specific parts of the capybara genomes they occur.

Figure 9. Pairwise circos plots showing synteny in 100 kb windows among the capybara, Hydrochoerus
hydrochaeris, the lesser capybara, H. isthmius, and the guinea pig. Each species is represented by a color
(capybara: blue, lesser capybara: orange, and guinea pig: green). (A) Comparison between the guinea pig, left,
and the lesser capybara, right. (B) Comparison between the capybara, left, and the lesser capybara, right. (C)
Comparison between the guinea pig, left, and the capybara, right.

Genomic diversity within Hydrochoerus


We extracted BUSCO genes from the transcriptome of the capybara from the eastern Llanos
of Colombia, from the transcriptomes predicted in the gene annotations of the capybara (Bolivia) and
lesser capybara (western Colombia) genome assemblies, and from the guinea pig, which we used as an
outgroup. From these, we reconstructed a phylogenomic tree based on 2,325 concatenated genes to test
the following hypothesis: if H. hydrochaeris and H. isthmius are distinct species, the two capybara
samples should form a clade to the exclusion of the lesser capybara. Instead, we found that the lesser
capybara was nested within the capybara clade, being genetically closer to the Bolivian sample than to
the geographically closer Colombian Llanos sample, indicating a complex phylogeographic
history (Figure 10).
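The hypothesis test described above reduces to asking whether the two capybara samples form a clade in the rooted tree. A minimal sketch of that check follows, using nested tuples as a stand-in for a rooted tree; the sample labels and both topologies are written out by hand from the description in the text, and the helper names are hypothetical rather than part of any phylogenetics library.

```python
def leaves(node):
    """Return the set of leaf names under a nested-tuple tree node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some subtree of `tree` contains exactly the taxa in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


capybaras = {"capybara_Llanos", "capybara_Bolivia"}

# Topology expected if the two capybara samples form a clade.
expected = ("guinea_pig", (("capybara_Llanos", "capybara_Bolivia"), "lesser_capybara"))

# Topology described in the text: the lesser capybara groups with the Bolivian sample.
observed = ("guinea_pig", ("capybara_Llanos", ("capybara_Bolivia", "lesser_capybara")))

print("expected tree, capybaras monophyletic:", is_monophyletic(expected, capybaras))
print("observed tree, capybaras monophyletic:", is_monophyletic(observed, capybaras))
```

The check only inspects topology; it says nothing about branch support, which would need to be assessed separately.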

Figure 10. Simple phylogeographic analysis of the capybara (n = 2 localities) and the lesser capybara. For this
analysis, transcript samples from a capybara from Colombia, genetic samples from a capybara from Bolivia, and
a lesser capybara from Colombia were used. Each sample is mapped with a yellow line from the
phylogenetic tree to the region where it came from. The lesser capybara’s geographic range is shown in
purple and the capybara’s geographic range in blue.

Phylogenomic analysis

We inferred phylogenetic relationships among the two capybara species and the hystricomorph
species with proteomes available in Ensembl, using the mouse as an outgroup. We found a
total of 508 single-copy orthologs present in all nine species, which we subsequently used in the
phylogenomic analysis. Within the suborder Hystricomorpha we found three distinct clades: one
composed of the Damara mole rat and the naked mole rat (Family: Bathyergidae), which diverged from
the rest approximately 56 million years ago (mya); a second clade containing the chinchilla (Family:
Chinchillidae) and the degu (Family: Octodontidae); and a third clade containing the guinea pigs and
capybara species (Family: Caviidae). The latter two clades separated from each other
approximately 30 mya (Figure 11).
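Selecting single-copy orthologs present in all species amounts to keeping only those orthogroups with exactly one gene per species. The sketch below illustrates that filter on an invented table of gene counts; the orthogroup identifiers, species labels, and function name are hypothetical, and the code does not reproduce the orthology inference itself.

```python
def single_copy_orthogroups(gene_counts, species):
    """Return orthogroups that have exactly one gene in every listed species.

    gene_counts maps an orthogroup ID to a dict of {species: number of genes}.
    A species missing from that dict is treated as having zero genes.
    """
    return sorted(
        orthogroup
        for orthogroup, counts in gene_counts.items()
        if all(counts.get(sp, 0) == 1 for sp in species)
    )


# Invented example with three species instead of the nine used in the study.
species = ["capybara", "lesser_capybara", "mouse"]
gene_counts = {
    "OG1": {"capybara": 1, "lesser_capybara": 1, "mouse": 1},  # kept
    "OG2": {"capybara": 2, "lesser_capybara": 1, "mouse": 1},  # duplicated in one species
    "OG3": {"capybara": 1, "lesser_capybara": 1},              # missing in mouse
}

kept = single_copy_orthogroups(gene_counts, species)
print("single-copy orthogroups in all species:", kept)
```

Requiring presence in every species is a strict filter, which is one reason the number of usable orthologs can fall well below the number of annotated genes.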


Figure 11. Phylogenetic relationships among rodent species using single-copy orthologs found in all nine
species. These orthologs were aligned, filtered for quality, and concatenated with FASconCAT-G v1.04. A
maximum likelihood tree was then constructed with RAxML, and species divergence times were estimated with
MCMCTree. All Hystricomorpha proteomes available on Ensembl, plus the mouse proteome, were used as inputs.
Blue lines represent 95% credibility intervals around divergence times. Numbers on the upper x-axis represent
millions of years ago, and letters on the lower x-axis represent geological epochs (La.: Late Cretaceous; Pa.:
Paleocene; Eo.: Eocene; Ol.: Oligocene; Mi.: Miocene).


Discussion

Genome and transcriptome assemblies and annotation

Here we report the first genome assembly for the lesser capybara as well as the first transcriptome
assembly for the capybara. As has been demonstrated previously, low-cost genome assemblies like
those provided by 10X Genomics are a powerful tool for understanding a species from a genomic
perspective (e.g., Armstrong et al., 2019; Kocher et al., 2018; Hulse-Kemp et al., 2018). Even if the
resulting assembly is not highly contiguous, this kind of technology allows one to infer a range of
biological processes.

Rapid population changes in the lesser capybara

PSMC models use coalescent times across heterozygous sites in a single diploid genome to infer
changes in effective population size over time (Li & Durbin, 2011). Nadachowska‐Brzyska et al. (2016)
suggested that PSMC models are reliable only for genome assemblies with a mean coverage over 18X
and no more than 25% missing data, thresholds that our capybara and lesser capybara genome
assemblies meet. The PSMC models reported here indicate that both species show a trend towards
reduced population sizes (Figure 8), a trend that is more drastic in the lesser capybara.
Similar trends are seen in other large mammal species, which tend to need larger patches of
uninterrupted habitat to persist (Berger, 2017). Currently, the capybara’s habitat is under various threats
and is being reduced by extensive changes in land use (Göpel et al., 2019). Another possible reason for
this trend in the capybara is a shift by local people from hunting capybaras to breeding them for food
(da Rosa et al., 2019). Active breeding reduces the effective population size, leading to
problems such as inbreeding depression and higher genetic load (Wang, Santiago & Caballero, 2016;
Hedrick & Garcia-Dorado, 2016).

Phylogenomic analysis places capybaras with other caviids

Our phylogenomic analysis places capybaras as sister to guinea pigs (Family: Caviidae; genus: Cavia
spp.) and the caviids as sister to a group composed of chinchillas and degus (Families: Chinchillidae
and Octodontidae). Together these families form the Caviomorpha, and this clade is sister to the
Phiomorpha, represented in this analysis by the family Bathyergidae, the mole rats. This topology
agrees with those found previously (Upham & Patterson, 2015; Álvarez, Arévalo, & Verzi, 2017).
Divergence times for the different groups match those reported by Álvarez, Arévalo, & Verzi
(2017), but differ from those reported by Upham & Patterson (2015). However, Upham & Patterson
(2015) used sequences from two mitochondrial genes and three nuclear genes, while
Álvarez, Arévalo, & Verzi (2017) used mitochondrial genes in addition to five nuclear genes. We are
highly confident in the results reported here because we used whole-genome assemblies to find
single-copy orthologs and applied quality-filtering steps, and because we took two different approaches,
a concatenated maximum likelihood tree and a neighbour-joining approach that treats each gene as
independent, to account for sources of phylogenetic inference error such as incomplete lineage
sorting.

Lesser capybara is nested inside capybara

Phylogenomic analyses show that the lesser capybara is more closely related to the capybara
sample from Bolivia than it is to the capybara from the Llanos of Colombia. Even though this result
suggests that the capybara (H. hydrochaeris) is a paraphyletic species, there is morphological
evidence showing that the clades are divergent (Mones, 1991). Additionally, given the small sample
size (n = 3), further population genetic studies coupled with morphological analyses should be carried
out. If the pattern seen here is real, it is all the more concerning that some populations
have been artificially inbred as a consequence of human breeding, since this can disrupt the
evolutionary processes that are isolating capybaras from eastern and western Colombia, which are
separated by the Andes mountain range.


Acknowledgements

This work was supported by Colciencias grant 1204-659-44334 (to AJC). Special thanks to: the DCB

and the Facultad de Ciencias of the Universidad de los Andes for giving us access to the Magnus

cluster on which all the computational analyses were run. Thanks to the Universidad de los Andes
Vice-president’s office for help with collecting and mobilization permits. Thanks to the members of the

Biom|ics lab whose comments and help were invaluable for this project, to Juanita Herrera, María

José Páramo and Diego Perico for their field help in the collection of samples. Thanks to Catalina

Palacios, Phil Morin, and Alejandro Reyes for their analytical help and advice. Finally, thanks to

Rachel Voyt and Melissa Hernández for their insightful comments on this manuscript.


Literature cited

Aldana-Domínguez, J., Vieira-Muñoz, M. I., & Bejarano, P. (2013). Conservation and use of the

capybara and the lesser capybara in Colombia. In Capybara (pp. 321-332). Springer, New York, NY.

Álvarez, A., Arévalo, R. L. M., & Verzi, D. H. (2017). Diversification patterns and size evolution in

caviomorph rodents. Biological Journal of the Linnean Society, 121(4), 907-922.

Armenteros, J. J. A., Tsirigos, K. D., Sønderby, C. K., Petersen, T. N., Winther, O., Brunak, S., ... &

Nielsen, H. (2019). SignalP 5.0 improves signal peptide predictions using deep neural networks.

Nature biotechnology, 37(4), 420.

Armstrong, E. E., Taylor, R. W., Prost, S., Blinston, P., van der Meer, E., Madzikanda, H., ... & Petrov,

D. (2019). Cost-effective assembly of the African wild dog (Lycaon pictus) genome using linked reads.

GigaScience, 8(2), giy124.

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... & Harris, M. A.

(2000). Gene ontology: tool for the unification of biology. Nature genetics, 25(1), 25.

Beier, S., Thiel, T., Münch, T., Scholz, U., & Mascher, M. (2017). MISA-web: a web server for

microsatellite prediction. Bioinformatics, 33(16), 2583-2585.

https://doi.org/10.1093/bioinformatics/btx198

Berger, J. (2017). The science and challenges of conserving large wild mammals in 21st-century
American protected areas. In Science, Conservation, and National Parks (pp. 189-211).

Bernt, M., Donath, A., Jühling, F., Externbrink, F., Florentz, C., Fritzsch, G., ... & Stadler, P. F. (2013).

MITOS: improved de novo metazoan mitochondrial genome annotation. Molecular Phylogenetics and

Evolution, 69(2), 313-319.

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence

data. Bioinformatics, 30(15), 2114-2120.

Brown, C. T., Howe, A., Zhang, Q., Pyrkosz, A. B., & Brom, T. H. (2012). A reference-free algorithm

for computational normalization of shotgun sequencing data. arXiv preprint arXiv:1203.4802.

Bryant, D. M., Johnson, K., DiTommaso, T., Tickle, T., Couger, M. B., Payzin-Dogru, D., ... &

Bateman, J. (2017). A tissue-mapped axolotl de novo transcriptome enables identification of limb

regeneration factors. Cell reports, 18(3), 762-776.

Burgin, C. J., Colella, J. P., Kahn, P. L., & Upham, N. S. (2018). How many species of mammals are

there?. Journal of Mammalogy, 99(1), 1-14.

Bushmanova, E., Antipov, D., Lapidus, A., Suvorov, V., & Prjibelski, A. D. (2016). rnaQUAST: a

quality assessment tool for de novo transcriptome assemblies. Bioinformatics, 32(14), 2210-2212.

Cahill, J. A., Soares, A. E., Green, R. E., & Shapiro, B. (2016). Inferring species divergence times

using pairwise sequential Markovian coalescent modelling and low-coverage genomic data.

Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1699), 20150138.

Camacho‐Sanchez, M., Burraco, P., Gomez‐Mestre, I., & Leonard, J. A. (2013). Preservation of RNA

and DNA from mammal samples under field conditions. Molecular Ecology Resources, 13(4), 663-

673.

Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated

alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15), 1972-1973.

Carrascal, J., Linares, J., & Chacón, J. (2011). Behavior of the Hydrochoerus hydrochaeris isthmius in

a productive system, department of Córdoba, Colombia. Revista MVZ Córdoba, 16(3), 2754-2764.


Chernomor, O., von Haeseler, A., & Minh, B. Q. (2016). Terrace aware data structure for

phylogenomic inference from supermatrices. Systematic biology, 65(6), 997-1008.

Correa, J. B., & Jorgenson, J. P. (2009). Aspectos poblacionales del cacó (Hydrochoerus hydrochaeris

isthmius) y amenazas para su conservación en el Nor-Occidente de Colombia. Mastozoología

neotropical, 16(1), 27-38.

Corriale, M. J., & Herrera, E. A. (2014). Patterns of habitat use and selection by the capybara

(Hydrochoerus hydrochaeris): a landscape‐scale analysis. Ecological research, 29(2), 191-201.

Crusoe, M. R., Alameldin, H. F., Awad, S., Boucher, E., Caldwell, A., Cartwright, R., ... & Fenton, J.

(2015). The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research,

4.

da Rosa, P. P., Ávila, B. P., Costa, P. T., Fluck, A. C., Scheibler, R. B., Ferreira, O. G. L., & Gularte,

M. A. (2019). Analysis of the perception and behavior of consumers regarding capybara meat by means

of exploratory methods. Meat science, 152, 81-87.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., ... & McVean, G.

(2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156-2158.

Delgado, C. & Emmons, L. (2016). Hydrochoerus isthmius. The IUCN Red List of Threatened Species
2016: e.T136277A22189896.
https://dx.doi.org/10.2305/IUCN.UK.2016-2.RLTS.T136277A22189896.en.

Göpel, J., Schüngel, J., Schaldach, R., Stuch, B., & Löbelt, N. (2019). Assessing the effects of

agricultural intensification on natural habitats and biodiversity in Southern Amazonia. bioRxiv, 846709.

Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., ... & Chen, Z.

(2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature

biotechnology, 29(7), 644.

Grant, J. R., & Stothard, P. (2008). The CGView Server: a comparative genomics tool for circular

genomes. Nucleic acids research, 36(suppl_2), W181-W184.

Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for

genome assemblies. Bioinformatics, 29(8), 1072-1075.

El-Gebali, S., Mistry, J., Bateman, A., Eddy, S. R., Luciani, A., Potter, S. C., ... & Sonnhammer, E. L.

L. (2018). The Pfam protein families database in 2019. Nucleic acids research, 47(D1), D427-D432.

Emilio, A. H., & Macdonald, D. W. (1994). Social significance of scent marking in capybaras. Journal

of Mammalogy, 75(2), 410-415.

Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative

genomics. BioRxiv, 466201.

Fabre, P. H., Hautier, L., Dimitrov, D., & Douzery, E. J. (2012). A glimpse on the pattern of rodent

diversification: a phylogenetic approach. BMC evolutionary biology, 12(1), 88.

Haas, B., & Papanicolaou, A. (2015). TransDecoder (find coding regions within transcripts). GitHub,
n.d. https://github.com/TransDecoder/TransDecoder.

Hafner, J. C., & Hafner, M. S. (1988). Heterochrony in rodents. In Heterochrony in Evolution (pp. 217-

235). Springer, Boston, MA.

Hedges, S. B., Dudley, J., & Kumar, S. (2006). TimeTree: a public knowledge-base of divergence

times among organisms. Bioinformatics, 22(23), 2971-2972.


Hedrick, P. W., & Garcia-Dorado, A. (2016). Understanding inbreeding depression, purging, and

genetic rescue. Trends in Ecology & Evolution, 31(12), 940-952.

Herrera-Álvarez, S., Karlsson, E., Ryder, O. A., Lindblad-Toh, K., & Crawford, A. J. (2018). How to

make a rodent giant: Genomic basis and tradeoffs of gigantism in the capybara, the world’s largest

rodent. BioRxiv, 424606. https://doi.org/10.1101/424606

Holt, C., & Yandell, M. (2011). MAKER2: an annotation pipeline and genome-database management

tool for second-generation genome projects. BMC bioinformatics, 12(1), 491.

Hulse-Kemp, A. M., Maheshwari, S., Stoffel, K., Hill, T. A., Jaffe, D., Williams, S. R., ... & Schatz, M.

C. (2018). Reference quality assembly of the 3.5-Gb genome of Capsicum annuum from a single

linked-read library. Horticulture research, 5(1), 1-13.

Huerta-Cepas, J., Szklarczyk, D., Forslund, K., Cook, H., Heller, D., Walter, M. C., ... & Jensen, L. J.

(2015). eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for

eukaryotic, prokaryotic and viral sequences. Nucleic acids research, 44(D1), D286-D293.

Jackman, S. D., Coombe, L., Chu, J., Warren, R. L., Vandervalk, B. P., Yeo, S., ... & Birol, I. (2018).

Tigmint: correcting assembly errors using linked reads from large molecules. BMC bioinformatics,

19(1), 393.

Jarne, P., & Lagoda, P. J. (1996). Microsatellites, from molecules to populations and back. Trends in

ecology & evolution, 11(10), 424-429.

Jones, P., Binns, D., Chang, H. Y., Fraser, M., Li, W., McAnulla, C., ... & Pesseat, S. (2014).

InterProScan 5: genome-scale protein function classification. Bioinformatics, 30(9), 1236-1240.

Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7:

improvements in performance and usability. Molecular biology and evolution, 30(4), 772-780.

Kocher, S. D., Mallarino, R., Rubin, B. E., Douglas, W. Y., Hoekstra, H. E., & Pierce, N. E. (2018).

The genetic basis of a social polymorphism in halictid bees. Nature communications, 9(1), 1-7.

Krogh, A., Larsson, B., Von Heijne, G., & Sonnhammer, E. L. (2001). Predicting transmembrane

protein topology with a hidden Markov model: application to complete genomes. Journal of molecular

biology, 305(3), 567-580.

Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., ... & Marra, M. A.

(2009). Circos: an information aesthetic for comparative genomics. Genome research, 19(9), 1639-

1645.

Kück, P., & Meusemann, K. (2010). FASconCAT: convenient handling of data matrices. Molecular

Phylogenetics and Evolution, 56(3), 1115-1118.

Lander, E., Linton, L., Birren, B. et al (2001). Initial sequencing and analysis of the human genome.

Nature 409, 860–921. doi:10.1038/35057062.

Letunic, I., & Bork, P. (2019). Interactive Tree Of Life (iTOL) v4: recent updates and new

developments. Nucleic acids research.

Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and

population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform.

Bioinformatics, 25(14), 1754-1760.

Li, H., & Durbin, R. (2011). Inference of human population history from individual whole-genome

sequences. Nature, 475(7357), 493.


Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The

sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.

Li, W., Jaroszewski, L., & Godzik, A. (2001). Clustering of highly homologous sequences to reduce

the size of large protein databases. Bioinformatics, 17(3), 282-283.

Liu, L., & Yu, L. (2011). Estimating species trees from unrooted gene trees. Systematic biology, 60(5),

661-667.

Lord, R. D. (1994). A descriptive account of capybara behaviour. Studies on neotropical fauna and

environment, 29(1), 11-22.

Macdonald, D. W. (1981). Dwindling resources and the social behaviour of capybaras, (Hydrochoerus
hydrochaeris) (Mammalia). Journal of Zoology, 194(3), 371-391.

Mistry, J., Finn, R. D., Eddy, S. R., Bateman, A., & Punta, M. (2013). Challenges in homology search:

HMMER3 and convergent evolution of coiled-coil regions. Nucleic acids research, 41(12), e121-e121.

Mones, A. (1991). Monografía de la familia Hydrochoeridae (Mammalia: Rodentia).

Moreira, J. R., Alvarez, M. R., Tarifa, T., Pacheco, V., Taber, A., Tirira, D. G., ... & Macdonald, D. W.

(2013). Taxonomy, natural history and distribution of the capybara. In Capybara (pp. 3-37). Springer,

New York, NY.

Nadachowska‐Brzyska, K., Burri, R., Smeds, L., & Ellegren, H. (2016). PSMC analysis of effective

population sizes in molecular ecology and its application to black‐and‐white Ficedula flycatchers.

Molecular ecology, 25(5), 1058-1072.

Nguyen, L. T., Schmidt, H. A., von Haeseler, A., & Minh, B. Q. (2014). IQ-TREE: a fast and effective

stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular biology and

evolution, 32(1), 268-274.

Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and

bias-aware quantification of transcript expression. Nature methods, 14(4), 417.

Paulino, D., Warren, R. L., Vandervalk, B. P., Raymond, A., Jackman, S. D., & Birol, I. (2015). Sealer:

a scalable gap-closing application for finishing draft genomes. BMC bioinformatics, 16(1), 230.

Pinheiro, M. S., & Moreira, J. R. (2013). Products and uses of capybaras. In Capybara (pp. 211-227).

Springer, New York, NY.

Putnam, N. H., O'Connell, B. L., Stites, J. C., Rice, B. J., Blanchette, M., Calef, R., ... & Haussler, D.

(2016). Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome

research, 26(3), 342-350.

Puttick, M. N. (2019). MCMCtreeR: functions to prepare MCMCtree analyses and visualize posterior

ages on trees. Bioinformatics, 35(24), 5321-5322.

Reid, F. (2016). Hydrochoerus hydrochaeris . The IUCN Red List of Threatened Species 2016:

e.T10300A22190005. https://dx.doi.org/10.2305/IUCN.UK.2016-2.RLTS.T10300A22190005.en.

Rosenfield, D. A., Nichi, M., Losano, J. D., Kawai, G., Leite, R. F., Acosta, A. J., ... & Pizzutto, C. S.

(2019). Field-testing a single-dose immunocontraceptive in free-ranging male capybara (Hydrochoerus

hydrochaeris): Evaluation of effects on reproductive physiology, secondary sexual characteristics, and

agonistic behavior. Animal reproduction science, 209, 106148.

Rowe, D. L., & Honeycutt, R. L. (2002). Phylogenetic relationships, ecological correlates, and

molecular evolution within the Cavioidea (Mammalia, Rodentia). Molecular Biology and Evolution,

19(3), 263-277.


Samuels, J. X. (2009). Cranial morphology and dietary habits of rodents. Zoological Journal of the

Linnean Society, 156(4), 864-888.

Smit, A. F. A., Hubley, R., & Green, P. (2015). RepeatMasker Open-4.0. 2013–2015.

Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large

phylogenies. Bioinformatics, 30(9), 1312-1313.

Trillmich, F., Kraus, C., Künkele, J., Asher, M., Clara, M., Dekomien, G., ... & Sachser, N. (2004).

Species-level differentiation of two cryptic species pairs of wild cavies, genera Cavia and Galea, with a

discussion of the relationship between social systems and phylogeny in the Caviinae. Canadian

Journal of Zoology, 82(3), 516-524.

UniProt Consortium. (2018). UniProt: a worldwide hub of protein knowledge. Nucleic acids research,

47(D1), D506-D515.

Upham, N. S., & Patterson, B. D. (2015). Evolution of caviomorph rodents: a complete phylogeny and

timetree for living genera. Biology of caviomorph rodents: diversity and evolution. Buenos Aires:

SAREM Series A, 1, 63-120.

Waterhouse, R. M., Seppey, M., Simão, F. A., Manni, M., Ioannidis, P., Klioutchnikov, G., ... &

Zdobnov, E. M. (2017). BUSCO applications from quality assessments to gene prediction and

phylogenomics. Molecular biology and evolution, 35(3), 543-548.

Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P., ... & Antonarakis,

S. E. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915),

520-562.

Wang, J., Santiago, E., & Caballero, A. (2016). Prediction and estimation of effective population size.

Heredity, 117(4), 193-206.

Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M., & Jaffe, D. B. (2017). Direct determination of

diploid genome sequences. Genome research, 27(5), 757-767. doi: 10.1101/gr.214874.116.

Weisenfeld, N. I., Yin, S., Sharpe, T., Lau, B., Hegarty, R., Holmes, L., ... & Nusbaum, C. (2014).

Comprehensive variation discovery in single human genomes. Nature genetics, 46(12), 1350.

Whelan, S., Irisarri, I., & Burki, F. (2018). PREQUAL: detecting non-homologous characters in sets of

unaligned homologous sequences. Bioinformatics, 34(22), 3929-3930.

Wilson, D. E., & Reeder, D. M. (Eds.). (2005). Mammal species of the world: a taxonomic and

geographic reference (Vol. 1). JHU Press.

Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and

evolution, 24(8), 1586-1591.

Yeo, S., Coombe, L., Warren, R. L., Chu, J., & Birol, I. (2017). ARCS: scaffolding genome drafts with

linked reads. Bioinformatics, 34(5), 725-731.

Young, M. D., Wakefield, M. J., Smyth, G. K., & Oshlack, A. (2010). Gene ontology analysis for

RNA-seq: accounting for selection bias. Genome biology, 11(2), R14.


Tables

Table 1. Quality metrics reported by the software Supernova v2.0.1 before and after assembling the

lesser capybara genome (Weisenfeld et al., 2017).

Input statistics

Number of reads 1751.10 M

Mean read length after trimming 139.50 b

Raw coverage 84.29X

Effective read coverage 50.16X

Fraction of Q30 bases in read 2 75.26%

Median insert size 345.00 b

Fraction of proper read pairs 89.81%

Fraction of barcodes used 1

Estimated genome size 3.14 Gb

Genome repetitivity index 9.95%

High AT index 0.06%

GC content of assembly 40.04%

Dinucleotide content 1.23%

Weighted mean molecule size 36.91 Kb

Molecule count extending 10 kb on both sides 67.21

Mean distance between heterozygous SNPs 2.28 Kb

Fraction of reads that are not barcoded 6.97%

N50 reads per barcode 1.36 K

Fraction of reads that are duplicates 30.66%

Nonduplicate and phased reads 38.94%

Table 2. Rodent proteomes used for comparative analyses.

Common name Species Genome version Accession number

Algerian mouse Mus spretus SPRET_EiJ_v1 GCA_001624865.1

Alpine marmot Marmota marmota marmota marMar2.1 GCA_001458135.1

American beaver Castor canadensis C.can_genome_v1.0 GCA_001984765.1

Arctic ground squirrel Urocitellus parryii ASM342692v1 GCA_003426925.1

Brazilian guinea pig Cavia aperea CavAp1.0 GCA_000688575.1

Chinese hamster Cricetulus griseus CriGri_1.0 GCA_000223135.1

Daurian ground squirrel Spermophilus dauricus ASM240643v1 GCA_002406435.1

Degu Octodon degus OctDeg1.0 GCA_000260255.1


Golden Hamster Mesocricetus auratus MesAur1.0 GCA_000349665.1

Guinea Pig Cavia porcellus Cavpor3.0 GCA_000151735.1

Kangaroo rat Dipodomys ordii Dord_2.0 GCA_000151885.2

Lesser Egyptian jerboa Jaculus jaculus JacJac1.0 GCA_000280705.1

Long-tailed chinchilla Chinchilla lanigera ChiLan1.0 GCA_000276665.1

Mongolian gerbil Meriones unguiculatus MunDraft-v1.0 GCA_002204375.1

Damara mole rat Fukomys damarensis DMR_v1.0 GCA_000743615.1

Naked mole-rat Heterocephalus glaber HetGla_female_1.0 GCA_000247695.1

Squirrel Ictidomys tridecemlineatus SpeTri2.0 GCA_000236235.1

Prairie vole Microtus ochrogaster MicOch1.0 GCA_000317375.1

Ryukyu mouse Mus caroli CAROLI_EIJ_v1.1 GCA_900094665.2

Mouse Mus musculus GRCm38.p6 GCA_000001635.8

Shrew mouse Mus pahari PAHARI_EIJ_v1.1 GCA_900095145.2

Steppe mouse Mus spicilegus MUSP714 GCA_003336285.1

Upper Galilee mountains blind mole rat Nannospalax galili S.galili_v1.0 GCA_000622305.1

Rabbit Oryctolagus cuniculus OryCun2.0 GCA_000003625.1

Northern American deer mouse Peromyscus maniculatus bairdii HU_Pman_2.1 GCA_003704035.1

Rat Rattus norvegicus Rnor_6.0 GCA_000001895.4

Table 3. Tissues sequenced for the transcriptomic analysis.

Individuals sampled

Species Coll. Number Location (Lat, Lon) Sex

Capybara (Hydrochoerus hydrochaeris) AJC 05614 05.8106°, -70.9718° Male

Capybara (Hydrochoerus hydrochaeris) AJC 05615 05.8106°, -70.9718° Female (gravid)

Tissues sampled

1. Heart

2. Brain

3. Kidney

4. Testes

5. Ovaries

6. Morillo

7. Anal gland

8. Fetal tissue


9. Bone marrow

10. Thyroid gland

11. Pancreas

Table 4. Quast assembly statistics for the different steps taken during the assembly.

Assembly step Supernova Tigmint Arcs + Links Sealer (Final version)

Quast analysis Scaffolds Contigs Scaffolds Contigs Scaffolds Contigs Scaffolds Contigs

# contigs (>= 0 bp) 29608 - 29762 - 28300 - 28300 -

# contigs (>= 1000 bp) 29608 - 29679 - 28217 - 28217 -

# contigs (>= 5000 bp) 16923 - 16982 - 15548 - 15551 -

# contigs (>= 10000 bp) 13322 50406 13367 50406 11991 50406 11994 29315

# contigs (>= 25000 bp) 10095 37807 10140 37807 9046 37807 9043 23474

# contigs (>= 50000 bp) 8503 25211 8543 25211 7702 25211 7702 18502

Largest contig 20810282 1295922 14115789 1295922 14115789 1295922 14116249 2052147

GC (%) 40.01 40.01 40.01 40.01 40.01 40.01 40.02 40.01

Reference GC (%) 39.95 39.95 39.95 39.95 39.95 39.95 39.95 39.95

N50 694764 116657 692348 116667 787285 116613 787090 232449

NG50 993541 156066 988664 156066 1101859 156066 1101324 308863

N75 344971 63185 344508 63189 389945 63157 390021 125489

NG75 649192 107821 645404 107821 726481 107821 725900 216001

L50 1465 9763 1483 9762 1328 9768 1328 4962

LG50 788 5738 803 5738 732 5738 732 2927

L75 3421 20802 3445 20801 3048 20815 3048 10522

LG75 1650 10992 1670 10992 1493 10992 1494 5567

# misassemblies 0 0 0 0 0 0 0 0

# unaligned mis. contigs 3846 6666 3865 6666 3685 6666 3687 5814

# unaligned contigs 7699 + 5623 part 44170 + 13918 part 7720 + 5647 part 44146 + 13918 part 6719 + 5272 part 44374 + 13922 part 6717 + 5277 part 21537 + 10144 part

# N's per 100 kbp 523.76 0 492.54 0 496.33 0 474.79 0.11

# indels per 100 kbp 402.9 403.6 402.9 403.6 402.79 403.56 403.03 403.38

Complete BUSCO (%) 84.16 81.19 84.16 81.19 84.16 81.19 84.16 82.84

Partial BUSCO (%) 2.97 4.29 2.97 4.29 3.3 4.62 3.3 4.29
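
The contiguity statistics in Table 4 (N50, NG50, N75, NG75 and the matching L values) follow standard definitions: sort the sequences from longest to shortest and accumulate lengths until a given fraction of the total assembly length (N/L) or of a reference genome size (NG/LG) is reached. The Python sketch below illustrates that definition only; it is not the Quast implementation, and the function name `nx_lx`, its parameters, and the toy lengths are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def nx_lx(
    lengths: Iterable[int],
    fraction: float = 0.5,
    reference_size: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (Nx, Lx) for a collection of contig or scaffold lengths.

    With reference_size=None the threshold is a fraction of the assembly
    length (N50, N75, ...). With a reference size given, the threshold is a
    fraction of that size instead (NG50, NG75, ...).
    """
    ordered = sorted(lengths, reverse=True)
    total = reference_size if reference_size is not None else sum(ordered)
    threshold = fraction * total

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # length is Nx; count is Lx (number of sequences needed).
            return length, count
    raise ValueError("sequences do not cover the requested fraction")


# Toy lengths (hypothetical, not taken from the assemblies in Table 4).
toy = [80, 70, 50, 40, 30, 20]
print(nx_lx(toy))                      # (70, 2)  -> N50, L50
print(nx_lx(toy, fraction=0.75))       # (40, 4)  -> N75, L75
print(nx_lx(toy, reference_size=400))  # (50, 3)  -> NG50, LG50
```

Because the NG and LG values are scaled to a reference size rather than to the assembly itself, they can differ from the N and L values for the same assembly, as they do in Table 4.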

Table 5. Microsatellites found in the capybara and lesser capybara genome assemblies.

Capybara - Hydrochoerus hydrochaeris

Unit size Number of SSRs

2 318056

3 60663

4 97980

5 25910

6 5210

7 319

8 721

9 187

10 126

11 8

12 85

Lesser capybara - Hydrochoerus isthmius

Unit size Number of SSRs

2 445243

3 87656

4 138686

5 38745

6 6562

7 493

8 727

9 163

10 151

11 17

12 117

Table 6. Capybara and lesser capybara mitogenome annotations.

Capybara - Hydrochoerus hydrochaeris

Name Feature Start Stop Strand

trnF tRNA 332 398 -

trnP tRNA 1756 1825 +

trnT tRNA 1832 1898 -

cob CDS 1908 3041 -

trnE tRNA 3049 3117 +

nad6 CDS 3130 3642 +

nad5 CDS 3657 5456 -

trnL1 tRNA 5457 5526 -

trnS1 tRNA 5526 5584 -

trnH tRNA 5588 5656 -

nad4 CDS 5667 7034 -

nad4l CDS 7031 7324 -

trnW tRNA 7326 7394 -

nad3 CDS 7397 7735 -

trnG tRNA 7742 7810 -

cox3_b CDS 7812 8033 -

cox3_a CDS 8039 8590 -

atp6 CDS 8596 9270 -

atp8 CDS 9246 9431 -

trnK tRNA 9433 9499 -

cox2-0 CDS 9506 10063 -

cox2-1 CDS 10062 10181 -

trnD tRNA 10183 10251 -

trnS2 tRNA 10259 10327 +

cox1_b CDS 10337 11872 -

trnY tRNA 11879 11947 +

trnC tRNA 11950 12016 +

trnN tRNA 12055 12127 +

trnA tRNA 12129 12197 +

trnW tRNA 12200 12269 -

nad2 CDS 12350 13297 -

trnM tRNA 13313 13381 -

trnQ tRNA 13384 13454 +

trnI tRNA 13452 13520 -

nad1 CDS 13528 14478 -

trnL2 tRNA 14482 14556 -

rrnL rRNA 14558 16125 -

trnV tRNA 16124 16192 -

rrnS rRNA 16190 16947 -

Lesser capybara - Hydrochoerus isthmius

Name Feature Start Stop Strand

trnP tRNA 1038 1107 +

trnT tRNA 1114 1181 -

cob CDS 1191 2324 -

trnE tRNA 2332 2400 +

nad6 CDS 2413 2925 +

nad5 CDS 2940 4739 -

trnL1 tRNA 4740 4809 -

trnS1 tRNA 4809 4867 -

trnH tRNA 4871 4939 -

nad4 CDS 4950 6317 -

nad4l CDS 6314 6607 -

trnR tRNA 6609 6677 -

nad3 CDS 6680 7018 -

trnG tRNA 7025 7093 -

cox3 CDS 7095 7877 -

atp6 CDS 7883 8557 -

atp8 CDS 8530 8718 -

trnK tRNA 8720 8786 -

cox2 CDS 8793 9470 -

trnD tRNA 9472 9540 -

trnS2 tRNA 9547 9615 +

cox1_b CDS 9625 10368 -

cox1_a CDS 10365 11159 -

trnY tRNA 11166 11234 +

trnC tRNA 11237 11303 +

trnN tRNA 11343 11415 +

trnA tRNA 11417 11485 +

trnW tRNA 11488 11557 -

nad2 CDS 11565 12584 -

trnM tRNA 12600 12668 -

trnQ tRNA 12671 12741 +

trnI tRNA 12739 12807 -

nad1 CDS 12815 13765 -

trnL2 tRNA 13769 13843 -

rrnL rRNA 13845 15413 -

trnV tRNA 15412 15480 -

rrnS rRNA 15478 16431 -

trnF tRNA 16431 16497 -
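
As a reading aid for Table 6, the following Python sketch computes feature lengths and overlaps between neighbouring features from rows in the table's five-column layout. It assumes that Start and Stop are 1-based inclusive coordinates, which the table does not state explicitly but which is consistent with the CDS lengths being multiples of three (for example, cob at 1908 to 3041 spans 1134 bp). The names `Feature`, `parse_rows`, `feature_length` and `adjacent_overlaps` are hypothetical and are not part of the annotation pipeline.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    name: str
    kind: str
    start: int
    stop: int
    strand: str


def parse_rows(text: str) -> List[Feature]:
    """Parse 'Name Feature Start Stop Strand' rows, skipping other lines."""
    features = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue
        name, kind, start, stop, strand = parts
        if not (start.isdigit() and stop.isdigit()):
            continue  # header row or malformed line
        features.append(Feature(name, kind, int(start), int(stop), strand))
    return features


def feature_length(feature: Feature) -> int:
    """Length in bp, assuming 1-based inclusive coordinates."""
    return feature.stop - feature.start + 1


def adjacent_overlaps(features: List[Feature]) -> List[Tuple[str, str, int]]:
    """Return (first, second, shared bp) for neighbouring features that overlap."""
    ordered = sorted(features, key=lambda f: f.start)
    found = []
    for first, second in zip(ordered, ordered[1:]):
        shared = first.stop - second.start + 1
        if shared > 0:
            found.append((first.name, second.name, shared))
    return found


# Four rows copied from the capybara annotation in Table 6.
rows = """
Name Feature Start Stop Strand
nad4 CDS 5667 7034 -
nad4l CDS 7031 7324 -
atp6 CDS 8596 9270 -
atp8 CDS 9246 9431 -
"""
features = parse_rows(rows)
print([feature_length(f) for f in features])  # [1368, 294, 675, 186]
print(adjacent_overlaps(features))
# [('nad4', 'nad4l', 4), ('atp6', 'atp8', 25)]
```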