Upload
debashish
View
214
Download
2
Embed Size (px)
Citation preview
SHORT COMMUNICATION
Exploring Biotic Interactions Within Protist Cell PopulationsUsing Network MethodsShu Chenga,1, Dana C. Pricea,1, Slim Karkarb & Debashish Bhattacharyaa,b
a Department of Ecology, Evolution and Natural Resources, Rutgers University, New Brunswick, New Jersey, 08901, USA
b Institute of Marine and Coastal Science, Rutgers University, New Brunswick, New Jersey, 08901, USA
Keywords
Assortativity; Bigelowiella natans; network
analysis; Paulinella ovalis; scale-free
network; single cell genomics.
Correspondence
D. Bhattacharya, Rutgers University, 59
Dudley Road, Foran Hall 102, New
Brunswick, New Jersey, 08901, USA
Telephone number: +1 848-932-6218;
FAX number: +1 732-932-8746;
e-mail: [email protected]
1Equal contribution made by these authors.
Received: 4 October 2013; revised 6
January 2014; accepted February 2, 2014.
doi:10.1111/jeu.12113
ABSTRACT
The study of diseased human cells and of cells isolated from the natural envi-
ronment will likely be revolutionized by single cell genomics (SCG). Here, we
used protein similarity networks to explore within- and between-cell DNA dif-
ferences from SCG data derived from six individual rhizarian cells related to
Paulinella ovalis and proteins from the complete genome of another rhizarian,
Bigelowiella natans. We identified shared and distinct DNA components within
our SCG data and between P. ovalis and B. natans. We show that network
properties such as assortativity and degree effectively discriminate genome
features between SCG assemblies and that SCG data follow the power law
with a small number of protein families dominating networks.
NETWORK methods offer rapid and powerful tools for
analyzing complex information such as genomic or func-
tional genomic data (e.g. Bapteste et al. 2012; Barab�asiand Oltvai 2004; Komurov et al. 2012; Zhou et al. 2010).
Rather than studying collections of individual genes, it is
possible with network methods to interpret the biology of
cells and interactions between them as a system com-
prised of thousands of interacting components. Genome-
wide protein networks can be built using measures of
similarity (e.g. with a BLAST cutoff value) to create edges
that connect the different nodes (proteins) with groups of
related sequences in a network graph referred to as a con-
nected component (Beauregard-Racine et al. 2011). The
components represent genes and gene families that are
shared among the studied genomes. Network analysis
does not require predictive gene modeling and can be
implemented with freely available tools such as Cytoscape
(http://www.cytoscape.org/). This straightforward and
inclusive approach has been used to identify ancient con-
nections between members of gene families (i.e. when
relaxed cutoff values are used), for defining horizontal
gene transfer events, and for recognizing gene fusions
that may have important functional consequences (e.g.
Alvarez-Ponce et al. 2013; Beauregard-Racine et al. 2011).
These types of weak or reticulate evolutionary signal are
often difficult to assess using phylogenetic methods that
rely on simultaneous multiple sequence alignments to
generate bifurcating trees. Metagenomic and single cell
genome (SCG) data from natural samples (e.g. Halary
et al. 2010; Kalisky et al. 2011; Yoon et al. 2011) offer
other promising targets for applying network approaches
to study gene distribution.
Despite the promise, SCG methods still require consid-
erable refinement because these genome assemblies may
show significant coverage bias introduced by multiple dis-
placement amplification (MDA, Rodrigue et al. 2009; Wo-
yke et al. 2010 [although the recently developed MALBAC
procedure may ameliorate this issue; Zong et al. 2012]). In
addition, the challenge remains to assemble and analyze
complex DNA mixtures that include the host DNA and
potentially, associated nucleic acids from symbionts,
pathogens, and prey (Bhattacharya et al. 2012, 2013; Yoon
et al. 2011). Given that population-level SCG data will
likely become more widely available, here we explore the
use of network methods to study the protein comple-
ments of a wild-caught sample of microbial eukaryotes.
The data are derived from six individual Paulinella ovalis-
like cells (phagotrophic rhizarian protists) that form a sister
© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists
Journal of Eukaryotic Microbiology 2014, 61, 399–403 399
Journal of Eukaryotic Microbiology ISSN 1066-5234
Published bythe International Society of ProtistologistsEukaryotic Microbiology
The Journal of
group to photosynthetic Paulinella species and were previ-
ously isolated from a single sample collected in Chesa-
peake Bay, U.S. All of the P. ovalis-like cells share 100%
small subunit rDNA identity (Bhattacharya et al. 2012).
The MDA-derived DNA was sequenced using the Roche
454 and Illumina platforms (for details, see Table S1 and
Bhattacharya et al. 2012). A seventh and complete prote-
ome from the photosynthetic (chlorarachniophyte) rhizarian
Bigelowiella natans (Curtis et al. 2012) was included in the
analysis to provide a eukaryotic reference point when
interpreting the P. ovalis-like network graphs.
Our primary goal was to create network data structures
that highlight both the (meta) genomic similarities and dif-
ferences between the six conspecific protist cells, and
more broadly, between the two rhizarian species. All six
P. ovalis-like cells were sampled at the same time and
location, therefore the network components that discrimi-
nate (e.g. are “assortative”, see below) these genome
data represent local differences in biotic interactions
among the individual cells and their environment. Assort-
ativity in this instance is a measure of the tendency of
sequences from a given SCG assembly to preferentially
connect to each other in a connected component. The
nominal assortative coefficient ri for each SCG and
between each SCG was calculated to determine if protein
sequences tend to be assortative with respect to origin.
This approach affords a priori identification and selection
of interesting genome features (i.e. a “visual snapshot”)
prior to computationally exhaustive BLAST and phyloge-
nomic analyses, and can act as a final check on metage-
nome sequencing data to ensure that variability has been
properly described.
MATERIALS AND METHODS
Protein similarity networks
The program EGN (Halary et al. 2010; http://www.evol--
net.fr) was used to build sequence similarity networks,
defined by BLASTp protein sequence identity ≥ 40% and
an e-value threshold < 1e�5. Each subgraph (or compo-
nent) in the network with such edges represents an oper-
ational gene family whose sequences do not share
significant similarity to other components. Proteins in the
data sets were annotated using Blast2GO (Conesa et al.
2005).
Two undirected networks (G-6) and (G-7) were gener-
ated: G-6 was made using open reading frames (ORFs) of
≥ 30 amino acids extracted via EMBOSS (Rice et al. 2000)
and clustered via CD-HIT (Li and Godzik 2006) at 92.5%
amino acid similarity from each assembly of six P. ovalis
cells (Bhattacharya et al. 2012), whereas G-7 was made
using ORFs from the six individual P. ovalis cells and the
predicted proteome from B. natans (Curtis et al. 2012). All
analyses are based on the G-6 network unless otherwise
specified. The 454 sequence data used to assemble the
P. ovalis-like genomes are archived at the NCBI Sequence
Read Archive (SRA) under Accession SRA049870.1. The
454 assemblies for each P. ovalis-like cell and the ORFs
used for this study are available at http://dbdata.rutgers.
edu/data/ovalis/.
Homology network
We constructed a homology network, G-h, in which
“Homology edges” were retained from the previous net-
work G-6 when the minimal match coverage was ≥ 90%
of the query sequence length. This cutoff resulted in
edges that limited the networks to putative homologs
resulting in over one-half (56%) of the proteins from the
G-6 network being removed due to a null degree of con-
nection (i.e. they were singletons). The G-h network con-
tained 10,726 components with a total of 25,152 proteins,
based on using only the 454 data from the six P. ovalis-
like cells. Given the large number of components repre-
senting one or few genes/gene families, 131 connected
components of size ≥ 5 proteins were used to calculate
assortativity (see below).
Network visualization and analysis
We analyzed the major component of network G-7 using
the Louvain community detection method (Blondel et al.
2008). This is a graph clustering heuristic based on modu-
larity optimization that identifies densely connected
regions of a graph and resolves a hierarchical clustering
structure within networks. On the basis of the clusters
provided by the algorithm at the first level, we visualized
the main component of G-7 using Cytoscape, with each
cluster corresponding to a node. The weight of the links
between the cluster nodes corresponds to the number of
links between the proteins in the original network G-7.
We measured assortativity in the connected compo-
nents from the homology network G-h with respect to the
six SCGs. Here, consider the element eij of a 2 9 2 mixing
matrix X which denotes the observed frequency of edges
between proteins of two categories i and j :
X =
i j Σi eijj ejiΣ 1
In the undirected case, eij ¼ eji ¼ eijþeji2 , eii denotes the fre-
quency of links between nodes of the same category i,
and ai and is the frequency of links involving proteins of
category i. Then, the assortativity coefficient is defined
similarly to Newman (2003):
ri ¼P
i eii �P
i a2i
1�Pi a
2i
Here, proteins of category i are proteins from the cell to
be considered, whereas proteins of category j are the pro-
teins from the other five cells. A positive coefficient indi-
cates that nodes of the same category tend to be linked
together, whereas a negative value indicates that nodes
of different categories tend to be linked together.
© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists
Journal of Eukaryotic Microbiology 2014, 61, 399–403400
Single Cell Network Analysis Cheng et al.
RESULTS AND DISCUSSION
Analysis of protein composition
As described above, here we used networks primarily as
a visualization tool for complex SCG data to generate
knowledge and testable hypotheses (e.g. Atkinson et al.
2009) about the ecology and genome biology of P. ovalis-
like cells. Analysis of the individual P. ovalis-like SCG
assemblies using the Core Eukaryotic Genes Mapping
Approach (CEGMA; http://korflab.ucdavis.edu/datasets/ce-
gma/) showed that across all cells, 240 unique hits
(BLASTx e-value cutoff ≤ 1e�10) exist to the KOG data-
base of 458 core eukaryotic proteins (Parra et al. 2007).
Individual cell assemblies however show between 24 (Cell
5) to 141 (Cell 1) significant hits to these core proteins
(Cell 6 was not included in the analysis) and only 11 pro-
tein hits were shared across all the cell assemblies. Of a
set of 248 “ultra-conserved” eukaryotic proteins available
in CEGMA, we recovered hits to 122 of these sequences
across all cells. These data demonstrate that the com-
bined 454 SCG assemblies account for about one-half of
the expected set of core eukaryotic proteins and that
these individual cell assemblies provide largely nonoverlap-
ping data with regard to the expected nuclear genome
complement of P. ovalis-like cells. To explore whether this
result reflects only issues associated with genome amplifi-
cation bias or a combination of bias and insufficient data,
we re-ran CEGMA with Cells 1 and 2 for which we had
generated significantly more Illumina paired-end data (4.7
and 3.8 Gbp, respectively, Bhattacharya et al. 2012). No
such data existed for Cells 3–6. Again, using a BLASTx
e-value cutoff ≤ 1e�10, we found 356/458 (75.5%) of the
core eukaryotic proteins and 180/248 (72.5%) of the ultra-
conserved set in the Cell 1 assembly. These numbers
were 333/458 (72.7%) and 164/248 (66.1%), respectively,
for Cell 2. Therefore, the large majority of core eukaryotic
proteins are present in these P. ovalis-like MDA samples,
although they were not recovered in the exploratory 454
sequencing approach used here for the six cells.
Protein networks
The G-7 network was created using ORFs of length > 30
amino acids from 180 to 308 Mbp of 454 data from each
P. ovalis-like cell (Table S1) and included predicted pro-
teins from the B. natans genome project. The largest
connected component joining proteins of a putative com-
mon origin (i.e., either a shared ancestry of the entire
protein or of a protein domain) was comprised of 1,043
sequences. The cluster-based representation of this com-
ponent (Fig. 1A) uses a pie chart to show the relative
abundance of cells for each cluster. This analysis identi-
fies the broadly conserved proteins or protein domains
(i.e. > 30 amino acids in length) that are shared by all six
P. ovalis-like cells along with B. natans. Within the com-
ponent, the largest cluster (#1; Fig. 1A) contains constitu-
tive components of eukaryote genomes (predominantly
kinases and ribonucleotide/nucleoside [ATP] binding
enzymes). The two clusters sharing a dense edge (#2,
#3) provide an example of the network’s ability to differ-
entiate between functional gene families and simple
domain sharing. Both clusters are enriched in ankyrin
repeat domain-containing proteins (ARD, 167 of 238 total
sequences). This widespread domain (Mosavi et al. 2002)
is responsible for the dense edge between meth-
yltransferases from cluster #3 and proteins involved in
signal transduction in cluster #2. This result emphasizes
the capacity of cluster-based analysis to correctly isolate
signal (the shared ARD domain) from noise in classifying
proteins that have distinct functions.
To identify proteins associated with each cell that poten-
tially provide insights into shared and unique biotic interac-
tions, we computed assortativity of the sequences from
the six SCG cells in the G-h network (that only included P.
ovalis-like SCG data). A heatmap of assortativity values for
each cell (Fig. 1B) shows that these data, as expected
due to their con-specificity, generally lack an assortative
pattern. Cell 1 has the most significant amount of unique
DNA associated with it, which results in very high assort-
ativity (red region indicated with an arrow) for a small
number of components. This analysis provides a rapid
method for assessing the relative contribution of noneuk-
aryote DNA associated with a SCG assembly. BLASTp
analysis confirms this result by showing that Cell 1 is the
most divergent SCG in this analysis with a large number
of proteobacterial proteins (Fig. S1). The cumulative
increase in the number of individual protein families (i.e.
components) as cells are added to the analysis (Fig. 1C)
demonstrates the complexity of natural samples, whereby
each SCG introduces novel DNA associated with it. With
deeper sequencing, the collection of P. ovalis-like protein
families will presumably converge (e.g. at ca. 10–15 K
genes as in many unicellular protists; see Fig. 1C) as the
host genome is fully sequenced. Additional growth will
derive from biotic interactions that give rise to foreign
DNAs present in the environment and associated with
individual cells, and can be identified via their assortativity
patterns.
It is widely recognized that various genome features
(e.g. types of protein folds, nucleotide patterns; Luscombe
et al. 2002; Proulx et al. 2005) follow power-law distribu-
tions, whereby a few types of features are dominant. We
find here that the probability that a protein in our network
has k edges to another node (or degree k) also follows a
power-law distribution P(k) ~ k�c, in which c = 2.39. The
protein families with the greatest number of edges in
Fig. 1D (e.g. Na [+]-dependent inorganic phosphate co-
transporter, prolyl aminopeptidase) are ancient, function-
ally important proteins that have undergone gene
duplication and divergence, processes thought to underlie
the power-law behavior of DNA data (Luscombe et al.
2002). The degree distribution shown here provides the
opportunity to identify, in uncharacterized SCG data, con-
served eukaryotic proteins for downstream analysis.
In summary, our exploratory study demonstrates the
utility of network topology structure to accurately repre-
sent the complex nature of SCG data derived from natural
© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists
Journal of Eukaryotic Microbiology 2014, 61, 399–403 401
Cheng et al. Single Cell Network Analysis
environments. We show that network community analysis
effectively highlights shared genome components and is
robust against arbitrary sequence similarity, whereas prop-
erties such as assortativity and degree allow the identifica-
tion of both outlier cells and core sets of shared proteins
for detailed study. As the growing power of SCG (e.g.
Zong et al. 2012) allows the assembly of near complete
SCGs and associated biota, it will be possible to empha-
size genome differences in entire populations of cells,
both free living and as tissue constituents (e.g. cancer
cells) from normal and diseased states. The use of differ-
ing identity and coverage values will result in components
that can serve downstream uses such as phylogenetics,
single nucleotide polymorphism analysis, and to identify
unique and shared prey, symbionts, and pathogens (e.g.
Bhattacharya et al. 2012; Yoon et al. 2011).
ACKNOWLEDGMENTS
This work was partially supported by NSF grants 0827023
and 0936884. SC was supported by the Gordon and Betty
Moore Foundation through Grant GBMF2807 to Paul Fal-
kowski at Rutgers University.
LITERATURE CITED
Alvarez-Ponce, D., Lopez, P., Bapteste, E. & McInerney, J. O.
2013. Gene similarity networks provide tools for understanding
eukaryote origins and evolution. Proc. Natl Acad. Sci. USA, 110:
E1594–1603.Atkinson, H. J., Morris, J. H., Ferrin, T. E. & Babbitt, P. C.
2009. Using sequence similarity networks for visualization of
relationships across diverse protein superfamilies. PLoS ONE,
4:e4345.
Figure 1 Network analysis of proteins from Paulinella ovalis-like Cells 1–6. A. Network representation of genome data from six P. ovalis-like cells
and from Bigelowiella natans reveals the extent of protein sequence sharing between the targeted genomes. Circles with pie charts show the
level of representation of different cells and of B. natans in the communities. Edges (gray lines) are weighted by the number of sequences con-
necting any two communities. B. Heatmap showing cell-based assortativity coefficients. The arrow identifies the relatively high assortativity of
Cell 1 data. C. Cumulative growth in unique gene families (represented by component counts in the network) based on 454 data from the six
SCGs (light gray panel, black solid line shows the linear fit). The broken line represents the approximate gene family number in the target eukary-
ote that does not continue to grow with the addition of SCG data once the nuclear genome has been fully sequenced with SCG analysis. D.
Scale-free degree distribution of protein connections (k) in the P. ovalis-like cell networks is described as a straight line on this log–log plot.
© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists
Journal of Eukaryotic Microbiology 2014, 61, 399–403402
Single Cell Network Analysis Cheng et al.
Bapteste, E., Lopez, P., Bouchard, F., Baquero, F., McInerney, J.
O. & Burian, R. M. 2012. Evolutionary analyses of non-genea-
logical bonds produced by introgressive descent. Proc. Natl
Acad. Sci. U S A, 109:18266–18272.Barab�asi, A. L. & Oltvai, Z. N. 2004. Network biology: understanding
the cell’s functional organization. Nat. Rev. Genet., 5:101–113.Beauregard-Racine, J., Bicep, C., Schliep, K., Lopez, P., Lapointe,
F. J. & Bapteste, E. 2011. Of woods and webs: possible alter-
natives to the tree of life for studying genomic fluidity in E. coli.
Biol. Direct, 6:39.
Bhattacharya, D., Price, D. C., Yoon, H. S., Yang, E. C., Poulton,
N. J., Andersen, R. A. & Das, S. P. 2012. Single cell genome
analysis supports a link between phagotrophy and primary plas-
tid endosymbiosis. Sci. Rep., 2:356.
Bhattacharya, D., Price, D. C., Bicep, C., Bapteste, E., Sarwade,
M., Rajah, V. D. & Yoon, H. S. 2013. Identification of a marine
cyanophage in a protist single-cell metagenome assembly. J.
Phycol., 49:207–212.Blondel, V., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. 2008.
Fast unfolding of communities in large networks. J. Stat.
Mech., 2008:P10008.
Conesa, A., Gotz, S., Garcia-Gomez, J. M., Terol, J., Talon, M. &
Robles, M. 2005. Blast2GO: a universal tool for annotation,
visualization and analysis in functional genomics research. Bioin-
formatics, 21:3674–3676.Curtis, B., Tanifuji, G., Burki, F., Gruber, A., Irimia, M., Maruyama,
S., Arias, M., Ball, S., Gile, G., Hirawaka, Y., Hopkins, J., Kuo,
A., Ransing, S., Schmutz, J., Symeonidi, A., Elias, M., Eveleigh,
R., Herman, E., Klute, M., Nakayama, T., Obornik, M., Reyes-
Prieto, A., Armbrust, E. V., Aves, S., Beiko, R. G., Coutinho, P.,
Dacks, J. B., Durnford, D. G., Fast, N. M., Green, B. R., Gris-
dale, C. J., Hempel, F., Henrissat, B., H€oppner, M. P., Ishida,
K., Kim, E., Ko�ren�y, L., Kroth, P. G., Liu, Y., Malik, S. B., Maier,
U. G., McRose, D., Mock, T., Neilson, J. A., Onodera, N. T.,
Poole, A. M., Pritham, E. J., Richards, T. A., Rocap, G., Roy, S.
W., Sarai, C., Schaack, S., Shirato, S., Slamovits, C. H., Spen-
cer, D. F., Suzuki, S., Worden, A. Z., Zauner, S., Barry, K., Bell,
C., Bharti, A. K., Crow, J. A., Grimwood, J., Kramer, R., Lind-
quist, E., Lucas, S., Salamov, A., McFadden, G. I., Lane, C. E.,
Keeling, P. J., Gray, M. W., Grigoriev, I. V. & Archibald, J. M.
2012. Algal genomes reveal evolutionary mosaicism and the
fate of nucleomorphs. Nature, 492:59–65.Halary, S., Leigh, J. W., Cheaib, B., Lopez, P. & Bapteste, E.
2010. Network analyses structure genetic diversity in indepen-
dent genetic worlds. Proc. Natl Acad. Sci. U S A, 107:127–132.Kalisky, T., Blainey, P. & Quake, S. R. 2011. Genomic analysis at
the single-cell level. Annu. Rev. Genet., 45:431–445.Komurov, K., Dursun, S., Erdin, S. & Ram, P. T. 2012. NetWalker:
a contextual network analysis tool for functional genomics.
BMC Genomics, 13:282.
Li, W. & Godzik, A. 2006. CD-HIT: a fast program for clustering
and comparing large sets of protein or nucleotide sequences.
Bioinformatics, 22:1658–1659.Luscombe, N. M., Qian, J., Zhang, Z., Johnson, T. & Gerstein, M.
2002. The dominance of the population by a selected few:
power-law behaviour applies to a wide variety of genomic prop-
erties. Genome Biol., 3:RESEARCH0040.
Mosavi, L., Minor Jr, D. & Peng, Z. 2002. Consensus-derived
structural determinants of the ankyrin repeat motif. Proc. Natl
Acad. Sci., 99:16029–16034.Newman, M. E. J. 2003. Mixing patterns in networks. Phys. Rev.
E, 67:026126.
Parra, G., Bradnam, K. & Korf, I. 2007. CEGMA: a pipeline to
accurately annotate core genes in eukaryotic genomes. Bioin-
formatics, 23:1061–1067.Proulx, S. R., Promislow, D. E. & Phillips, P. C. 2005. Network
thinking in ecology and evolution. Trends Ecol. Evol., 20:345–353.
Rice, P., Longden, I. & Bleasby, A. 2000. EMBOSS: The European
Molecular Biology Open Software Suite. Trends Genet.,
16:276–277.Rodrigue, S., Malmstrom, R. R., Berlin, A. M., Birren, B. W.,
Henn, M. R. & Chisholm, S. W. 2009. Whole genome amplifica-
tion and de novo assembly of single bacterial cells. PLoS ONE,
4:e6864.
Woyke, T., Tighe, D., Mavromatis, K., Clum, A., Copeland, A.,
Schackwitz, W., Lapidus, A., Wu, D., McCutcheon, J. P.,
McDonald, B. R., Moran, N. A., Bristow, J. & Cheng, J. F.
2010. One bacterial cell, one complete genome. PLoS ONE, 5:
e10314.
Yoon, H. S., Price, D. C., Stepanauskas, R., Rajah, V. D., Sie-
racki, M. E., Wilson, W. H., Yang, E. C., Duffy, S. & Bhattach-
arya, D. 2011. Single cell genomics reveals trophic interactions
and evolutionary history of uncultured protists. Science,
332:714–717.Zhou, J., Deng, Y., Luo, F., He, Z., Tu, Q. & Zhi, X. 2010. Func-
tional molecular ecological networksmBio, 1:e00169-10.
Zong, C., Lu, S., Chapman, A. R. & Xie, X. S. 2012. Genome-wide
detection of single-nucleotide and copy-number variations of a
single human cell. Science, 338:1622–1626.
SUPPORTING INFORMATION
Additional Supporting Information may be found in the
online version of this article:
Fig. S1. Hierarchically clustered taxonomic distribution of
single cell data from P. ovalis-like Cells 1–6. This map is
based on BLASTp hits of the predicted proteins. Note that
Cell 1 has a relatively far distance from the other five cells
mainly due to its extremely high proportion of proteobac-
terial hits. Bacteria/phages are marked as red text and
eukaryotes as green text.
Table S1. Assembly output for the six Paulinella ovalis-like
cells studied in our work. The 454 reads were assembled
using the native Roche GS De Novo Assembler Software
V2.5 beta.
© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists
Journal of Eukaryotic Microbiology 2014, 61, 399–403 403
Cheng et al. Single Cell Network Analysis