Exploring Biotic Interactions Within Protist Cell Populations Using Network Methods

SHORT COMMUNICATION

Shu Cheng(a,1), Dana C. Price(a,1), Slim Karkar(b) & Debashish Bhattacharya(a,b)

a Department of Ecology, Evolution and Natural Resources, Rutgers University, New Brunswick, New Jersey, 08901, USA

b Institute of Marine and Coastal Science, Rutgers University, New Brunswick, New Jersey, 08901, USA

Keywords

Assortativity; Bigelowiella natans; network

analysis; Paulinella ovalis; scale-free

network; single cell genomics.

Correspondence

D. Bhattacharya, Rutgers University, 59

Dudley Road, Foran Hall 102, New

Brunswick, New Jersey, 08901, USA

Telephone number: +1 848-932-6218;

FAX number: +1 732-932-8746;

e-mail: [email protected]

1 Equal contribution made by these authors.

Received: 4 October 2013; revised 6

January 2014; accepted February 2, 2014.

doi:10.1111/jeu.12113

ABSTRACT

The study of diseased human cells and of cells isolated from the natural envi-

ronment will likely be revolutionized by single cell genomics (SCG). Here, we

used protein similarity networks to explore within- and between-cell DNA dif-

ferences from SCG data derived from six individual rhizarian cells related to

Paulinella ovalis and proteins from the complete genome of another rhizarian,

Bigelowiella natans. We identified shared and distinct DNA components within

our SCG data and between P. ovalis and B. natans. We show that network

properties such as assortativity and degree effectively discriminate genome

features between SCG assemblies and that SCG data follow the power law

with a small number of protein families dominating networks.

NETWORK methods offer rapid and powerful tools for

analyzing complex information such as genomic or functional genomic data (e.g. Bapteste et al. 2012; Barabási and Oltvai 2004; Komurov et al. 2012; Zhou et al. 2010).

Rather than studying collections of individual genes, it is

possible with network methods to interpret the biology of

cells and interactions between them as a system com-

prised of thousands of interacting components. Genome-

wide protein networks can be built using measures of

similarity (e.g. with a BLAST cutoff value) to create edges

that connect the different nodes (proteins) with groups of

related sequences in a network graph referred to as a con-

nected component (Beauregard-Racine et al. 2011). The

components represent genes and gene families that are

shared among the studied genomes. Network analysis

does not require predictive gene modeling and can be

implemented with freely available tools such as Cytoscape

(http://www.cytoscape.org/). This straightforward and

inclusive approach has been used to identify ancient con-

nections between members of gene families (i.e. when

relaxed cutoff values are used), for defining horizontal

gene transfer events, and for recognizing gene fusions

that may have important functional consequences (e.g.

Alvarez-Ponce et al. 2013; Beauregard-Racine et al. 2011).

These types of weak or reticulate evolutionary signal are

often difficult to assess using phylogenetic methods that

rely on simultaneous multiple sequence alignments to

generate bifurcating trees. Metagenomic and single cell

genome (SCG) data from natural samples (e.g. Halary

et al. 2010; Kalisky et al. 2011; Yoon et al. 2011) offer

other promising targets for applying network approaches

to study gene distribution.

Despite the promise, SCG methods still require consid-

erable refinement because these genome assemblies may

show significant coverage bias introduced by multiple dis-

placement amplification (MDA, Rodrigue et al. 2009; Wo-

yke et al. 2010 [although the recently developed MALBAC

procedure may ameliorate this issue; Zong et al. 2012]). In

addition, the challenge remains to assemble and analyze

complex DNA mixtures that include the host DNA and

potentially, associated nucleic acids from symbionts,

pathogens, and prey (Bhattacharya et al. 2012, 2013; Yoon

et al. 2011). Given that population-level SCG data will

likely become more widely available, here we explore the

use of network methods to study the protein comple-

ments of a wild-caught sample of microbial eukaryotes.

The data are derived from six individual Paulinella ovalis-

like cells (phagotrophic rhizarian protists) that form a sister

© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists

Journal of Eukaryotic Microbiology 2014, 61, 399–403

Journal of Eukaryotic Microbiology ISSN 1066-5234

Published by the International Society of Protistologists

group to photosynthetic Paulinella species and were previ-

ously isolated from a single sample collected in Chesa-

peake Bay, U.S. All of the P. ovalis-like cells share 100%

small subunit rDNA identity (Bhattacharya et al. 2012).

The MDA-derived DNA was sequenced using the Roche

454 and Illumina platforms (for details, see Table S1 and

Bhattacharya et al. 2012). A seventh and complete prote-

ome from the photosynthetic (chlorarachniophyte) rhizarian

Bigelowiella natans (Curtis et al. 2012) was included in the

analysis to provide a eukaryotic reference point when

interpreting the P. ovalis-like network graphs.

Our primary goal was to create network data structures

that highlight both the (meta) genomic similarities and dif-

ferences between the six conspecific protist cells, and

more broadly, between the two rhizarian species. All six

P. ovalis-like cells were sampled at the same time and location; therefore, the network components that discriminate between these genome data (i.e. are “assortative”, see below) represent local differences in biotic interactions among the individual cells and their environment. Assortativity in this instance is a measure of the tendency of sequences from a given SCG assembly to preferentially connect to each other in a connected component. The nominal assortativity coefficient r_i for each SCG and between each SCG was calculated to determine if protein sequences tend to be assortative with respect to origin.

This approach affords a priori identification and selection

of interesting genome features (i.e. a “visual snapshot”)

prior to computationally exhaustive BLAST and phyloge-

nomic analyses, and can act as a final check on metage-

nome sequencing data to ensure that variability has been

properly described.

MATERIALS AND METHODS

Protein similarity networks

The program EGN (Halary et al. 2010; http://www.evol-net.fr) was used to build sequence similarity networks, defined by BLASTp protein sequence identity ≥ 40% and an e-value threshold < 1e-5. Each subgraph (or component) in the network with such edges represents an operational gene family whose sequences do not share

significant similarity to other components. Proteins in the

data sets were annotated using Blast2GO (Conesa et al.

2005).
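The thresholding step described above can be illustrated with a minimal sketch. It is not the EGN implementation: the hit tuples, protein names, and function names are hypothetical, and only the two cutoffs (identity ≥ 40%, e-value < 1e-5) come from the text. Components are recovered with a simple union-find.

```python
from collections import defaultdict

# Hypothetical BLASTp hits: (query, subject, percent identity, e-value).
# Protein names carry an invented cell prefix for illustration only.
hits = [
    ("cell1_p1", "cell2_p7", 62.0, 1e-30),
    ("cell2_p7", "cell3_p2", 45.5, 1e-8),
    ("cell1_p4", "cell5_p9", 38.0, 1e-12),   # fails the identity cutoff
    ("cell4_p3", "cell6_p1", 71.0, 1e-3),    # fails the e-value cutoff
    ("cell4_p3", "cell5_p9", 90.0, 1e-50),
]

MIN_IDENTITY = 40.0   # percent identity, as stated in the text
MAX_EVALUE = 1e-5     # hits must have an e-value below this value


def build_edges(hits, min_identity=MIN_IDENTITY, max_evalue=MAX_EVALUE):
    """Keep one undirected edge per protein pair that passes both cutoffs."""
    edges = set()
    for query, subject, identity, evalue in hits:
        if query != subject and identity >= min_identity and evalue < max_evalue:
            edges.add(tuple(sorted((query, subject))))
    return edges


def connected_components(edges):
    """Group nodes into connected components with a simple union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Largest component first
    return sorted(groups.values(), key=len, reverse=True)


edges = build_edges(hits)
components = connected_components(edges)
```

In this toy input, three edges survive and form two components; proteins without any retained edge never enter the graph and would need to be tracked separately as singletons.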

Two undirected networks, G-6 and G-7, were generated: G-6 was made using open reading frames (ORFs) of

≥ 30 amino acids extracted via EMBOSS (Rice et al. 2000)

and clustered via CD-HIT (Li and Godzik 2006) at 92.5%

amino acid similarity from each assembly of six P. ovalis

cells (Bhattacharya et al. 2012), whereas G-7 was made

using ORFs from the six individual P. ovalis cells and the

predicted proteome from B. natans (Curtis et al. 2012). All

analyses are based on the G-6 network unless otherwise

specified. The 454 sequence data used to assemble the

P. ovalis-like genomes are archived at the NCBI Sequence

Read Archive (SRA) under Accession SRA049870.1. The

454 assemblies for each P. ovalis-like cell and the ORFs

used for this study are available at http://dbdata.rutgers.edu/data/ovalis/.

Homology network

We constructed a homology network, G-h, in which “homology edges” were retained from the G-6 network only when the minimal match coverage was ≥ 90% of the query sequence length. This cutoff limited the network to putative homologs and removed over one-half (56%) of the proteins in the G-6 network because they had a null degree of connection (i.e. they were singletons). The G-h network contained 10,726 components with a total of 25,152 proteins, based on only the 454 data from the six P. ovalis-like cells. Given the large number of components representing one or a few genes or gene families, only the 131 connected components containing ≥ 5 proteins were used to calculate assortativity (see below).
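The coverage filter and the size cutoff can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: coverage is assumed here to be the aligned length on the query divided by the query length, and the hit tuples, protein names, and function names are hypothetical.

```python
# Hypothetical hits: (query, subject, aligned length on the query, query length).
hits = [
    ("a1", "b1", 95, 100),    # 95% of the query covered: kept
    ("b1", "c1", 180, 200),   # exactly 90%: kept
    ("c1", "d1", 50, 120),    # about 42%: dropped
    ("e1", "f1", 300, 310),   # about 97%: kept
]
# All proteins in the starting network, including two with no retained hit.
proteins = ["a1", "b1", "c1", "d1", "e1", "f1", "g1", "h1"]


def homology_edges(hits, min_cov_percent=90):
    """Retain edges whose match covers at least min_cov_percent of the query."""
    edges = set()
    for query, subject, aligned, query_len in hits:
        # Integer comparison avoids floating-point issues at the boundary.
        if query != subject and aligned * 100 >= min_cov_percent * query_len:
            edges.add(tuple(sorted((query, subject))))
    return edges


def singleton_fraction(proteins, edges):
    """Fraction of proteins left with a null degree after filtering."""
    connected = {node for edge in edges for node in edge}
    return sum(1 for p in proteins if p not in connected) / len(proteins)


def large_components(components, min_size=5):
    """Keep only components with at least min_size proteins."""
    return [c for c in components if len(c) >= min_size]


edges = homology_edges(hits)
removed = singleton_fraction(proteins, edges)
```

Here `removed` plays the role of the 56% figure reported above, and `large_components` corresponds to restricting the assortativity calculation to components of size ≥ 5.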

Network visualization and analysis

We analyzed the major component of network G-7 using

the Louvain community detection method (Blondel et al.

2008). This is a graph clustering heuristic based on modu-

larity optimization that identifies densely connected

regions of a graph and resolves a hierarchical clustering

structure within networks. On the basis of the clusters

provided by the algorithm at the first level, we visualized

the main component of G-7 using Cytoscape, with each

cluster corresponding to a node. The weight of the links

between the cluster nodes corresponds to the number of

links between the proteins in the original network G-7.
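The cluster-level view can be sketched as below, assuming a partition of proteins into clusters is already available (for example from a community detection run). The code does not implement the Louvain method itself; the partition, protein names, and source labels are invented for illustration.

```python
from collections import Counter

# Hypothetical protein-level edges and a hypothetical first-level partition
# (protein -> cluster id), such as a community detection method might return.
edges = [
    ("c1_a", "c2_a"), ("c1_a", "bn_a"), ("c2_a", "bn_a"),
    ("c3_b", "c4_b"), ("c2_a", "c3_b"), ("bn_a", "c4_b"),
]
cluster_of = {"c1_a": 1, "c2_a": 1, "bn_a": 1, "c3_b": 2, "c4_b": 2}
source_of = {
    "c1_a": "Cell 1", "c2_a": "Cell 2", "bn_a": "B. natans",
    "c3_b": "Cell 3", "c4_b": "Cell 4",
}


def cluster_graph(edges, cluster_of):
    """Weight each cluster pair by the number of protein links between them."""
    weights = Counter()
    for a, b in edges:
        ca, cb = cluster_of[a], cluster_of[b]
        if ca != cb:  # links inside a cluster do not create a cluster edge
            weights[tuple(sorted((ca, cb)))] += 1
    return weights


def cluster_composition(cluster_of, source_of):
    """Count proteins per source in each cluster (the data behind a pie chart)."""
    composition = {}
    for protein, cluster in cluster_of.items():
        composition.setdefault(cluster, Counter())[source_of[protein]] += 1
    return composition


weights = cluster_graph(edges, cluster_of)
composition = cluster_composition(cluster_of, source_of)
```

The resulting `weights` and `composition` tables are the kind of edge and node attributes that could then be imported into a viewer such as Cytoscape.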

We measured assortativity in the connected compo-

nents from the homology network G-h with respect to the

six SCGs. Here, consider the element $e_{ij}$ of a 2 × 2 mixing matrix $X$, which denotes the observed frequency of edges between proteins of two categories $i$ and $j$:

$$
X = \begin{array}{c|cc|c}
 & i & j & \Sigma \\ \hline
i & e_{ii} & e_{ij} & a_i \\
j & e_{ji} & e_{jj} & a_j \\ \hline
\Sigma & a_i & a_j & 1
\end{array}
$$

In the undirected case, $e_{ij} = e_{ji} = \frac{e_{ij} + e_{ji}}{2}$, $e_{ii}$ denotes the frequency of links between nodes of the same category $i$, and $a_i$ is the frequency of links involving proteins of category $i$. Then, the assortativity coefficient is defined similarly to Newman (2003):

$$
r_i = \frac{\sum_i e_{ii} - \sum_i a_i^2}{1 - \sum_i a_i^2}
$$

Here, proteins of category i are proteins from the cell to

be considered, whereas proteins of category j are the pro-

teins from the other five cells. A positive coefficient indi-

cates that nodes of the same category tend to be linked

together, whereas a negative value indicates that nodes

of different categories tend to be linked together.
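A minimal sketch of the two-category calculation follows, with one cell treated as category i and all other cells as category j. The function name, node names, and origin labels are hypothetical. The handling of the undefined case (all edges in a single category) is an assumption, since the text does not say how such components were treated.

```python
def assortativity(edges, origin, cell):
    """Two-category assortativity: `cell` (category i) versus all other cells (j)."""
    n = len(edges)
    if n == 0:
        return None
    within_i = within_j = between = 0
    for a, b in edges:
        a_in, b_in = origin[a] == cell, origin[b] == cell
        if a_in and b_in:
            within_i += 1
        elif not a_in and not b_in:
            within_j += 1
        else:
            between += 1
    e_ii = within_i / n
    e_jj = within_j / n
    e_ij = between / (2 * n)      # symmetric: e_ij = e_ji, half an edge each
    a_i = e_ii + e_ij             # frequency of links involving category i
    a_j = e_jj + e_ij
    sum_a2 = a_i ** 2 + a_j ** 2
    if sum_a2 == 1:               # every edge lies in one category: undefined
        return None
    return (e_ii + e_jj - sum_a2) / (1 - sum_a2)


origin = {"x1": "Cell 1", "x2": "Cell 1", "y1": "Cell 2", "y2": "Cell 3"}

# Edges only within categories: perfectly assortative
r_same = assortativity([("x1", "x2"), ("y1", "y2")], origin, "Cell 1")
# Edges only between categories: perfectly disassortative
r_mixed = assortativity([("x1", "y1"), ("x2", "y2")], origin, "Cell 1")
# All edges inside one category: coefficient undefined
r_undefined = assortativity([("x1", "x2")], origin, "Cell 1")
```

The first two toy cases give the extreme values of the coefficient (1 and -1), which bracket the positive and negative interpretations described above.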



RESULTS AND DISCUSSION

Analysis of protein composition

As described above, here we used networks primarily as

a visualization tool for complex SCG data to generate

knowledge and testable hypotheses (e.g. Atkinson et al.

2009) about the ecology and genome biology of P. ovalis-

like cells. Analysis of the individual P. ovalis-like SCG

assemblies using the Core Eukaryotic Genes Mapping

Approach (CEGMA; http://korflab.ucdavis.edu/datasets/cegma/) showed that across all cells, 240 unique hits (BLASTx e-value cutoff ≤ 1e-10) exist to the KOG database of 458 core eukaryotic proteins (Parra et al. 2007).

Individual cell assemblies, however, show between 24 (Cell 5) and 141 (Cell 1) significant hits to these core proteins (Cell 6 was not included in the analysis), and only 11 protein hits were shared across all the cell assemblies. Of a

set of 248 “ultra-conserved” eukaryotic proteins available

in CEGMA, we recovered hits to 122 of these sequences

across all cells. These data demonstrate that the com-

bined 454 SCG assemblies account for about one-half of

the expected set of core eukaryotic proteins and that

these individual cell assemblies provide largely nonoverlap-

ping data with regard to the expected nuclear genome

complement of P. ovalis-like cells. To explore whether this

result reflects only issues associated with genome amplifi-

cation bias or a combination of bias and insufficient data,

we re-ran CEGMA with Cells 1 and 2 for which we had

generated significantly more Illumina paired-end data (4.7

and 3.8 Gbp, respectively, Bhattacharya et al. 2012). No

such data existed for Cells 3–6. Again, using a BLASTx

e-value cutoff ≤ 1e-10, we found 356/458 (75.5%) of the

core eukaryotic proteins and 180/248 (72.5%) of the ultra-

conserved set in the Cell 1 assembly. These numbers

were 333/458 (72.7%) and 164/248 (66.1%), respectively,

for Cell 2. Therefore, the large majority of core eukaryotic

proteins are present in these P. ovalis-like MDA samples,

although they were not recovered in the exploratory 454

sequencing approach used here for the six cells.

Protein networks

The G-7 network was created using ORFs of length > 30

amino acids from 180 to 308 Mbp of 454 data from each

P. ovalis-like cell (Table S1) and included predicted pro-

teins from the B. natans genome project. The largest

connected component joining proteins of a putative common origin (i.e. either a shared ancestry of the entire protein or of a protein domain) consisted of 1,043

sequences. The cluster-based representation of this com-

ponent (Fig. 1A) uses a pie chart to show the relative

abundance of cells for each cluster. This analysis identi-

fies the broadly conserved proteins or protein domains

(i.e. > 30 amino acids in length) that are shared by all six

P. ovalis-like cells along with B. natans. Within the com-

ponent, the largest cluster (#1; Fig. 1A) contains constitu-

tive components of eukaryote genomes (predominantly

kinases and ribonucleotide/nucleoside [ATP] binding

enzymes). The two clusters sharing a dense edge (#2,

#3) provide an example of the network’s ability to differ-

entiate between functional gene families and simple

domain sharing. Both clusters are enriched in ankyrin

repeat domain-containing proteins (ARD, 167 of 238 total

sequences). This widespread domain (Mosavi et al. 2002)

is responsible for the dense edge between meth-

yltransferases from cluster #3 and proteins involved in

signal transduction in cluster #2. This result emphasizes

the capacity of cluster-based analysis to correctly isolate

signal (the shared ARD domain) from noise in classifying

proteins that have distinct functions.

To identify proteins associated with each cell that poten-

tially provide insights into shared and unique biotic interac-

tions, we computed assortativity of the sequences from

the six SCG cells in the G-h network (that only included P.

ovalis-like SCG data). A heatmap of assortativity values for

each cell (Fig. 1B) shows that these data, as expected

due to their conspecificity, generally lack an assortative
pattern. Cell 1 has the largest amount of unique

DNA associated with it, which results in very high assort-

ativity (red region indicated with an arrow) for a small

number of components. This analysis provides a rapid

method for assessing the relative contribution of noneuk-

aryote DNA associated with a SCG assembly. BLASTp

analysis confirms this result by showing that Cell 1 is the

most divergent SCG in this analysis with a large number

of proteobacterial proteins (Fig. S1). The cumulative

increase in the number of individual protein families (i.e.

components) as cells are added to the analysis (Fig. 1C)

demonstrates the complexity of natural samples, whereby

each SCG introduces novel DNA associated with it. With

deeper sequencing, the collection of P. ovalis-like protein

families will presumably converge (e.g. at ca. 10–15 K

genes as in many unicellular protists; see Fig. 1C) as the

host genome is fully sequenced. Additional growth will

derive from biotic interactions that give rise to foreign

DNAs present in the environment and associated with

individual cells, and can be identified via their assortativity

patterns.
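The accumulation of components as cells are added can be sketched as follows. This is a hypothetical illustration of the counting idea only: the component contents, cell labels, and order of addition are invented, and the text does not specify how the curve in Fig. 1C was computed.

```python
def cumulative_components(components, cell_of, order):
    """Count components with at least one protein from the cells added so far."""
    cells_per_component = [{cell_of[p] for p in comp} for comp in components]
    seen, counts = set(), []
    for cell in order:
        seen.add(cell)
        counts.append(sum(1 for cells in cells_per_component if cells & seen))
    return counts


# Hypothetical components (protein families) and protein-to-cell labels.
components = [{"c1_a", "c2_a"}, {"c1_b"}, {"c2_b", "c3_a"}, {"c3_b"}]
cell_of = {
    "c1_a": "Cell 1", "c1_b": "Cell 1",
    "c2_a": "Cell 2", "c2_b": "Cell 2",
    "c3_a": "Cell 3", "c3_b": "Cell 3",
}

counts = cumulative_components(components, cell_of, ["Cell 1", "Cell 2", "Cell 3"])
```

A curve that keeps rising with each added cell, as in this toy case, corresponds to the situation described above in which every SCG contributes novel components.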

It is widely recognized that various genome features

(e.g. types of protein folds, nucleotide patterns; Luscombe

et al. 2002; Proulx et al. 2005) follow power-law distribu-

tions, whereby a few types of features are dominant. We

find here that the probability that a protein in our network

has k edges to another node (or degree k) also follows a

power-law distribution P(k) ~ k^(-γ), in which γ = 2.39. The

protein families with the greatest number of edges in

Fig. 1D (e.g. Na(+)-dependent inorganic phosphate cotransporter, prolyl aminopeptidase) are ancient, functionally important proteins that have undergone gene

duplication and divergence, processes thought to underlie

the power-law behavior of DNA data (Luscombe et al.

2002). The degree distribution shown here provides the

opportunity to identify, in uncharacterized SCG data, con-

served eukaryotic proteins for downstream analysis.
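One simple way to estimate the exponent of such a degree distribution is a least-squares line on the log-log plot, sketched below. The fitting method is an assumption, since the text does not state how the exponent was obtained, and the function names and synthetic counts are hypothetical.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Map each degree k to the number of nodes with that degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def fit_power_law_exponent(k_counts):
    """Least-squares slope of log P(k) versus log k; returns gamma = -slope."""
    if len(k_counts) < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    total = sum(k_counts.values())
    xs = [math.log(k) for k in k_counts]
    ys = [math.log(n / total) for n in k_counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# A star graph: one hub of degree 3 and three leaves of degree 1
star = degree_counts([("hub", "a"), ("hub", "b"), ("hub", "c")])

# Synthetic counts that follow n(k) proportional to k^-2 exactly
synthetic = {1: 1600, 2: 400, 4: 100, 8: 25}
gamma = fit_power_law_exponent(synthetic)
```

On real data the tail of the distribution is noisy, so a straight-line fit of this kind is best read as a rough summary of the log-log plot.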

In summary, our exploratory study demonstrates the

utility of network topology structure to accurately repre-

sent the complex nature of SCG data derived from natural




environments. We show that network community analysis

effectively highlights shared genome components and is

robust against arbitrary sequence similarity, whereas prop-

erties such as assortativity and degree allow the identifica-

tion of both outlier cells and core sets of shared proteins

for detailed study. As the growing power of SCG (e.g.

Zong et al. 2012) allows the assembly of near complete

SCGs and associated biota, it will be possible to empha-

size genome differences in entire populations of cells,

both free living and as tissue constituents (e.g. cancer

cells) from normal and diseased states. The use of differing identity and coverage values will result in components that can serve downstream uses such as phylogenetics, single nucleotide polymorphism analysis, and the identification of unique and shared prey, symbionts, and pathogens (e.g. Bhattacharya et al. 2012; Yoon et al. 2011).

ACKNOWLEDGMENTS

This work was partially supported by NSF grants 0827023

and 0936884. SC was supported by the Gordon and Betty

Moore Foundation through Grant GBMF2807 to Paul Fal-

kowski at Rutgers University.

LITERATURE CITED

Alvarez-Ponce, D., Lopez, P., Bapteste, E. & McInerney, J. O. 2013. Gene similarity networks provide tools for understanding eukaryote origins and evolution. Proc. Natl Acad. Sci. USA, 110:E1594–1603.

Atkinson, H. J., Morris, J. H., Ferrin, T. E. & Babbitt, P. C. 2009. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS ONE, 4:e4345.

Figure 1 Network analysis of proteins from Paulinella ovalis-like Cells 1–6. A. Network representation of genome data from six P. ovalis-like cells

and from Bigelowiella natans reveals the extent of protein sequence sharing between the targeted genomes. Circles with pie charts show the

level of representation of different cells and of B. natans in the communities. Edges (gray lines) are weighted by the number of sequences con-

necting any two communities. B. Heatmap showing cell-based assortativity coefficients. The arrow identifies the relatively high assortativity of

Cell 1 data. C. Cumulative growth in unique gene families (represented by component counts in the network) based on 454 data from the six

SCGs (light gray panel, black solid line shows the linear fit). The broken line represents the approximate gene family number in the target eukary-

ote that does not continue to grow with the addition of SCG data once the nuclear genome has been fully sequenced with SCG analysis. D.

Scale-free degree distribution of protein connections (k) in the P. ovalis-like cell networks is described as a straight line on this log–log plot.



Bapteste, E., Lopez, P., Bouchard, F., Baquero, F., McInerney, J. O. & Burian, R. M. 2012. Evolutionary analyses of non-genealogical bonds produced by introgressive descent. Proc. Natl Acad. Sci. USA, 109:18266–18272.

Barabási, A. L. & Oltvai, Z. N. 2004. Network biology: understanding the cell’s functional organization. Nat. Rev. Genet., 5:101–113.

Beauregard-Racine, J., Bicep, C., Schliep, K., Lopez, P., Lapointe, F. J. & Bapteste, E. 2011. Of woods and webs: possible alternatives to the tree of life for studying genomic fluidity in E. coli. Biol. Direct, 6:39.

Bhattacharya, D., Price, D. C., Yoon, H. S., Yang, E. C., Poulton, N. J., Andersen, R. A. & Das, S. P. 2012. Single cell genome analysis supports a link between phagotrophy and primary plastid endosymbiosis. Sci. Rep., 2:356.

Bhattacharya, D., Price, D. C., Bicep, C., Bapteste, E., Sarwade, M., Rajah, V. D. & Yoon, H. S. 2013. Identification of a marine cyanophage in a protist single-cell metagenome assembly. J. Phycol., 49:207–212.

Blondel, V., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. 2008. Fast unfolding of communities in large networks. J. Stat. Mech., 2008:P10008.

Conesa, A., Gotz, S., Garcia-Gomez, J. M., Terol, J., Talon, M. & Robles, M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21:3674–3676.

Curtis, B., Tanifuji, G., Burki, F., Gruber, A., Irimia, M., Maruyama, S., Arias, M., Ball, S., Gile, G., Hirawaka, Y., Hopkins, J., Kuo, A., Ransing, S., Schmutz, J., Symeonidi, A., Elias, M., Eveleigh, R., Herman, E., Klute, M., Nakayama, T., Obornik, M., Reyes-Prieto, A., Armbrust, E. V., Aves, S., Beiko, R. G., Coutinho, P., Dacks, J. B., Durnford, D. G., Fast, N. M., Green, B. R., Grisdale, C. J., Hempel, F., Henrissat, B., Höppner, M. P., Ishida, K., Kim, E., Kořený, L., Kroth, P. G., Liu, Y., Malik, S. B., Maier, U. G., McRose, D., Mock, T., Neilson, J. A., Onodera, N. T., Poole, A. M., Pritham, E. J., Richards, T. A., Rocap, G., Roy, S. W., Sarai, C., Schaack, S., Shirato, S., Slamovits, C. H., Spencer, D. F., Suzuki, S., Worden, A. Z., Zauner, S., Barry, K., Bell, C., Bharti, A. K., Crow, J. A., Grimwood, J., Kramer, R., Lindquist, E., Lucas, S., Salamov, A., McFadden, G. I., Lane, C. E., Keeling, P. J., Gray, M. W., Grigoriev, I. V. & Archibald, J. M. 2012. Algal genomes reveal evolutionary mosaicism and the fate of nucleomorphs. Nature, 492:59–65.

Halary, S., Leigh, J. W., Cheaib, B., Lopez, P. & Bapteste, E. 2010. Network analyses structure genetic diversity in independent genetic worlds. Proc. Natl Acad. Sci. USA, 107:127–132.

Kalisky, T., Blainey, P. & Quake, S. R. 2011. Genomic analysis at the single-cell level. Annu. Rev. Genet., 45:431–445.

Komurov, K., Dursun, S., Erdin, S. & Ram, P. T. 2012. NetWalker: a contextual network analysis tool for functional genomics. BMC Genomics, 13:282.

Li, W. & Godzik, A. 2006. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22:1658–1659.

Luscombe, N. M., Qian, J., Zhang, Z., Johnson, T. & Gerstein, M. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol., 3:RESEARCH0040.

Mosavi, L., Minor Jr, D. & Peng, Z. 2002. Consensus-derived structural determinants of the ankyrin repeat motif. Proc. Natl Acad. Sci. USA, 99:16029–16034.

Newman, M. E. J. 2003. Mixing patterns in networks. Phys. Rev. E, 67:026126.

Parra, G., Bradnam, K. & Korf, I. 2007. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23:1061–1067.

Proulx, S. R., Promislow, D. E. & Phillips, P. C. 2005. Network thinking in ecology and evolution. Trends Ecol. Evol., 20:345–353.

Rice, P., Longden, I. & Bleasby, A. 2000. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet., 16:276–277.

Rodrigue, S., Malmstrom, R. R., Berlin, A. M., Birren, B. W., Henn, M. R. & Chisholm, S. W. 2009. Whole genome amplification and de novo assembly of single bacterial cells. PLoS ONE, 4:e6864.

Woyke, T., Tighe, D., Mavromatis, K., Clum, A., Copeland, A., Schackwitz, W., Lapidus, A., Wu, D., McCutcheon, J. P., McDonald, B. R., Moran, N. A., Bristow, J. & Cheng, J. F. 2010. One bacterial cell, one complete genome. PLoS ONE, 5:e10314.

Yoon, H. S., Price, D. C., Stepanauskas, R., Rajah, V. D., Sieracki, M. E., Wilson, W. H., Yang, E. C., Duffy, S. & Bhattacharya, D. 2011. Single cell genomics reveals trophic interactions and evolutionary history of uncultured protists. Science, 332:714–717.

Zhou, J., Deng, Y., Luo, F., He, Z., Tu, Q. & Zhi, X. 2010. Functional molecular ecological networks. mBio, 1:e00169-10.

Zong, C., Lu, S., Chapman, A. R. & Xie, X. S. 2012. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science, 338:1622–1626.

SUPPORTING INFORMATION

Additional Supporting Information may be found in the

online version of this article:

Fig. S1. Hierarchically clustered taxonomic distribution of

single cell data from P. ovalis-like Cells 1–6. This map is

based on BLASTp hits of the predicted proteins. Note that

Cell 1 has a relatively far distance from the other five cells

mainly due to its extremely high proportion of proteobac-

terial hits. Bacteria/phages are marked as red text and

eukaryotes as green text.

Table S1. Assembly output for the six Paulinella ovalis-like

cells studied in our work. The 454 reads were assembled

using the native Roche GS De Novo Assembler Software

V2.5 beta.


