


Features

February 2014 © Biochemical Society

Molecular Evolution

Abbreviations: FACS, fluorescence-activated cell sorting; HGT, horizontal gene transfer; MDA, multiple displacement amplification; NGS, next-generation sequencing; rDNA, ribosomal DNA; Rep, replication-associated protein; SCG, single-cell genomics; WGA, whole genome amplification.

Key words: eukaryote, metagenome, picobiliphyte, Sanger dideoxy sequencing

Studying the single life of eukaryotic microbes

Single-cell genomics of marine plankton

The oceans teem with innumerable single cells living in microenvironments. Understanding who they are, what they eat and what infects them can inform us about the true diversity of plankton, their biotic interactions and how they may respond to a changing environment. Analysing to significant depth the genomes and ‘gut’ (i.e. the food vacuole and other contents) of individual wild-caught cells would have been thought impossible only a few years ago. However, the rapidly expanding field of single-cell genomics, powered by modern cell-sorting procedures, high-throughput DNA sequencing and bioinformatics methods, holds the promise to revolutionize understanding of the biodiversity and ecology of eukaryotic microbes and their places in the tree of life.

Debashish Bhattacharya, Rajat S. Roy, Dana C. Price and Alexander Schliep (Rutgers University, USA)

Biologists have long struggled to catalogue unicellular eukaryotic life, such as the tiny heterotrophs, bacterial-sized phytoplankton, diatoms and dinoflagellates often collectively referred to as protists; not primarily because these lineages are difficult to access – many are ubiquitous – but rather because few tools existed that would allow the in-depth study of uncultured single cells. It is now widely recognized that the vast majority of microbes cannot be cultured using current methods [1]. This limitation means that significant DNA or RNA data (in this case, genome wide) is retrievable only from multicellular lineages or from species for which millions of cultured cells are on hand to apply what are now referred to as ‘classical’ molecular biology approaches, such as plasmid libraries of cloned cDNA and genomic DNA, and Sanger dideoxy sequencing. As a result, our understanding of the biology and diversity of microbial eukaryotes had relied for many years on a handful of free-living or important parasitic species (e.g. the diatom Thalassiosira pseudonana, the extremophilic red alga Cyanidioschyzon merolae, the green alga Chlamydomonas reinhardtii, common baker’s yeast Saccharomyces cerevisiae, giardiasis agent Giardia lamblia and the malarial parasite Plasmodium falciparum) that could be brought into long-term culture and studied in detail. An outcome of this reliance on ‘weeds’ or ‘pests’ to understand eukaryotes and to populate the eukaryotic tree of life was that our perspectives were skewed by small genomes (cheaper to sequence) and gene inventories associated with highly specialized parasites.

This all began to change as the current DNA sequencing revolution took hold and it became possible to generate billions (if not trillions) of bases of sequence data for thousands (if not hundreds) of dollars. This revolution continues and per-base sequencing costs continue their decline as next-generation sequencing (NGS) platforms produced by Illumina, Ion Torrent, Pac-Bio and ABI among others (e.g. nanopore DNA sequencing) continue to develop. For example, Sanger sequencing costs $1500/million base pairs of DNA (Mbp), whereas the current Illumina MiSeq and the Ion Torrent PGM technologies cost approximately $0.50/Mbp and $0.63/Mbp respectively.

As the NGS revolution unfolded, scientists began to sequence mixtures of total DNA sampled from the natural environment (i.e. metagenomes) and to computationally connect (i.e. assemble) overlapping genome sequencing reads into fragments (contigs). Taxonomic assignment of the encoded gene inventories, without knowing exactly which cells gave rise to which pieces of DNA, was an often imprecise process based strictly on sequence similarity. The first and most famous of these experiments was the Sorcerer II global circumnavigation led by J. Craig Venter that relied on Sanger sequencing to produce 7.7 million random shotgun DNA sequences totalling 6.3 Gbp of data [2]. This massive collection of data, by past


standards, necessitated development of the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) database for its management. Since then, 116 additional public metagenomic datasets (http://camera.calit2.net/camdata.shtm; retrieved 8 September 2013) that have primarily relied on NGS methods have been added to the site. In addition, entities such as the National Center for Genome Research have undertaken ventures to sequence 700 algal species from 200 genera and the Tara Oceans Expedition (http://oceans.taraexpeditions.org/) has begun analysis of 27 000 metagenomic samples gathered over a 2-year period from an around-the-globe collection regime.

Given access to unprecedented amounts of cheap sequence data and methods to access environmental DNA, a further breath-taking development in this area was the rise of single-cell genomics (SCG) and transcriptomic methods. SCG was recently named the ‘Method of the Year 2013’ by Nature Methods (www.nature.com/nmeth/journal/v11/n1/full/nmeth.2801.html) and is the result of key technological developments. These include the ability to isolate single cells with high fidelity using fluorescence-activated cell sorting (FACS) or microfluidic methods, followed by whole genome amplification (WGA) using multiple displacement amplification (MDA) or the MALBAC procedure. These approaches allow the generation of micrograms of DNA from the trace amounts of nucleic acids derived from a single cell. The union of NGS, FACS/microfluidics and WGA, powered by rapidly developing bioinformatics methods, has the potential to fundamentally change our knowledge of microbial eukaryote biology, biodiversity and phylogeny. Below, we briefly describe the strengths and limitations of SCG data and some solutions to their shortfalls that are under development. Thereafter, we describe collaborative work that our laboratory has been involved in which highlights some interesting results gained from SCG research with planktonic species.

Some challenges and pitfalls when assembling SCG data

Although it has become routine to use WGA to generate sufficient DNA from single cells for NGS, these approaches do have shortcomings. Most significantly, MDA methods often result in massive fluctuation in the amount of DNA that is amplified from different genome regions, leading to coverage bias [3]. (In the present article, coverage refers to the number of sequence reads that align on average beneath each nucleotide in an assembled piece of DNA; higher uniform coverage is better.) Error correction tools rely on statistical methods designed with implicit assumptions about the coverage distribution of the read library. Therefore reads in WGA-amplified libraries require special error correction methods that take their uneven coverage into account.
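The definition of coverage given above can be made concrete with a minimal sketch. It assumes a simplified, hypothetical input in which each aligned read is a (start, read_length) pair on one contig; real alignment formats carry more detail. The coefficient of variation is used here only as one convenient way to express how uneven the coverage is, not as the measure used by any particular tool.

```python
from statistics import mean, pstdev


def per_base_coverage(contig_length, alignments):
    """Count how many reads cover each position of a contig.

    alignments: iterable of (start, read_length) pairs with 0-based starts.
    This is a hypothetical, simplified stand-in for real alignment records.
    """
    diff = [0] * (contig_length + 1)
    for start, read_length in alignments:
        if start < 0 or start >= contig_length:
            continue  # ignore reads that start outside the contig
        end = min(start + read_length, contig_length)
        diff[start] += 1   # read begins covering here
        diff[end] -= 1     # read stops covering here
    coverage, depth = [], 0
    for delta in diff[:contig_length]:
        depth += delta
        coverage.append(depth)
    return coverage


def coverage_summary(coverage):
    """Return (average coverage, coefficient of variation)."""
    avg = mean(coverage)
    cv = pstdev(coverage) / avg if avg > 0 else float("inf")
    return avg, cv


# Two toy libraries with the same average coverage (2x) on a 10 nt contig.
even = per_base_coverage(10, [(0, 5), (5, 5), (0, 5), (5, 5)])
biased = per_base_coverage(10, [(0, 5), (0, 5), (0, 5), (0, 5)])

print(even, coverage_summary(even))
print(biased, coverage_summary(biased))
```

Both toy libraries have the same average coverage, but in the second one half of the contig receives no reads at all. That is the kind of unevenness that breaks tools which assume a roughly uniform coverage distribution.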

DNA emerging from a single cell (image by Susanne Ruemmele, Bhattacharya Laboratory)


Assembly tools also utilize coverage information to resolve repeats in genome assembly and thus have to be specially designed to produce reasonable draft genome assemblies from sequence libraries with uneven coverage. Given these constraints, specialized SCG assemblers such as SPAdes 2.4 [4] and IDBA-UD have been developed that demonstrate greater ability to reconstruct contiguous DNA fragments when assembling bacterial single-cell libraries (http://bioinf.spbau.ru/spades/). However, due to genome features such as tandem/microsatellite repeats, base compositional biases and intron content, generating high-quality eukaryotic single-cell assemblies is still a challenge and remains an active field of research. What is clear is that with eukaryotic MDA samples, current bioinformatics tools often provide fragmented and/or incomplete (genomes in thousands of pieces and missing regions), chimaeric (disparate genome regions co-assembled artificially into single fragments) or over-assembled (draft assembly being significantly larger than the reference assembly) genomes (D. Bhattacharya, R.S. Roy, D.C. Price and A. Schliep, unpublished work). Given the wealth of information that can be gleaned from cells using SCG, the challenge is to maximize the quantity and quality of the information that is extracted out of such assemblies and thereby explore the biology and evolution of unculturable organisms.

Figure 1. Analysis of SCG data from individual picobiliphyte cells. (a) Partial maximum likelihood phylogeny of picobiliphyte small subunit rDNA coding regions. The results of RAxML bootstrap values (when ≥60%) are shown above the branches. GenBank® numbers are shown for each sequence. The three MDA samples discussed in the present article are shown in boldface. Note that the cells were derived from the same Boothbay Harbor water sample, yet they represent the three most divergent lineages of picobiliphytes that have a worldwide distribution. (b) Taxonomic distribution of BLASTx hits (when ≥10; if <10, different hits are ‘Others’) of 454-derived contigs from each picobiliphyte SCG assembly. The total number of hits (green bars) and the non-redundant (unique) gene hits (blue bars) are shown for each cell. There is clear enrichment of some groups such as viruses in MS584-5 and Bacteroidetes in MS584-22 that may be explained by MDA bias. (c) Genome structure of the novel ssDNA virus found in cell MS584-5. (d) Partial maximum likelihood tree of viral Rep proteins from representative ssDNA viruses showing the phylogenetic position of the MS584-5-associated sequence. The bootstrap values (when ≥60%) above the branches are from an RAxML analysis.

Picobiliphytes: the kingdom that lost photosynthesis

We will describe below some successes of SCG that hint at its future promise, with the first case study being the picobiliphytes. These protists, although not widely known outside of marine ecology and oceanography circles, are of high interest because they were described in 2007 as a novel lineage of unicellular pigmented eukaryotes [5]. The erection of a new branch in the tree of life is always exciting, and is rare for eukaryotes, and in this case was doubly interesting because the cells were thought to be photosynthetic and therefore must contribute to ocean primary production. Picobiliphytes were initially identified using microscopy and their phylogenetic affinity and identity were determined using 18S ribosomal DNA (rDNA) sequence comparisons and rDNA fluorescent in situ hybridization (FISH) probes respectively [5]. Although picobiliphyte cell ultrastructure was unknown in 2007, autofluorescence and DAPI DNA staining data suggested that the cells contained a cryptophyte alga-derived photosynthetic organelle (plastid) and associated remnant nucleus of the captured eukaryote [5]. Because these taxa could not be cultivated, it remained uncertain whether they contained a permanent plastid, a ‘stolen’ temporary plastid or simply consumed cryptophytes as food (i.e. algae localized in a vacuole). To address these issues and to gain basic knowledge about their biology, Yoon et al. [6] analysed three individual picobiliphyte cells (informally named Boothbay MS584-5, Boothbay MS584-11 and Boothbay MS584-22) that were isolated from a 50 ml water sample collected off a dock in Boothbay Harbor in the Gulf of Maine, USA (Figure 1a). In this first attempt at protist SCG, approximately 100 Mbp of Roche 454 (NGS) data were generated from DNA libraries made from each cell's MDA sample. These reads were then used to generate a draft assembly of each picobiliphyte.

Analysis of these 454 SCG data (later supplemented with significant Illumina sequencing reads) showed the absence of plastid DNA or nuclear genes related to photosynthesis in the draft assemblies, suggesting that picobiliphytes are heterotrophs and that the DAPI staining evidence of a cryptophyte-derived plastid was likely explained by food DNA. The finding of plastid absence in picobiliphytes was recently validated by Seenivasan et al. [7], who carried out an exhaustive electron microscopic analysis of captured cells and renamed the group ‘picozoans’ to recognize their phagotrophic lifestyle.

An equally intriguing aspect of this initial SCG study of unicellular eukaryotes was the dominance in one cell (MS584-5) by reads from a novel ssDNA nanovirus (Figure 1b), most likely representing host cell infection. This result suggested that natural pathogens or symbionts of cells could be identified in situ without cultivation. A putative replication-associated protein (Rep) encoded on the nanovirus genome turned out to be widely distributed in ocean environments when the Rep sequence was used to query marine metagenome data (Figure 1c). Interestingly, a relative of the picobiliphyte Rep protein is present on the genome of a large DNA virus that infects the bloom-forming haptophyte alga Phaeocystis globosa (GenBank® accession number YP_008052687). In the second cell (MS584-11), a significant fraction of DNA sequences was related to Bacteroidetes and associated phages, suggesting this protist grazed phage-infected bacteria. The third picobiliphyte cell (MS584-22) also contained DNA originating from Bacteroidetes as well as from large DNA viruses (Figure 1b), both of which are likely to be protist food sources. This ‘snapshot-in-time’ SCG approach uncovered unexpectedly complex biotic interactions unique to each cell without the introduction of cultivation biases. These data also demonstrated that individual heterotrophic cells captured in the ocean are not pristine sources of ‘host’ DNA, but rather may represent ‘single-cell metagenomes’ that can inform us about the biology of the lineages. Nonetheless, the latter interpretation of the picobiliphyte SCG data has the obvious bugaboo of contamination, either through the co-sorting of non-target cells (unlikely for large-sized eukaryotes) or, more likely, through the intimate association with the protist cell wall of prokaryotes or phages/viruses that become part of the MDA sample, but may not be of biological relevance. The latter issue is difficult to address, but the finding that different cells isolated from the same water sample have distinct foreign DNAs associated with them argues for (but does not prove) biological relevance. More troubling would be if a bacterium and/or phage was universally present in each MDA pool and represented a common environmental contaminant in all FACS samples.
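The contamination argument above amounts to a simple set comparison, which can be sketched as follows. The function name and the taxon labels are hypothetical; the labels only paraphrase the groups named in the text and are not the actual BLASTx results. A taxon found in every cell would be a candidate common contaminant, whereas taxa found in only one cell are more suggestive of cell-specific biology.

```python
def shared_and_unique_taxa(hits_by_cell):
    """Split foreign taxa into those shared by all cells and those unique to one.

    hits_by_cell: dict mapping a cell name to the set of foreign taxa
    detected in its MDA sample (illustrative labels, not real data).
    """
    taxa_sets = [set(taxa) for taxa in hits_by_cell.values()]
    shared = set.intersection(*taxa_sets) if taxa_sets else set()
    unique = {}
    for cell, taxa in hits_by_cell.items():
        # Everything seen in any other cell.
        others = set().union(
            *(set(t) for c, t in hits_by_cell.items() if c != cell)
        )
        unique[cell] = set(taxa) - others
    return shared, unique


# Illustrative labels loosely following the text; not the published data.
example = {
    "MS584-5": {"ssDNA nanovirus"},
    "MS584-11": {"Bacteroidetes", "phage"},
    "MS584-22": {"Bacteroidetes", "large DNA virus"},
}
shared, unique = shared_and_unique_taxa(example)
print("Present in every cell:", shared)
print("Unique to each cell:", unique)
```

An empty shared set is consistent with, but does not prove, the interpretation that the foreign DNA reflects each cell's own interactions.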

You become what you eat, literally

The notion that we can understand the biology of marine taxa without cultivation was taken in another direction in two separate studies carried out by our group on a lineage of marine amoebae that holds a key position in the tree of life. These taxa, named Paulinella ovalis (or P. ovalis-like for formally undescribed sister taxa), are phagotrophs (just like picozoans) that eat cyanobacteria. Perhaps most important from the evolutionary perspective is the fact that these phagotrophic amoebae are sister to the only other known clade of eukaryotes (i.e. besides algae and plants) to have harnessed a cyanobacterium as a plastid, Paulinella chromatophora (Figure 2a). Described in 1895 by the German naturalist Robert Lauterborn [8], these siliceous scale-bearing amoebae contain a bright green organelle (‘chromatophore’) derived from an alpha-cyanobacterium that functions as a permanent plastid, allowing P. chromatophora to live autotrophically. This unique situation, i.e. a photosynthetic cell with a permanent cyanobacterium-derived plastid (P. chromatophora) being sister to phagotrophs that eat cyanobacteria (P. ovalis), inspired us to look into the feeding habits of P. ovalis-like cells.

The simple question was: can we show a direct connection between food and organelle source facilitated by phagotrophy? This question again had to be answered using SCG because P. ovalis has never been cultured in the laboratory. To study the role of feeding behaviour in plastid origin, we generated SCG data from six P. ovalis-like cells isolated from Chesapeake Bay, USA. Analysis of the assembled contigs from these cells showed that many were derived from prey DNA of alpha-cyanobacterial origin (i.e. Prochlorococcus or Synechococcus species) and, surprisingly, their associated cyanophages (see below). Therefore a direct line of inference could be made from amoeba food source to plastid origin through phagotrophy. Furthermore, we found two examples of horizontal gene transfer (HGT) in our P. ovalis-like genome data. An example shown in Figure 2(b) illustrates an alpha-cyanobacterial-derived gene encoded on a P. ovalis-like genome contig. Because this gene was derived from the same cyanobacterial lineage as the prey DNA recovered in our sequencing, we reasoned that the HGT donor had in the past likely been ingested by P. ovalis-like taxa. This work provided the first evidence of a direct link between feeding behaviour in wild-caught cells, HGT and plastid primary endosymbiosis in a photosynthetic lineage [9].

Figure 2. Analysis of SCG data from heterotrophic P. ovalis-like cells. (a) Maximum likelihood tree inferred from small subunit rDNA showing the phylogenetic position of heterotrophic P. ovalis-like cells related to photosynthetic P. chromatophora (green text) within the kingdom Rhizaria. Single-cell sorting identified several P. ovalis-like cells that comprise two distinct heterotrophic Paulinella clades (Clade 1 and Clade 2) of which Clade 1 is most closely related to the photosynthetic taxa. Cell 1 (red text) from this clade was the focus of our in-depth study. RAxML bootstrap values are shown above and below the branches when ≥60%. The GenBank® accession numbers are shown after taxon names. (b) An assembled contig from the nuclear genome of P. ovalis-like cell 1 that contains three genes, two of which are of amoeba origin and one in the middle of the contig that is of alpha-cyanobacterial origin. The intron distribution (exons shown as green boxes) and average genome coverage of the contig are also shown.

Now what about the cyanophage that was identified in the P. ovalis-like cells? The finding of Prochlorococcus/Synechococcus DNA in the amoeba draft assemblies led us to search these contigs for hits to T4-like marine cyanobacterial myoviruses (dsDNA viruses) that are referred to as cyanophages. These phages specifically infect alpha-cyanobacteria and play a crucial role in the global ocean carbon cycle by causing lysis and export to the deep ocean of carbon derived from a dominant form of marine plankton. The search strategy was to look for contigs ≥3 Kbp in size with an average coverage ≥10×, and the presence of at least one cyanophage gene. Using a combination of 454 and Illumina reads from one P. ovalis-like MDA sample (4.7 Gbp of data), this analysis turned up a total of 208 741 bp (19 contigs) of putative cyanophage DNA that encoded 179 proteins [10]. Given that many cyanophage genomes range from approximately 170 to 200 Kbp in size, it seemed that we had captured the majority of one genome in our SCG data. When we summed contig lengths, we found that 50% of the partially assembled cyanophage genome (provisionally named phage PoL_MC2) was encoded on the four largest contigs (Figure 3a). The largest assembled cyanophage genome fragment was 42 225 bp in size with average coverage of 4228× and encoded 44 proteins. Because no DNA polymorphisms were found among the thousands of assembled reads for this contig, we hypothesized that a single genotype of PoL_MC2 was likely present in the P. ovalis-like cell. To place this novel cyanophage in an evolutionary context, we used five concatenated ‘photosynthetic phage’ genes (cobS, hsp20, mazG, phoH, psbA and HLIP) to infer its phylogenetic position. This tree shows PoL_MC2 to be most closely related to S-SM2 and P-SSM2 in the marine cyanomyovirus clade MC2 (Figure 3b).

Figure 3. Analysis of a cyanophage genome assembly derived from P. ovalis-like cell 1 SCG data. (a) Cumulative length of the predicted phage PoL_MC2 genome using the 19 contigs that had top hits to complete cyanophage genome sequences. Note that 50% of the putative PoL_MC2 genome is encoded on the four largest contigs. (b) NeighborNet analysis of a multi-protein dataset (cobS, mazG, hsp20, phoH and PsbA; 1074 amino acids) using SplitsTree4 with the exclusion of gaps and parsimony–uninformative sites. The values at the branches (when ≥50%) are RAxML bootstrap values. The major recognized clades of cyanophages are shown in different colours.

A potential application of SCG: on-the-fly pathogen genome assembly

The potentially transformative impact of SCG on the fields of ecology, systematics and environmental biology should (we hope) be obvious by now to the reader. However, SCG is also revolutionizing medical research (e.g. see the Nature Methods special issue for details). Rather than summarizing this rapidly expanding field, in the present article, we describe a potential application of SCG to screening for emerging pathogens that currently is under development in our group. In this case, it would be highly advantageous if computational analysis could be done simultaneously with sequencing reactions instead of post-data collection to facilitate rapid identification of pathogens. With current technology, such as the Illumina MiSeq instrument, partial reads can be extracted after initial cycles are completed and online base-calling can be done external to the sequencer, yielding partial reads of increasing length during the sequencing run. Single-cell assembly is usually already performed in iterations for k-mers (a specific n-tuple of choice used to search for matching regions of DNA during assembly) of increasing length, typically 21, 31 and 41, up

to the full sequence read length (which is usually 100 nt, 150 nt or 300 nt for Illumina machines), with short k-mers helping to recover low coverage regions and longer k-mers improving the length of the contiguous assembled genome fragments (see Table 1).

Table 1. Iterative assembly statistics with artificially truncated sequencing reads (to simulate real partial data) and corresponding percentage of true frequent k-mers (i.e. that can be exactly mapped to the reference E. coli genome) in those reads. The iterative assembly was done using ABySS and the frequent k-mer counting step was carried out using scTurtle64 (R.S. Roy, D. Bhattacharya and A. Schliep, unpublished method).

Truncated read length (nt)    N50 (nt)    Number of contigs    Average contig length (nt)    True frequent k-mers, k=31 (%)    True frequent k-mers, k=41 (%)
35                            11 686      2243                 1536                          88.41                             0
55                            30 833      1043                 3311                          98.51                             96.78
65                            32 063      879                  3934                          99.21                             98.6
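The relationship between read length and k-mer length in Table 1 can be illustrated with a short sketch. A read of length L contains L − k + 1 overlapping k-mers, and none at all when it is shorter than k. This is consistent with the 0 reported for k=41 at a truncated read length of 35 nt. The function names are hypothetical and the code is not taken from the assembler or k-mer counter named in the table.

```python
def kmers(read, k):
    """Return all overlapping k-mers of a read (empty if the read is shorter than k)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmers_per_read(read_length, k):
    """Number of k-mers a read of the given length can contribute."""
    return max(read_length - k + 1, 0)


# Truncated read lengths and k values taken from Table 1.
for read_length in (35, 55, 65):
    counts = {k: kmers_per_read(read_length, k) for k in (31, 41)}
    print(read_length, counts)

# A toy read shows the same rule on actual sequence.
print(kmers("ACGTAC", 4))
```

Under this rule a 35 nt partial read already yields five 31-mers, so early assembly iterations with short k-mers can start before longer k-mers become available.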

To explore this behaviour, we performed a simulation to demonstrate that the iterative assembly process can be successfully carried out on partial (i.e. incomplete) sequencing reads as they accumulate, without having to wait for the entire run to be completed (e.g. 2 days or longer). For an MDA-amplified Escherichia coli library (average genome coverage of 761×, sequencing read length of 100 nt), Table 1 shows that starting from partial reads of length 35 nt (with k=31), the N50 (i.e. the length of the shortest contig among the longest contigs that together contain at least one-half of the total length of all contigs; higher is better) is greater than 10 000 nt and the average contig length is greater than 1500 nt. The reason for this is that a very large proportion of frequently occurring k-mers are already observed in partial reads which are not much longer than k (Table 1); in other words, longer reads (i.e. the full sequencing run) do not contribute much information about short k-mers used in the early stages of the assembly. Therefore iterative assembly with partial reads is feasible to support rapid response SCG. This approach can be enhanced to adapt existing bioinformatics tools to deal more effectively with partial reads and provide higher N50 values. One obvious application of such a tool would be to identify, rapidly, the genome of a pathogen using an isolated single cell (i.e. be it of human origin or from an environmental source).
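The N50 statistic used above can be computed with a few lines of code. The sketch below follows the usual definition (sort contigs from longest to shortest and report the length at which the running total first reaches half of the assembly size). The function name and toy lengths are illustrative and are not the values behind Table 1.

```python
def n50(contig_lengths):
    """Return the N50 of a set of contig lengths (0 for an empty assembly)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            # This contig is the one that pushes the running sum past one-half.
            return length
    return 0


# Toy assembly: total length 30, so half is 15; 10 + 8 = 18 reaches it.
toy_contigs = [2, 10, 4, 8, 6]
print(n50(toy_contigs))
```

Because N50 depends only on the distribution of contig lengths, it rewards contiguity but says nothing about correctness. A chimaeric assembly can have a high N50.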

In summary, given the long-term interest of our group in endosymbiosis and genome evolution, the picobiliphyte SCG data [6] allowed us to uncover the complex biotic interactions of wild-caught heterotrophs, a process that helped us better understand the connection between phagotrophy and plastid primary endosymbiosis in the Paulinella lineage [9]. Furthermore, by generating several gigabases of data from one P. ovalis-like cell, we were able not only to recover significant alpha-cyanobacterial (prey) DNA from the MDA sample, but also to putatively drill into the prey to assemble a near-complete infecting cyanophage genome at a high coverage [10]. This level of detail into single-cell biology has not been possible in the past and results in (at least) four major insights that can guide future investigations: (i) unicellular eukaryotes have complex biotic interactions in the natural environment, suggesting their genomes must be adapted to these constraints rather than the axenic (i.e. bacteria free) cultures used in laboratory studies. Exploring this aspect of protist biology is in its infancy; (ii) their varied feeding behaviour in the wild may explain why most cells are impossible to cultivate using traditional methods; (iii) the phylogenetically distinct mixtures (e.g. eukaryote, cyanobacterium and cyanophage in P. ovalis-like cells [10]) of DNAs that come in intimate contact in heterotrophs may provide these cells access to sources of HGT and the genetic innovation and adaptation that can be wrought by foreign DNA; and (iv) the emerging applications of SCG to rapidly generate draft genome data from single cells may allow us to address human health issues in ways that were not possible previously.

References

1. DeLong, E.F. (1997) Trends Biotechnol. 15, 203–207
2. Rusch, D.B., Halpern, A.L., Sutton, G. et al. (2007) PLoS Biol. 5, e77
3. Woyke, T., Xie, G., Copeland, A. et al. (2009) PLoS One 4, e5299
4. Bankevich, A., Nurk, S., Antipov, D. et al. (2012) J. Comput. Biol. 19, 455–477
5. Not, F., Valentin, K., Romari, K. et al. (2007) Science 315, 253–255
6. Yoon, H.S., Price, D.C., Stepanauskas, R. et al. (2011) Science 332, 714–717
7. Seenivasan, R., Sausen, N., Medlin, L.K. and Melkonian, M. (2013) PLoS One 8, e59565
8. Lauterborn, R. (1895) Z. Wiss. Zool. 59, 537–544
9. Bhattacharya, D., Price, D.C., Yoon, H.S. et al. (2012) Sci. Rep. 2, 356
10. Bhattacharya, D., Price, D.C., Bicep, C. et al. (2013) J. Phycol. 49, 207–212

Debashish Bhattacharya studied marine ecology and environmental science at Dalhousie University, Canada, and received a PhD in biology from Simon Fraser University, Canada. He is currently a Distinguished Professor in the Department of Ecology, Evolution, and Natural Resources and the Institute of Marine and Coastal Science at Rutgers University, USA. His group works in the fields of algal evolution, endosymbiosis, genomics (both standard and single-cell approaches) and biofuel research (see http://dblab.rutgers.edu/).

Rajat Shuvro Roy received his BSc in Computer Science and Engineering from Bangladesh University of Engineering and Technology. At present, he is a PhD student in the Department of Computer Science, Rutgers University–New Brunswick, USA. He is jointly supervised by Professor Alexander Schliep and Professor Debashish Bhattacharya. His research interests are genome assembly, contig scaffolding, cache efficient algorithms and single-cell genomics.

Dana Price is a laboratory researcher and bioinformatician in the Bhattacharya laboratory at Rutgers University, and is concurrently completing a PhD in vector biology. He works in all aspects of genomics, from isolation and culturing of algae and other protists to next-generation library preparation, sequencing and subsequent bioinformatics analysis. He played a key role in the completion of the first marine eukaryote genome from a single cell and has worked extensively on glaucophyte and red algal genome biology.

Alexander Schliep is an Associate Professor in computer science and quantitative biology at Rutgers University. His main research area is bioinformatics and his interests include machine learning, statistical models and algorithms for analysing complex, heterogeneous data from molecular biology. A current focus is on compressive genomics for high-throughput sequencing data and analysis of single-cell genome data.