2014 sage-talk

  • 1. Making assembly cheap & easy, and consequences thereof. C. Titus Brown, Assistant Professor, CSE, MMG, BEACON, Michigan State University. Feb 2014.

2. Generally, yay #openscience! Everything discussed here: Code: github.com/ged-lab/ (BSD license). Blog: http://ivory.idyll.org/blog ("titus brown blog"). Twitter: @ctitusbrown. Grants on lab web site: http://ged.msu.edu/research.html. Preprints: on arXiv, q-bio ("diginorm arxiv").

3. Problem under consideration: shotgun metagenomics. Collect samples; extract DNA; feed into sequencer; computationally analyze. "Sequence it all and let the bioinformaticians sort it out." Image: Wikipedia, Environmental shotgun sequencing.png

4. Analogy: we seek an understanding of humanity via our libraries. Image: http://eofdreams.com/library.html

5. But our only observation tool is shredding a mixture of all of the books & digitizing the shreds. Images: http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/

6. Points: Lots of fragments needed! (Deep sampling.) Having read and understood some books will help quite a bit. (Prior knowledge.) Rare books will be harder to reconstruct than common books. Errors in OCR process matter quite a bit. The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. A categorization system would be an invaluable but not infallible guide to book topics. Understanding the language would help you validate & understand the books.

7. Investigating soil microbial communities. 95% or more of soil microbes cannot be cultured in the lab. Very little transport in soil and sediment => slow mixing rates. Estimates of immense diversity: billions of microbial cells per gram of soil; a million or more microbial species per gram of soil (Gans et al., 2005). One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory).

8. By 'soil' we understand (Vil'yams, 1931) a loose surface layer of earth capable of yielding plant crops. 
In the physical sense the soil represents a complex disperse system consisting of three phases: solid, liquid, and gaseous. Microbes live in & on: surfaces of aggregate particles; pores within microaggregates. (N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS, http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html)

9. Questions to address. Role of soil microbes in nutrient cycling: How does agricultural soil differ from native soil? How do soil microbial communities respond to climate perturbation? Genome-level questions: What kind of strain-level heterogeneity is present in the population? What are the phage and viral populations & dynamics? What species are where, and how much is shared between different geographical locations?

10. Must use culture-independent and metagenomic approaches. Many reasons why you can't or don't want to culture: cross-feeding, niche specificity, dormancy, etc. If you want to get at underlying function, 16S analysis alone is not sufficient. Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.

11. Shotgun metagenomics. Collect samples; extract DNA; feed into sequencer; computationally analyze. "Sequence it all and let the bioinformaticians sort it out." Image: Wikipedia, Environmental shotgun sequencing.png

12. Computational reconstruction of (meta)genomic content. Images: http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/

13. Points: Lots of fragments needed! (Deep sampling.) Having read and understood some books will help quite a bit. (Reference genomes.) Rare books will be harder to reconstruct than common books. Errors in OCR process matter quite a bit. (Sequencing error.) The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. 
(We don't understand most microbial function.) A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.) Understanding the language would help you validate & understand the books.

14. Great Prairie Grand Challenge: sampling locations, 2008.

15. A Grand Challenge dataset (DOE/JGI). Total: 1,846 Gbp soil metagenome. [Bar chart: basepairs of sequencing (Gbp) per sample, split by GAII and HiSeq. Samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass. Reference levels: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp.]

16. Why do we need so much data?! 20-40x coverage is necessary; 100x is ~sufficient. Mixed population sampling => sensitivity driven by lowest abundance. For example, for E. coli in 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000, or 500 Gbp of sequence (100 x 5 Mbp x 1000)! (For soil, estimate is 50 Tbp.) Sequencing is straightforward; data analysis is not. "$1000 genome with $1m analysis."

17. Great Prairie Grand Challenge goals. How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest soil data set ever sequenced, ~2010.) What can we learn about soil from looking at the reconstructed metagenome? (See list of questions.)

18. Assembly graphs scale with data size, not information. Figure from Conway T C, Bromage A J, Bioinformatics 2011;27:479-486. The Author 2011. Published by Oxford University Press.

19. The Problem. We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware. No locality to the data in terms of graph structure. Since ~2008: the field has engaged in lots of engineering optimization, but the data generation rate has consistently outstripped Moore's Law.

20. 
Several solutions: (1) more efficient exploration of data; (2) subdivide data; (3) discard redundant data.

21. Primary approach: digital normalization (a computational version of library normalization). Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Diversity vs richness. The high-coverage reads in sample A are unnecessary for

22-27. Digital normalization.

28. Diginorm is lossy compression. Nearly perfect from an information-theoretic perspective: discards 95% or more of data for genomes; loses < 0.02% of information.

29. Where are we taking this? Streaming online algorithms only look at data ~once. Diginorm is streaming, online. Conceptually, can move many aspects of sequence analysis into streaming mode. => Extraordinary potential for computational efficiency.

30. => Streaming, online variant calling. Single pass, reference free, tunable, streaming online variant calling. Potentially quite clinically useful.

31. Prospective: sequencing tumor cells. Goal: phylogenetically reconstruct causal driver mutations in face of passenger mutations. 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence. Most of this data will be redundant and not useful. Developing diginorm-based algorithms to eliminate data while retaining variant information.

32. The real challenge: understanding. We have gotten distracted by shiny toys: sequencing!! Data!! Data is now plentiful! But: we typically have no knowledge of what > 50% of an environmental metagenome means, functionally. Most data is not openly available, so we cannot mine correlations across data sets. Most computational science is not reproducible, so I can't reuse other people's tools or approaches.

33. Data intensive biology & hypothesis generation. My interest in biological data is to enable better hypothesis generation.

34. My interests. Open source ecosystem of analysis tools. 
Loosely coupled APIs for querying databases. Publishing reproducible and reusable analyses, openly. Education and training. Platform perspective.

35. Practical implications of diginorm. Data is (essentially) free; for some problems, analysis is now cheaper than data gathering (i.e. essentially free); plus, we can run most of our approaches in the cloud.

36. khmer-protocols. Effort to provide standard cheap assembly protocols for the cloud. Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set. Open, versioned, forkable, citable. Pipeline: read cleaning => diginorm => assembly => annotation => RSEM differential expression.

37. CC0; BSD; on github; in reStructuredText.

38. Can we incentivize data sharing? ~$100-$150/transcriptome in the cloud. Offer to analyze people's existing data for free, IFF they open it up within a year. See: "Dead Sea Scrolls & Open Marine Transcriptome Project" blog post; CephSeq white paper.

39. Research singularity. The data a researcher generates in their lab constitutes an increasingly small component of the data used to reach a conclusion. Corollary: the true value of the data an individual investigator generates should be considered in the context of aggregate data. Even if we overcome the social barriers and incentivize sharing, we are, needless to say, not remotely prepared for sharing all the data.

40. My interests. Open source ecosystem of analysis tools. Loosely coupled APIs for querying databases. Publishing reproducible and reusable analyses, openly. Education and training. Platform perspective.

41. IPython Notebook: data + code => IPython Notebook.

42. 
Acknowledgements. Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman. Collaborators: Jim Tiedje, MSU; Susannah Tringe and Janet Jansson (JGI, LBNL); Erich Schwarz, Caltech / Cornell; Paul Sternberg, Caltech; Robin Gasser, U. Melbourne; Weiming Li, MSU; Shana Goffredi, Occidental. Funding: USDA NIFA; NSF IOS; NIH; BEACON.
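
Note on slides 16 and 21-29: the slides name digital normalization but do not spell out the procedure, so the following is only a minimal sketch of the idea as it is commonly described: stream through the reads once, estimate each read's coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. Everything specific here is an assumption made for illustration: the function names, the default k and cutoff, the use of an exact in-memory Counter (not the memory-efficient counting in khmer), and the omission of reverse-complement handling. The small helper at the end reproduces the sequencing-depth arithmetic from slide 16.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass over reads; keep a read only while its estimated coverage is low.

    Hypothetical helper: exact k-mer counts in a Counter stand in for the
    probabilistic counting a real tool would need at metagenome scale.
    """
    counts = Counter()  # k-mer counts from the reads kept so far
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k, nothing to estimate from
        # Median k-mer count is the coverage estimate for this read.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


def total_sequencing_bp(genome_bp, coverage, dilution):
    """Bases to sequence so a genome at 1/dilution abundance reaches `coverage`."""
    return genome_bp * coverage * dilution


if __name__ == "__main__":
    abundant, rare = "ACGTTGCA", "GGATCCTA"
    sample = [abundant] * 50 + [rare]
    kept = digital_normalization(sample, k=4, cutoff=5)
    print(len(sample), "reads in,", len(kept), "reads kept")
    # Slide 16: 5 Mbp genome, 100x coverage, 1/1000 dilution.
    print(total_sequencing_bp(5_000_000, 100, 1000) / 1e9, "Gbp")
```

In the toy run, the abundant read is kept only until its k-mers reach the cutoff, while the rare read is always kept; this is the sense in which redundant high-coverage data is discarded and low-abundance data is retained.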