2014 davis-talk

  • 1. Genomics and bioinformatics in non-model organisms: where is the data tidal wave taking us? C. Titus Brown, Assistant Professor, Microbiology; Computer Science; BEACON, Michigan State University. Feb 2014. [email protected]

2. Practical implications of sequencing - One graduate student; two transcriptomes; three draft genomes; in four years. Molgula oculata; Molgula occulta; Ciona intestinalis. (Elijah Lowe)
3. Research: Agricultural genomics & transcriptomics; Metagenomics (environmental & host-associated); Novel computational approaches; Computing + Biology Education and training; Good software development; Capacity building; Evo-devo genomics & transcriptomics; Open science/source/data/access.
4. Research: Agricultural genomics & transcriptomics; Metagenomics (environmental & host-associated); Novel computational approaches; Computing + Biology Education and training; Good software development; Capacity building; Evo-devo genomics & transcriptomics; Open science/source/data/access.
5. Our research philosophy: Enable good biology by generating hypotheses worth testing. Try to maximize sensitivity of analyses, in light of fairly high specificity in sequencing-based approaches. Collaborate intensively on research projects. Typically, share graduate students with wet labs. Goal is to cross-train everyone involved.
6. Three mini-stories: (1) Building better gene models for chicken; (2) Dealing with an endless stream of data; (3) Evaluating the effect of gene model completeness on pathway prediction.
7. 1. Building a better chicken (gene model). Most extant computational tools focus on model organisms: assume low polymorphism (internal variation); assume quality reference genome or transcriptome; assume somewhat reliable functional annotation. More significant compute infrastructure requirements. How can we best use mRNAseq for chicken? (Likit Preeyanon)
8. Interpreting RNAseq requires gene models: http://www.hitseq.com/images/RNA-seq_AS.jp
9. Marek's Disease project: To identify alternative splicing that contributes to disease resistance. w/Hans Cheng, USDA ADOL. Inbred line 6; inbred line 7.
10. 
Types of Alternative Splicing (figure labels: 40%; 25%; confirms).
    Dataset            Single-end mapped     Single-end unmapped   Paired-end mapped     Paired-end unmapped
    Line 6 uninfected  18,375,966 (77.93%)   5,203,586 (22.07%)    21,598,218 (64.16%)   12,065,659 (35.84%)
    Line 6 infected    17,160,695 (73.18%)   6,288,286 (26.82%)    15,274,638 (63.89%)   8,633,855 (36.11%)
    Line 7 uninfected  18,130,072 (75.77%)   5,795,737 (24.22%)    20,961,033 (63.67%)   11,960,299 (36.33%)
    Line 7 infected    19,912,046 (78.51%)   5,450,521 (21.49%)    22,485,833 (65.22%)   11,992,002 (34.78%)
22. Cross-validation w/read splicing: 95% of splice junctions have more than three spliced reads.
23. Splice junction comparison (Venn diagram). Totals: Assembled transcripts 104,366; GenBank mRNA 74,065; Expressed Sequence Tags 209,134. Region counts: 7,756; 2,412; 21,128; 46,132; 17,765; 34,694; 110,543. 95% of splice junctions supported by > 4 reads.
24. Gimme pipeline: Our pipeline can detect many isoforms. Local assembly enhances isoform detection. Cufflinks (mapping-based gene models) is not superior to de novo transcriptome assembly in chicken. (Was Cufflinks trained on mouse/human?) The pipeline can be used to build gene models for other organisms. Pipeline can do incremental combining of new data sets.
25. How to detect differential splicing (figure with spliced-read counts and read coverage).
26. Exon Region Comparison (figure with spliced-read counts and read coverage).
27. Skipped Exon (DEXSeq)
28. Skipped Exon: sulfatase
29. BRCA1 domain; alternative 3' UTR. DNA repair, apoptosis, DNA replication, genome stability.
30. Differential Exon Usage Summary (number of exons). Chromosome 1; total 3,728 genes.
    Adjusted p-value   False    True
    0.1                18,631   66
    0.01               18,656   41
    0.001              18,663   34
    Next steps: scaling analysis to entire genome. And interpretation (??)
31. Gene model thoughts - Can build gene models that represent the data we have fairly well; robust exon-exon splice site reporting; planning ahead for multiple iterations of new data; interpretation of results? See story 3.
32. 2. Endless data! It is now under $1000 to generate a new mRNAseq data set. 
Collaborators routinely generate new data sets every 3-6 months (note: each of them, x 5-10). How can we make use of this data iteratively!?
33. Making iterative use of new data. Data! Refined gene models; existing gene models; differential expression?? Some data will yield new gene models, but much will be redundant (e.g. housekeeping genes).
34. Digital normalization
35. Digital normalization
36. Digital normalization
37. Digital normalization
38. Digital normalization
39. Digital normalization
40. Digital normalization approach. A digital analog to cDNA library normalization, diginorm: is single pass (looks at each read only once); does not collect the majority of sequencing errors; keeps all low-coverage reads; enables analyses that are otherwise completely impossible; integrated into several assemblers (Trinity and
41. Evaluating on ascidians (sea squirts): Molgula oculata; Molgula occulta; Ciona intestinalis.
42. Diginorm applied to Molgula embryonic mRNAseq set aside ~90% of data.
    Sample             No. reads     Reads kept   Percentage kept
    M. occulta F+3     42,174,510    15,642,268   ?
    M. occulta F+3     50,018,302    6,012,894    ?
    M. occulta F+4     44,948,983    3,499,935    ?
    M. occulta F+5     53,692,296    2,993,715    ?
    M. occulta F+6     45,782,981    2,774,342    ?
    M. occulta Total   236,617,072   30,923,154   13%
    M. oculata F+3     47,045,433    10,754,899   ?
    M. oculata F+4     52,890,938    3,949,489    ?
    M. oculata F+6     50,156,895    2,874,196    ?
    M. oculata Total   150,093,266   17,578,584   11.70%
43. But: does diginorm lose transcript information? No. (Venn diagrams: diginorm vs. raw assemblies for M. occulta and M. oculata, each compared against C. intestinalis.) Reciprocal best hit vs. Ciona; BLAST e-value cutoff: 1e-6. (Elijah Lowe)
44. Where are we taking diginorm? Streaming online algorithms only look at data ~once. Diginorm is streaming, online. Conceptually, can move many aspects of sequence analysis into streaming mode. => Extraordinary potential for computational efficiency.
45. 
=> Streaming, online variant calling. Single pass, reference free, tunable, streaming online variant calling. Potentially quite clinically useful. See NIH BIG DATA grant, http://ged.msu.edu/
46. Prospective: sequencing tumor cells. Goal: phylogenetically reconstruct causal driver mutations in face of passenger mutations. 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence. Most of this data will be redundant and not useful. Developing diginorm-based algorithms to eliminate data while retaining variant information. See NIH BIG DATA grant, http://ged.msu.edu/
47. 3. Evaluating effects of gene models on pathway prediction. Vertically integrated comparison. (Likit Preeyanon)
48. KEGG Pathway
49. Ensembl: Enriched KEGG Pathway
    Term                                           Count   Benjamini
    Cytokine-cytokine receptor interaction         36      6.2E-02
    Lysosome                                       25      1.2E-01
    Apoptosis                                      19      3.5E-01
    Arginine and proline metabolism                12      3.1E-01
    Starch and sucrose metabolism                  9       3.4E-01
    Toll-like receptor signaling pathway           19      3.7E-01
    Natural killer cell mediated cytotoxicity      17      3.4E-01
    Cytosolic DNA-sensing pathway                  9       4.2E-01
    Valine, leucine and isoleucine degradation     11      4.1E-01
    Glutathione metabolism                         10      4.3E-01
    NOD-like receptor signaling pathway            11      4.6E-01
    Intestinal immune network for IgA production   9       5.6E-01
    VEGF signaling pathway                         14      5.6E-01
    PPAR signaling pathway                         13      6E-01
50. Gimme: Enriched KEGG Pathway
    Term                                           Count   Benjamini
    Cytokine-cytokine receptor interaction         34      3.7E-02
    Toll-like receptor signaling pathway           22      2.7E-02
    Jak-STAT signaling pathway                     28      3.4E-02
    Arginine and proline metabolism                13      4.5E-02
    Lysosome                                       22      1.3E-01
    Natural killer cell mediated cytotoxicity      17      1.6E-01
    Alanine, aspartate and glutamate metabolism    9       1.8E-01
    Amino sugar and nucleotide sugar metabolism    10      3.6E-01
    Cysteine and methionine metabolism             9       4E-01
    ECM-receptor interaction                       16      3.7E-01
    Apoptosis                                      16      3.7E-01
    Glycolysis / Gluconeogenesis                   11      4E-01
    DNA replication                                8       3.8E-01
    Cell adhesion molecules (CAMs)                 19      4.6E-01
    PPAR signaling pathway                         12      6E-01
    Intestinal immune network for IgA production   8       6.1E-01
51. 
Compared Enriched KEGG Pathway Term
    Common: Cytokine-cytokine receptor interaction; Toll-like receptor signaling pathway; Lysosome; Apoptosis; Arginine and proline metabolism; Natural killer cells; Intestinal immune network for IgA production; PPAR signaling pathway.
    Ensembl: Starch and sucrose; Valine, leucine and isoleucine degradation; Glutathione metabolism; NOD-like receptor signaling pathway; VEGF signaling pathway.
    Gimme: Jak-STAT signaling pathway; Alanine, aspartate and glutamate metabolism; Amino sugar and nucleotide sugar metabolism; ECM-receptor interaction; Cell adhesion molecules (CAMs); DNA replication.
52. Ensembl; Common; Gimme
53. INFB: we annotate UTR not present in other gene models.
54. INFB: 3' bias + missing UTR => insensitive
55. Ensembl; Common; Gimme
56. So, where does this leave us? Our methods for generating hypotheses from mRNAseq data are sensitive to references & technical details of the approaches. (This is expected but Bad.) We can build (and have built!) approaches that we believe to be more accurate for non- or semi-model organisms. (They're also open; try 'em out.) => Standards for execution, evaluation, comparison, and education.
57. khmer-protocols: Effort to provide standard cheap assembly protocols for the cloud. Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers). Open, versioned, forkable, citable. (Announced at Davis in December '13!) Stages: Read cleaning; Diginorm; Assembly; Annotation; RSEM differential expression.
58. CC0; BSD; on github; in reStructuredText.
59. Summer NGS workshop (2010-2017)
60. A few thoughts on our approach: Explicitly a protocol - explicit steps, copy-paste, customizable. No requirement for computational expertise or significant computational hardware. ~1-5 days to teach a bench biologist to use. $100-150 of rental compute (cloud computing) for $1000 data set. Adding in quality control and internal validation steps.
61. Can we crowdsource bioinformatics? 
We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let's take advantage of it!) "It's as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that's madness - who on Earth would create such an amazing resource?" http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
62. Where is the data tidal wave taking biology!? A world with a lot more data, and, eventually, a lot more information. A more integrative world: genomics, molecular function, evolution, population genetics, monitoring, ??, and models that feed back into experimental design. Data-Intensive Biology.
63. Data intensive biology & hypothesis generation: My interest in biological data is to enable better hypothesis generation.
64. Additional projects - Bacterial symbionts of bone eating worms w/Shana Goffredi (ISME, 2013). Genome of Haemonchus contortus, a parasitic nematode (with Erich Schwarz and Robin Gasser) (Genome Biology, 2013). Soil metagenome analysis (with Jim Tiedje, Susannah Tringe, and Janet Jansson) (in review, PNAS). Lamprey transcriptome (with Weiming Li) (in preparation). Ascidian genomes and transcriptomes (with Billie Swalla) (in preparation). Loligo pealeii (the giant axon squid): 5 transcriptomes and skim genome posted publicly (Feb 2014).
65. In progress: Cattle paratuberculosis analysis (w/Paul Coussens). Improving the chick genome using nth-generation sequencing technology (PacBio, Moleculo). ...and building software and protocols to make it easy for the next 1000 genomes.
66. Moleculo data vs chick genome (figure: % of reads aligning vs. read length). (Luiz Irber)
67. What are the challenges ahead? Obviously: Genotype/phenotype mapping. But also: Conserved unknown/unannotated genes. Data sharing, and more generally open access/data/source/science. Data integration!
68. 
The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome". "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum - a genomic bandwagon effect." Slide courtesy Erich Schwarz. Ref.: Pandey et al. (2014), PLoS One 11, e88889.
69. Thanks!
70. Thanks! References and grants at http://ged.msu.edu/research.html Software at http://github.com/ged-lab/ Blog at http://ivory.idyll.org/blog/ Twitter: @ctitusbrown E-mail me: [email protected]
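
Appendix sketch for slide 40 (digital normalization). The single-pass idea described there can be illustrated roughly as follows: each read is looked at once, its median k-mer abundance is estimated from the reads kept so far, and the read is kept only if that median is still below a coverage cutoff. This is a minimal sketch under stated assumptions, not the khmer implementation: the function names, the exact in-memory `Counter` (khmer uses a memory-efficient probabilistic counting structure instead), and the values of `k` and `cutoff` are all illustrative choices.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence (hypothetical helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=20, cutoff=20):
    """Single-pass normalization sketch: keep a read only while its
    estimated coverage (median k-mer count so far) is below the cutoff."""
    counts = Counter()  # assumption: exact counts, fine only for small inputs
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        # Counter returns 0 for unseen k-mers without storing them.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)  # only kept reads contribute to the counts
    return kept


# Toy input: one highly redundant read plus one rare read.
reads = ["ACGTACGTAC"] * 100 + ["TTTTGGGGCC"]
kept = diginorm(reads, k=4, cutoff=5)
print(len(reads), "reads in;", len(kept), "reads kept")
```

In this toy run the redundant read stops being kept once its k-mers reach the cutoff, while the rare read is retained, which mirrors the "keeps all low-coverage reads" property listed on the slide.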