2013 duke-talk

1.Building better genomes, transcriptomes, andmetagenomes with improvedtechniques for de novo assembly -- an easier way to do itC. Titus BrownAssistant ProfessorCSE, MMG, BEACON Michigan State University March [email protected]

2. AcknowledgementsLab members involvedCollaborators Adina Howe (w/Tiedje) Jim Tiedje, MSU Jason Pell Erich Schwarz, Caltech / Arend HintzeCornell Rosangela Canino- Paul Sternberg, CaltechKoning Robin Gasser, U. Qingpeng ZhangMelbourne Elijah Lowe Weiming Li Likit Preeyanon Funding Jiarong Guo Tim BromUSDA NIFA; NSF IOS; Kanchan PavangadkarBEACON. Eric McDonald 3. We practice open science!Be the change you wantEverything discussed here: Code: github.com/ged-lab/ ; BSD license Blog: http://ivory.idyll.org/blog (titus brown blog) Twitter: @ctitusbrown Grants on Lab Web site:http://ged.msu.edu/interests.html Preprints: on arXiv, q-bio:diginorm arxiv 4. Outline1.Computational challenges associated withsequencing DNA/RNA from non-modelorganisms.2.Three case studies: 1. Parasitic nematode genome assembly 2. Lamprey transcriptome assembly 3. Soil metagenome assembly3.Future directions 5. My interestsI work primarily on organisms of agricultural,evolutionary, or ecological importance, which tendto have poor reference genomes andtranscriptomes. Focus on: Improving assembly sensitivity to better recover genomic/transcriptomic sequence, often from weird samples. Scaling sequence assembly approaches so that huge assemblies are possible and big assemblies are straightforward. 6. There is quite a bit of life left to sequence & assemhttp://pacelab.colorado.edu/ 7. Weird biological samples: Single genome Hard to sequence DNA(e.g. GC/AT bias) Transcriptome Differential expression! High polymorphism data Multiple alleles Whole genome Often extreme amplified amplification bias (next slide) Metagenome (mixed Differential abundance microbial community)within community. 8. Single genome assembly is alreadychallenging -- 9. Once you start sequencingmetagenomes 10. Shotgun sequencing andcoverage Coverage is simply the average number of reads that overlapeach true base in genome.Here, the coverage is ~10 just draw a line straight down from thetop through all of the reads. 11. Random sampling => deep samplingneededTypically 10-100x needed for robust recovery (300 Gbp for human) 12. Various experimental treatments canalso modify coverage distribution. (MD amplified) 13. Non-normal coverage distributionslead to decreased assemblysensitivity Many assemblers embed a coverage model in their approach. Genome assemblers: abnormally low coverage iserroneous; abnormally high coverage is repetitivesequence. Transcriptome assemblers: isoforms should havesame coverage across the entire isoform. Metagenome assemblers: differing abundancesindicate different strains. Is there a different way? (Yes.) 14. Memory requirements (Velvet/Oases est) Bacterial genome 1-2 GB (colony) 500-1000 GB Human genome 100 GB + Vertebrate mRNA 100 GB Low complexity metagenome 1000 GB ++ High complexity metagenome 15. Practical memory measurements 16. K-mer based assemblers scalepoorlyWhy do big data sets require big machines??Memory usage ~ real variation + number of errorsNumber of errors ~ size of data set 17. Why does efficiency matter? It is now cheaper to generate sequence than it is to analyze it computationally! Machine time (Wo)man power/time More efficient programs allow better exploration of analysis parameters for maximizing sensitivity. Better or more sensitive bioinformatic approaches can be developed on top of more efficient theory. 18. Approach: Digital normalization(a computational version of library normalization)Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!!This 100x will consume disk space and, becauseof errors, memory. We can discard it for you 19. Digital normalization 20. Digital normalization 21. Digital normalization 22. Digital normalization 23. Digital normalization 24. Digital normalization 25. Digital normalization approach A digital analog to cDNA library normalization, diginorm: Is single pass: looks at each read only once; Does not collect the majority of errors; Keeps all low-coverage reads; Smooths out coverage of regions. 26. Coverage before digitalnormalization:(MD amplified) 27. Coverage after digital normalization:Normalizes coverageDiscards redundancyEliminates majority oferrorsScales assembly dramatAssembly is 98% identica 28. Digital normalization approach A digital analog to cDNA library normalization,diginorm is a read prefiltering approach that: Is single pass: looks at each read only once; Does not collect the majority of errors; Keeps all low-coverage reads; Smooths out coverage of regions. 29. Contig assembly is significantly more efficient andnow scales with underlying genome size Transcriptomes, microbial genomes incl MDA, and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results. 30. Some diginorm examples:1. Assembly of the H. contortus parasitic nematode genome, a high polymorphism/variable coverage problem.2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a big assembly problem.3. Assembly of two Midwest soil metagenomes, Iowa corn and Iowa prairie the impossible assembly problem. 31. 1. The H. contortus problem A sheep parasite. ~350 Mbp genome Sequenced DNA 6 individuals after whole genome amplification, estimated 10% heterozygosity (!?) Significant bacterial contamination.(w/Robin Gasser, Paul Sternberg, and ErichSchwarz) 32. H. contortus life cycleRefs.: Nikolaou and Gasser (2006), Int. J. Parasitol. 36, 859-868;Prichard and Geary (2008), Nature 452, 157-158. 33. The power of next-gen. sequencing: get 180x coverage ... and then watch yourassemblies never finishLibraries built and sequenced: 300-nt inserts, 2x75 nt paired-end reads 500-nt inserts, 2x75 and 2x100 nt paired-end reads 2-kb, 5-kb, and 10-kb inserts, 2x49 nt paired-end reads Nothing would assemble at all until filtered for basic quality.Filtering let 500 nt-sized inserts to assemble in a mere week.But 2+ kb-sized inserts would not assemble even then.Erich Schwarz 34. Assembly after digital normalization Diginorm readily enabled assembly of a 404 Mbpgenome with N50 of 15.6 kb; Post-processing with GapCloser andSOAPdenovo scaffolding led to final assembly of453 Mbp with N50 of 34.2kb. CEGMA estimates 73-94% complete genome. Diginorm helped by: Suppressing high polymorphism, esp in repeats; Eliminating 95% of sequencing errors; Squashing coverage variation from whole genome amplification and bacterial contamination 35. Assembly after digital normalization Diginorm readily enabled assembly of a 404 Mbpgenome with N50 of 15.6 kb; Post-processing with GapCloser andSOAPdenovo scaffolding led to final assembly of453 Mbp with N50 of 34.2kb. CEGMA estimates 73-94% complete genome. Diginorm helped by: Suppressing high polymorphism, esp inrepeats; Eliminating 95% of sequencing errors; Squashing coverage variation from whole genomeamplification and bacterial contamination 36. Next steps with H. contortus Publish the genome paper Identification of antibiotic targets for treatment inagricultural settings (animal husbandry). Serving as reference approach for a widevariety of parasitic nematodes, many of whichhave similar genomic issues. 37. 2. Lamprey transcriptome assembly. Lamprey genome is draft quality; low contiguity,missing ~30%. No closely related reference. Full-length and exon-level gene predictions are 50-75% reliable, and rarely capture UTRs / isoforms. De novo assembly, if we do it well, can identify Novel genes Novel exons Fast evolving genes Somatic recombination: how much are we missing,really? 38. Sea lamprey in the Great Lakes Non-native Parasite ofmedium to largefishes Causedpopulations ofhost fishes tocrashLi Lab / Y-W C-D 39. Transcriptome results Started with 5.1 billion reads from 50 different tissues. Digital normalization discarded 98.7% of them as redundant, leaving 87m (!) These assembled into more than 100,000 transcripts > 1kb Against known full-length, 98.7% agreement (accuracy); 99.7% included (contiguity) 40. Evaluating de novo lamprey transcriptome Estimate genome is ~70% complete (gene complement) Majority of genome-annotated gene sets recovered by mRNAseq assembly. Note method to recover transcript families w/o genome Gene families in Fraction inAssembly analysisGene families genomegenomemRNAseq assembly7200351632 71.7%reference gene set 8523 8134 95.4%combined7377353137 72.0%intersection 6753 6753100.0%only in mRNAseq assembly6525044884 68.8%only in reference gene set 1770 1500 84.7% (Includes transcripts > 300 bp) 41. Next steps with lamprey Far more complete transcriptome than the one predicted from the genome! Enabling studies in Basal vertebrate phylogeny Biliary atresia Evolutionary origin of brown fat (previously thoughtto be mammalian only!) Pheromonal response in adults 42. 3. Soil metagenome assembly Observation: 99% of microbes cannot easily becultured in the lab. (The great plate countanomaly) Many reasons why you cant or dont want toculture: Syntrophic relationships Niche-specificity or unknown physiology Dormant microbes Abundance within communities Single-cell sequencing & shotgun metagenomicsare two common ways to investigate microbialcommunities. 43. SAMPLING LOCATIONS 44. Investigating soil microbialecology What ecosystem level functions are present, andhow do microbes do them? How does agricultural soil differ from native soil? How does soil respond to climate perturbation? Questions that are not easy to answer without shotgun sequencing: What kind of strain-level heterogeneity is present inthe population? What does the phage and viral population look like? What species are where? 45. A Grand Challenge dataset(DOE/JGI) Total: 1,846 Gbp soil metagenome600 MetaHIT (Qin et. al, 2011), 578 Gbp500Basepairs of Sequencing (Gbp)400300 Rumen (Hess et. al, 2011), 268 Gbp200Rumen K-mer Filtered, 111 Gbp100 NCBI nr database,37 Gbp0Iowa,Iowa, Native Kansas,Kansas, Wisconsin, Wisconsin, Wisconsin, Wisconsin,ContinuousPrairie Cultivated NativeContinuousNativeRestored Switchgrass corn corn Prairie cornPrairiePrairie GAII HiSeq 46. Whoa, thats a lot of data Estimated sequencing required (bp, w/Illumina)5E+14 4.5E+144E+14 3.5E+143E+14 2.5E+142E+14 1.5E+141E+145E+130 E. coli HumanVertebrateHuman gut Marine Soilgenome genome transcriptome 47. Additional Approach for Metagenomes: Data partitioning (a computational version of cell sorting)Split reads into binsbelonging todifferent sourcespecies.Can do this basedalmost entirely onconnectivity ofsequences.Divide and conquerMemory-efficientimplementationhelps to scaleassembly.Pell et al., 2012, PNAS 48. Partitioning separates reads by genome.When computationally spiking HMP mock data with one E. coligenome (left) or multiple E. coli strains (right), majority of partitions contain reads from only a single genome (blue) vsmulti-genome partitions (green). * * Partitions containing spiked data indicated with aAdina Howe * 49. Assembly results for Iowa corn and prairie(2x ~300 Gbp soil metagenomes)PredictedTotal Total Contigs% Reads proteinAssembly (> 300 bp) Assembled coding2.5 bill4.5 mill 19%5.3 mill3.5 bill5.9 mill 22%6.8 millPutting it in perspective:Total equivalent of ~1200 bacterial genomes Adina HoweHuman genome ~3 billion bp 50. Resulting contigs are lowcoverage. Figure 11: Coverage (median basepair) dist ribut ion of assembled cont igs from soil met agenomes. 51. Strain variation? Can measure by readTop two allele frequencies mapping. Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%Position within contig 52. Tentative observations from our soilsamples: We need 100x as much data Much of our sample may consist of phage. Phylogeny varies more than functionalpredictions. We see little to no strain variation within oursamples Not bulk soil -- Very small, localized, and low coverage samples We may be able to do selective really deepsequencing and then infer the rest from 16s. Implications for soil aggregate assembly? 53. Additional projects -- Bacterial symbionts of bone eating worms w/ShanaGoffredi. Mixed sample => low-complexity metagenome Amplified DNA => uneven coverage ~50% complete => 97% complete assembly Molgulid ascidians w/Billie Swalla and LionelChristiaen. High heterozygosity transcriptomes and genomes Chick genome & transcriptome assembly w/HansCheng, Jerry Dodgson, and Wes Warren. Hard to sequence/assemble microchromosomes 54. Digital normalization is enabling. This is a very powerful technique for fixing weirdsamples. (and all samples are weird.) A number of real world projects are usingdiginorm successfully (~6-10 in my lab; ~70-80?overall). A diginorm-derived procedure is now arecommended part of the Trinity mRNAseqassembler. Diginorm is1. Very computationally efficient;2. Always cheaper than running an assembler in the first place3. Almost always improves results (** in our hands ;) 55. Next steps Computation is a limiting factor in dealing with NGS Assembly is slow, compute intensive, and sensitiveto parameters Mapping and SNP calling is parameter sensitive Exploring hypotheses and finding the right balancebetween sensitivity and specificity often requiresparameter sweeps executing one or morepieces of software with multiple parameters. Can we attack this problem with lossy compression? 56. Digital normalization retains information, whilediscarding data and errors 57. Lossy compressionhttp://en.wikipedia.org/wiki/JPEG 58. Lossy compressionhttp://en.wikipedia.org/wiki/JPEG 59. Lossy compressionhttp://en.wikipedia.org/wiki/JPEG 60. Lossy compressionhttp://en.wikipedia.org/wiki/JPEG 61. Lossy compressionhttp://en.wikipedia.org/wiki/JPEG 62. ~2 GB 2 TB of single-chassis RAM "Information" "Information" Raw data"Information"Analysis "Information" (~10-100 GB)~1 GB "Information"Database &integrationCan we use lossy compression approaches to makedownstream analysis faster and better? (Yes.)Embarking upon project to build theory around digitalnormalization so that we can use it as a prefilter formapping and variant calling. We have alsoimplemented error correction, which enables easyquantification of references by reads. Streaming low-mem distributable prefilters for all 63. Error correction for metagenomic andmRNAseq data.Most error correction algorithms rely onassumption of uniform coverage; ours does not.Ours is also streaming. Jason Pell 64. Error correction via graph alignment 65. Where next? Assembly in the cloud! Study and formalize paired/end mate pair handling indiginorm. Web interface to run and evaluate assemblies. New methods to evaluate and improveassemblies, including a meta assembly approach formetagenomes. Fast and efficient error correction of sequencing data Can also address assembly of high polymorphismsequence, allelic mapping bias, and others; Can also enable fast/efficient storage and search of nucleicacid databases. 66. Four+ papers on our work, soon. 2012 PNAS, Pell et al., pmid 22847406 (partitioning). Submitted, Brown et al., arXiv:1203.4802 (digitalnormalization). Submitted, Howe et al, arXiv: 1212.0159 (artifactremoval from Illumina metagenomes). Submitted, Howe et al., arXiv: 1212.2832 Assembling large, complex environmentalmetagenomes. In preparation, Zhang et al. efficient k-mer counting. 67. Thanks!Everything discussed here: Code: github.com/ged-lab/ ; BSD license Blog: http://ivory.idyll.org/blog (titus brown blog) Twitter: @ctitusbrown Grants on Lab Web site:http://ged.msu.edu/interests.html Preprints: on arXiv, q-bio:diginorm arxiv

Documents

2013 duke-talk