Microbiome analysis: From experimental design to integrating other omics data
Yuan-Ming Yeh, Ph.D.
Genomic Medicine Core Laboratory,
Chang Gung Memorial Hospital, Linkou
2019.05.17 @ CGU
First Peek at Microbes
Bacterial morphological diversity (from http://ag.arizona.edu/plp/courses/plp329/micdivintro.ppt)
Types of Microbes
https://www.slideshare.net/MohammedInzamamuddin/microbes-of-extreme-environment
There are 5 main types of microbes
Microbes Run the World
http://www.mansfield.ohio-state.edu/~sabedon/lectures/index.html
• The Earth is a “microbial planet”
– microorganisms predate other life forms (have evolved for some 3.8 billion years)
– they are the most abundant, both in terms of numbers and distribution
– microbial activities have profound influence on the integrity and functioning of global ecosystems.
• “...The diversity and range of their environmental adaptations indicate that microbes long ago ‘solved’ many problems for which scientists are still actively seeking solutions.” (Microbial Genome Program, U.S. Department of Energy; http://microbialgenomics.energy.gov/)
Microbes and You
• Microbes live on every part of your body that normally comes in contact with the outside world (the deep lungs and stomach are exceptions)
• You are “what you eat” – Human gut microbes
• “Good” and “bad” microorganisms
Microbes and Industry
• Industry: Fermentation products (ethanol, acetone, etc.)
• Food: Wine, cheese, yogurt, bread, half-sour pickles, etc.
• Biotech: Recombinant products (e.g., human insulin, vaccines)
• Environment: Bioremediation
• Bugs+Plus: to digest oil and other petroleum derivatives.
Potential Microbial Applications
• Cleanup of toxic-waste sites worldwide.
• Production of novel therapeutic and preventive agents and pathways.
• Energy generation and development of renewable energy sources (e.g., methane and hydrogen).
• Production of chemical catalysts, reagents, and enzymes to improve efficiency of industrial processes.
• Management of environmental carbon dioxide, which is related to climate change.
• Detection of disease-causing organisms and monitoring of the safety of food and water supplies.
• Use of genetically altered bacteria as living sensors (biosensors) to detect harmful chemicals in soil, air, or water.
• Understanding of specialized systems used by microbial cells to live in natural environments with other cells.
(http://microbialgenomics.energy.gov/)
Metagenomics history
Craig Venter
Celera Genomics
The Institute for Genomic Research
The J. Craig Venter Institute
Global Ocean Sampling Expedition (GOS)
The pilot project, conducted in the Sargasso Sea, found DNA from nearly 2000 different species, including 148 types of bacteria never before seen.
shotgun sequencing
The big picture
Explore the relationship between microbes and their habitat
To accomplish this, we use a series of experimental and computational techniques to make inferences about the community:
- Marker genes
- Metagenomes
- Metatranscriptomes
- Metaproteomes
- Metametabolomes
- “Culturomes”
Bioinformatics.ca
Human gut microbiome: 2-3 million genes
Typically > 160 “species” at any given sampling time
Host: ~25,000 genes
Qin et al., Nature (2010)
The Human Microbiome
The Human Microbiome
Darryl Leja, NHGRI
There are both tremendous similarities and differences among the bacterial species that predominate at different sites on the human body. Colors represent different phyla and families of bacteria.
From Molecular Biology to Microbiome
Garrett 2015; Claesson et al. 2017
Hypothesis:
• changes in the microbiome (longitudinal analysis)
• whether microbiome differences correlate with clinical phenotypes (cross-sectional or cohort analysis)
How many samples per group?
Statistics:
• suggestion 5x samples, minimum 3x samples
Biologics:
• plant, soil, water => 10x samples
• human gut => 20x samples
[Figure labels: Oxygen, Diet, pH, Moisture, Exercise, Light, Supplement]
Sample types / collection
Claesson et al. 2017
• Accelerated by NGS: at first predominantly 454 sequencing because of its longer read length, now more often Illumina-based chemistry.
• Organisms no longer need to be cultivated and cloned: culture-independent insight.
• Direct sequencing from the environment as a “community”.
• You can pool multiple samples together.
NGS and metagenomics
Not all microbes can be cultured
Metagenomic reads vs 16S rRNA for microbial diversity identification
Metagenome
DNA Isolation
Fragmentation of DNA
Metagenomic Reads
Amplification of 16S rRNA
16S rRNA from multiple species
Microbial diversity
16S rRNA – a “gold standard” for microbial molecular identification
• Universal
• Highly conserved
• Long enough (~1500 bp) to provide significant discrimination between many species
• Structural information can guide alignment and phylogenetic reconstruction
• Many species now represented in the database
16S rRNA gene sequencing
Earlier: by sequencing the whole gene
Now: by sequencing short variable regions
Limitations:
• Insufficient and underestimated diversity
16S ribosomal RNA
• highly conserved between different species of bacteria and archaea
• whereas the rest of the genetic content varies greatly across species
• 16S rRNA can be used for taxonomical classification
Analysis approach
Taxonomy independent analysis
• Reads are grouped into operational taxonomic units (OTUs) based on a specified sequence variation.
Taxonomy dependent analysis
• Assignment at the level of domain, phylum, class, order, family, genus, and species
• Requires a reference database
Taxonomy independent analysis
• Group reads into OTUs based on a certain imposed similarity threshold
• In the study of bacteria, 97% seems like a good starting point
• Species dependent, gene dependent; the threshold may vary
• 1 OTU = 1 organism
Extract an OTU representative sequence
• Most common sequence
• Sequence that has minimum difference to all other sequences in the same OTU
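The idea of grouping reads under a similarity threshold and taking the most common sequence as the representative can be sketched in a few lines. This is only an illustration under strong assumptions: reads are equal-length and already aligned, so identity is simply the fraction of matching positions, and the function names `identity` and `greedy_otus` are hypothetical. Real OTU pickers use proper alignment and more careful heuristics.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes equal-length, pre-aligned reads)."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Greedy clustering: the most abundant unique sequences become OTU representatives."""
    counts = Counter(reads)
    otus = {}  # representative sequence -> number of reads assigned to it
    # Visit unique sequences from most to least abundant (ties broken alphabetically).
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in otus:
            if identity(seq, rep) >= threshold:
                otus[rep] += n  # close enough: join the existing OTU
                break
        else:
            otus[seq] = n  # no representative within the threshold: new OTU
    return otus


# Toy data: b differs from a at 2 of 100 positions (98%), c at 10 of 100 (90%).
a = "A" * 100
b = "A" * 98 + "CC"
c = "A" * 90 + "C" * 10
example_otus = greedy_otus([a] * 5 + [b] * 2 + [c] * 3)
```

With the toy data, `b` joins the OTU represented by `a`, while `c` falls below 97% and starts its own OTU.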
Taxonomy dependent analysis
Classify sequences
• BLAST: simply BLAST what you have (phylotyping)
• MEGAN
• Online RDP classifier (Ribosomal Database Project)
– RDP 10.26 (Release 11, Update 5 consists of 3,356,809 aligned and annotated 16S rRNA sequences)
– Limited by the number of reads you can submit
• Online Greengenes classifier based on NAST alignment
– Requires a pre-aligned dataset
– Limited by the number of reads you can submit
One common phylotyping workflow
• Run the blast aligners with the reads against the NCBI bacterial database (can be very time consuming)
• Use MEGAN – Metagenome Analyzer to process the results
http://ab.inf.uni-tuebingen.de/software/megan/
The mothur package
• Primarily OTU based but it has phylotyping functionality built in
http://www.mothur.org/
The QIIME package
• Takes users from their raw data through OTU picking, taxonomic assignment, and construction of phylogenetic trees from representative sequences of OTUs, and on to downstream statistical analysis, visualization, and production of publication-quality graphics.
http://qiime.org/
Data Quality
• Sequencing errors
– Introduced in workup
– Error rates, error type (PacBio: 10% random; Illumina: 0.1% substitution)
• Chimeras
– Amplification artifacts, cloning of restriction fragments
• Metadata Acquisition and Availability
– Studying a complex ecosystem: multifactorial
– Public datasets, often with metadata embedded in publications or simply not available
Comparability / Reproducibility
• 16S: different V regions give different results
• Different sequencing platforms / sampling conditions ALSO give different results
– Eisen paper about different recoveries under different conditions
• Workflow complexity / plethora of tools
– Difficult to evaluate tools for microbiome analysis
– Ground truth hard to establish for microbiome samples
– Use of mock communities or simulated data
Linkage and resolution
• Strain-level diversity in metagenomes will often be missed due to difficulties in differentiating minor variants from sequencing errors
• Should you assemble metagenomic reads?
– Longer sequences have more information
– By assembling the reads, you could create chimeric contigs consisting of DNA fragments from different (non-clonal) organisms
Taxonomy and OTUs
RDP taxonomic predictions + taxonomy in general
OTUs – arbitrary, quasi-phylogenetic
[Figure labels: Seed sequences, ???, De novo, 97%]
Chimeras
https://doi.org/10.1101/074252
Chimeras are sequences formed from two or more biological sequences joined together. Amplicons with chimeric sequences can form during PCR. Chimeras are rare with shotgun sequencing, but are common in amplicon sequencing when closely related sequences are amplified. Although chimeras can be formed by a number of mechanisms, the majority of chimeras are believed to arise from incomplete extension. During subsequent cycles of PCR, a partially extended strand can bind to a template derived from a different but similar sequence.
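As a rough illustration of the incomplete-extension mechanism, the sketch below checks whether a query is exactly the left part of one parent joined to the right part of another. It assumes equal-length, pre-aligned sequences and exact matches; the function name `find_two_parent_breakpoint` is hypothetical, and this is not how production chimera filters score candidates.

```python
from itertools import permutations


def find_two_parent_breakpoint(query, parents):
    """Return (left_parent, right_parent, breakpoint) if the query is an exact
    two-parent chimera of the given parents, otherwise None."""
    if query in parents:
        return None  # identical to a parent, so not a chimera
    for left, right in permutations(parents, 2):
        if not (len(left) == len(right) == len(query)):
            continue  # the sketch only handles equal-length sequences
        for i in range(1, len(query)):
            # Prefix copied from one template, suffix from the other.
            if query == left[:i] + right[i:]:
                return left, right, i
    return None


parent_a = "ACGTACGTAC"
parent_b = "ACGTTCGTTC"
# Simulated incomplete extension: the strand starts on parent_a and finishes on parent_b.
chimera = parent_a[:6] + parent_b[6:]
hit = find_two_parent_breakpoint(chimera, [parent_a, parent_b])
```

Because the parents agree around the junction, the reported breakpoint is only one of several equivalent positions, which is one reason chimeras of closely related sequences are hard to pin down.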
UNOISE
https://doi.org/10.1101/081257
http://dx.doi.org/10.1038/nmeth.2604
Schematic of the UNOISE2 denoising strategy
UPARSE
MiSeq 2x250 16S V4 Exercise
• Description
This example shows a typical analysis pipeline for MiSeq paired reads. There are four samples: Human, Mouse, Soil and Mock with ~4k reads each. Human and Mouse are fecal samples. Data is from Kozich et al. 2013.
Prepare material
• “Usearch” program upload
• Reads:
wget https://drive5.com/downloads/ex_miseq_reads.tar.gz
• Sintax reference database:
wget https://drive5.com/sintax/rdp_16s_v16.fa.gz
STEP1
• # Merge paired reads
• # Add sample name to read label (-relabel option)
• # Pool samples together in all.merged.fq (Linux cat command)
for Sample in Mock Soil Human Mouse
do
    $usearch11 -fastq_mergepairs ../data/${Sample}*_R1.fq \
        -fastqout $Sample.merged.fq -relabel $Sample.
    cat $Sample.merged.fq >> all.merged.fq
done
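To see what merging a read pair means, here is a simplified sketch: the reverse read is reverse-complemented and joined to the forward read at their longest exact overlap. The function names and the `min_overlap` parameter are hypothetical, and the exact-match rule is an assumption made for brevity; real mergers tolerate mismatches and use quality scores to resolve them.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence written with A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward: str, reverse: str, min_overlap: int = 5):
    """Merge a read pair at the longest exact overlap, or return None."""
    rc = reverse_complement(reverse)
    # Try the longest possible overlap first, then shrink.
    for k in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-k:] == rc[:k]:
            return forward + rc[k:]
    return None  # no overlap of at least min_overlap bases


# Toy amplicon of 20 bases, read 14 bases from one end and 12 from the other.
fragment = "ACGTTGCAAGGCTTAACCGG"
fwd = fragment[:14]
rev = reverse_complement(fragment[8:])
merged = merge_pair(fwd, rev)
```

In this toy case the two reads share six bases, so the merged sequence reproduces the original fragment.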
STEP2
• # Strip primers (V4F is 19, V4R is 20)
$usearch11 -fastx_truncate all.merged.fq -stripleft 19 -stripright 20 \
    -fastqout stripped.fq
• # Quality filter
$usearch11 -fastq_filter stripped.fq -fastq_maxee 1.0 \
    -fastaout filtered.fa -relabel Filt
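The quality filter above thresholds on expected errors. A minimal sketch of that quantity, assuming Phred+33 encoded quality strings: each base with quality Q contributes an error probability of 10^(-Q/10), and the read is kept when the sum does not exceed the limit. The function names are hypothetical and this is an illustration of the calculation, not the tool's code.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for a Phred-encoded quality string."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_filter(quality: str, max_ee: float = 1.0) -> bool:
    """Keep a read only if its expected number of errors is at most max_ee."""
    return expected_errors(quality) <= max_ee


high_quality = "I" * 250  # 'I' is Q40 in Phred+33: 250 bases, about 0.025 expected errors
low_quality = "+" * 20    # '+' is Q10 in Phred+33: 20 bases, about 2 expected errors
```

Here `passes_filter(high_quality)` is true and `passes_filter(low_quality)` is false, even though the low-quality read is much shorter.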
STEP3
• # Find unique read sequences and abundances
$usearch11 -fastx_uniques filtered.fa -sizeout -relabel Uniq -fastaout uniques.fa
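Finding unique sequences and their abundances (dereplication) amounts to counting identical reads and sorting by count. The sketch below does this in memory; the function name `dereplicate` is hypothetical, and the `;size=N;` label style is an assumption modelled on size annotations, not a guaranteed match for the tool's output.

```python
from collections import Counter


def dereplicate(sequences, prefix="Uniq"):
    """Return (label, sequence) pairs for unique sequences, most abundant first."""
    counts = Counter(sequences)
    # Sort by decreasing abundance; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (f"{prefix}{rank};size={n};", seq)
        for rank, (seq, n) in enumerate(ordered, start=1)
    ]


uniques = dereplicate(["ACGT", "TTTT", "ACGT", "ACGT", "GGCC", "TTTT"])
```

Sorting by abundance matters downstream, because clustering and denoising typically treat the most abundant sequences as the most trustworthy.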
• # Make 97% OTUs and filter chimeras
$usearch11 -cluster_otus uniques.fa -otus otus.fa -relabel Otu
• # Denoise: predict biological sequences and filter chimeras
$usearch11 -unoise3 uniques.fa -zotus zotus.fa
##################################################
# Downstream analysis of OTU sequences & OTU table
# Can do this for both OTUs and ZOTUs, here do
# just OTUs to keep it simple.
##################################################
• # Make OTU table
$usearch11 -otutab all.merged.fq -otus otus.fa -otutabout otutab_raw.txt
• # Random subsampling to 0.5k reads / sample
$usearch11 -otutab_rare otutab_raw.txt -sample_size 500 -output otutab.txt
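Random subsampling to a fixed depth can be pictured as drawing reads without replacement from one sample's OTU counts. In this sketch the function name `rarefy` and the `seed` parameter are hypothetical, and returning `None` for a sample with fewer reads than the requested depth is an assumption of the sketch, not documented tool behaviour.

```python
import random
from collections import Counter


def rarefy(sample_counts, depth, seed=0):
    """Subsample one sample's OTU counts to exactly `depth` reads."""
    total = sum(sample_counts.values())
    if total < depth:
        return None  # assumption: samples that are too shallow are dropped
    # Expand the counts into one entry per read, then draw without replacement.
    pool = [otu for otu, n in sample_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


soil = {"Otu1": 600, "Otu2": 300, "Otu3": 100}
soil_500 = rarefy(soil, 500)
```

Fixing the seed makes the subsample reproducible, which helps when diversity estimates are compared across reruns.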
• # Alpha diversity
$usearch11 -alpha_div otutab.txt -output alpha.txt
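Alpha diversity summarizes diversity within a single sample. Two common measures are sketched here under the assumption that the input is a list of OTU counts for one sample: richness (the number of observed OTUs) and the Shannon index with natural logarithms. The function names are hypothetical, and a full alpha-diversity report usually contains more metrics than these two.

```python
import math


def richness(counts):
    """Number of OTUs with at least one read."""
    return sum(1 for n in counts if n > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


even_sample = [10, 10, 10, 10]   # four equally abundant OTUs
skewed_sample = [37, 1, 1, 1]    # same richness, dominated by one OTU
```

Both samples have a richness of 4, but the even sample has the higher Shannon index (ln 4 for four equally abundant OTUs), which shows why richness alone can hide differences in community structure.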
• # Make OTU tree
$usearch11 -calc_distmx otus.fa -tabbedout mx.txt -maxdist 0.2 -termdist 0.3
$usearch11 -cluster_aggd mx.txt -treeout otus.tree
• # Beta diversity
mkdir beta/
$usearch11 -beta_div otutab.txt -tree otus.tree -filename_prefix beta/
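Beta diversity compares samples with each other. As one concrete example, the sketch below computes the Bray-Curtis dissimilarity between two samples given as OTU-to-count dictionaries. The function name is hypothetical, and Bray-Curtis is chosen here only as a familiar metric; tree-based metrics such as UniFrac need the phylogeny as well and are not shown.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: 0 for identical samples, 1 for no shared OTUs."""
    otus = set(sample_a) | set(sample_b)
    shared = sum(min(sample_a.get(o, 0), sample_b.get(o, 0)) for o in otus)
    total = sum(sample_a.values()) + sum(sample_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return 1 - 2 * shared / total


human = {"Otu1": 6, "Otu2": 4}
mouse = {"Otu1": 2, "Otu3": 8}
distance = bray_curtis(human, mouse)
```

Because the metric uses raw counts, it is normally applied after samples have been brought to the same depth.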
• # Rarefaction
$usearch11 -alpha_div_rare otutab.txt -output rare.txt
• # Predict taxonomy
$usearch11 -sintax otus.fa -db rdp_16s_v16.fa -strand both \
    -tabbedout sintax.txt -sintax_cutoff 0.8
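The effect of a confidence cutoff such as 0.8 can be illustrated on a single prediction string. The sketch assumes predictions are written as comma-separated `rank:name(confidence)` fields and that the lineage is truncated at the first rank below the cutoff; both the format and the truncation rule are assumptions made for illustration, and `apply_cutoff` is a hypothetical helper.

```python
import re

# One field such as "p:Firmicutes(0.95)": rank letter, name, confidence.
_FIELD = re.compile(r"^([a-z]):(.+)\((\d*\.?\d+)\)$")


def apply_cutoff(prediction: str, cutoff: float = 0.8) -> str:
    """Keep the leading ranks whose confidence is at least `cutoff`."""
    kept = []
    for field in prediction.split(","):
        match = _FIELD.match(field.strip())
        if match is None:
            raise ValueError(f"unrecognized field: {field!r}")
        rank, name, confidence = match.group(1), match.group(2), float(match.group(3))
        if confidence < cutoff:
            break  # stop at the first rank that is not confident enough
        kept.append(f"{rank}:{name}")
    return ",".join(kept)


example = "d:Bacteria(1.00),p:Firmicutes(0.95),g:Bacillus(0.62)"
trimmed = apply_cutoff(example)
```

For the example string, the genus is dropped and only the domain and phylum are kept, so the OTU is reported at the deepest rank that is still well supported.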
• # Taxonomy summary reports
$usearch11 -sintax_summary sintax.txt -otutabin otutab.txt -rank g -output genus_summary.txt
$usearch11 -sintax_summary sintax.txt -otutabin otutab.txt -rank p -output phylum_summary.txt
• # Find OTUs that match mock sequences
$usearch11 -uparse_ref otus.fa -db ../data/mock.merged.fq -strand plus \
    -uparseout uparse_ref.txt -threads 1