Microbiome Analysis from sample to data MGL Users Group June 18, 2014

Assaying Microbial Content One of the most common approaches is to sequence 16S ribosomal RNA amplicons. Another option is shotgun sequencing of the community, assembling the sequences, and assigning the identified genes to metabolic pathways. When finer detail is required, 16S is most often sequenced first, followed by generalized sequencing for finer species resolution (or validation).

16S rRNA Sequencing Databases of all known 16S sequences have been compiled (Silva, GreenGenes, others). Sequence either targeted amplicons of variable regions or the whole 16S gene. Isolate gDNA, then PCR amplify using universal 16S primers. [Figure: primers, showing primer pair 1, primer pair 2, primer pair 3]

16S rRNA Sequencing Shear amplicon using Covaris focused acoustics

The Microbiome Library Libraries have adapter sequences at both ends, used for PCR and sequencing priming. P1 is the universal forward primer sequence. P2 has an embedded barcode sequence. Between the two adapter ends is the DNA to be sequenced; sequencing is primed from the P1 forward and barcode regions (green arrows). Note: the adapter sequences DIFFER from Illumina's, which matters if other preparations are to be adapted to this platform.

Bead Preparation from Libraries The pool of libraries is subjected to emulsion PCR to populate beads. Oil micro-reactors are titrated such that each bead is populated by a single template. Unpopulated beads are removed in subsequent cleanup.
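Why titration matters can be illustrated with a small calculation. The sketch below assumes that template loading into micro-reactors follows a Poisson distribution, which is a common textbook model and not something stated on the slide; the mean of 0.1 templates per reactor is a hypothetical value chosen only for illustration.

```python
import math

def loading_fractions(mean_templates):
    """Return (empty, single, multiple) reactor fractions under a Poisson model.

    Assumption: the number of templates per micro-reactor is Poisson
    distributed with the given mean. This is an illustrative model only.
    """
    lam = mean_templates
    p_empty = math.exp(-lam)            # no template, bead stays unpopulated
    p_single = lam * math.exp(-lam)     # exactly one template, the desired case
    p_multi = 1.0 - p_empty - p_single  # two or more templates, mixed signal
    return p_empty, p_single, p_multi

# Hypothetical dilution: on average 0.1 templates per reactor.
p_empty, p_single, p_multi = loading_fractions(0.1)
single_among_populated = p_single / (p_single + p_multi)
```

Under this model a dilute loading leaves most reactors empty, which is consistent with the cleanup step that removes unpopulated beads, while keeping most populated beads single-template.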

16S rRNA Sequencing Bead Preparation from Libraries Nick translate, amplify, quantitate.

Slide Deposition of enriched beads Beads are flowed into, and then adhered to, the FlowChip lanes. Optimum density is 160 million beads per lane.

ABI SOLiD 5500xl 16S rRNA Sequencing

The resulting library is sequenced. We sequence 75 bp from one end using Exact Call Chemistry (16S sequencing is most commonly done on a longer-read platform such as 454 or MiSeq). We generate millions of reads (most commonly, thousands are generated). Reads are aligned to the database of 16S sequences to the finest level of resolution possible. We keep only uniquely aligned reads.
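The "uniquely aligned" filter can be sketched in a few lines. This is a minimal illustration, not the pipeline used for the runs shown here: the input format (pairs of read ID and reference ID) and the function name `uniquely_aligned` are hypothetical simplifications of real aligner output.

```python
from collections import defaultdict

def uniquely_aligned(alignments):
    """Keep reads that align to exactly one reference sequence.

    `alignments` is an iterable of (read_id, reference_id) pairs, a
    hypothetical simplification of what a short-read aligner reports.
    """
    hits = defaultdict(set)
    for read_id, ref_id in alignments:
        hits[read_id].add(ref_id)
    # A read is kept only when all of its alignments point at one reference.
    return {read: next(iter(refs)) for read, refs in hits.items() if len(refs) == 1}

# Toy input: r2 hits two references and is discarded.
example = [("r1", "367523"), ("r2", "187144"), ("r2", "310669"), ("r3", "3918")]
unique = uniquely_aligned(example)
```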

Data Analysis - OTUs Sequences are often reported in OTUs (Operational Taxonomic Units). Because related 16S sequences share high levels of identity, an identity threshold is typically applied and similar sequences are collapsed into OTU sequences (commonly at 97% identity). As a result, the level of taxonomic resolution for individual OTU sequences can vary, even at the same identity threshold.
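Collapsing at an identity threshold can be sketched as a greedy clustering. This is a toy version under loud assumptions: identity is computed position by position on equal-length sequences, whereas real OTU-picking tools align sequences first, and the function names are made up for the example.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid becomes a new centroid.
    """
    centroids = []
    assignment = []
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                assignment.append(i)
                break
        else:
            centroids.append(s)
            assignment.append(len(centroids) - 1)
    return centroids, assignment

# Toy 100-base sequences: `near` differs at 2 positions, `far` at 4.
base = "ACGT" * 25
near = "TT" + base[2:]
far = "TTTTT" + base[5:]
centroids, assignment = greedy_otus([base, near, far])
```

With the 97% threshold, `near` (98% identical) joins the first OTU and `far` (96% identical) starts a second one.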

OTU examples
367523   k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; f__Flavobacteriaceae; g__Flavobacterium; s__
187144   k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
836974   k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Cercozoa; f__; g__; s__
310669   k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
823916   k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Moraxellaceae; g__Enhydrobacter; s__
878161   k__Bacteria; p__Acidobacteria; c__Acidobacteriia; o__Acidobacteriales; f__Acidobacteriaceae; g__Terriglobus; s__
3064251  k__Bacteria; p__Verrucomicrobia; c__Opitutae; o__Puniceicoccales; f__Puniceicoccaceae; g__Puniceicoccus; s__
1138555  k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Caldicoprobacteraceae; g__Caldicoprobacter; s__
3918     k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
339472   k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__Rhodospirillaceae; g__; s__
4457583  k__Bacteria; p__; c__; o__; f__; g__; s__
(k=kingdom; p=phylum; c=class; o=order; f=family; g=genus; s=species)
Typically lack species-level resolution (as seen in the example subset), but some get down to Order, Family, or even Genus. A few really not identifiable.
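The lineage strings above are easy to parse mechanically. The sketch below splits one into named ranks and reports the deepest rank that carries a name; the function names are hypothetical, and the only format assumed is the `prefix__Name` fields separated by semicolons seen in the examples.

```python
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}

def parse_taxonomy(lineage):
    """Split 'k__Bacteria; p__Firmicutes; ...' into {rank_name: taxon or None}."""
    parsed = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        parsed[RANKS[prefix]] = name or None  # empty name means unresolved
    return parsed

def deepest_rank(lineage):
    """Return the most specific rank that carries a name, or None."""
    deepest = None
    for rank, name in parse_taxonomy(lineage).items():
        if name:
            deepest = rank
    return deepest

# Three lineages copied from the examples above.
flavo = ("k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; "
         "f__Flavobacteriaceae; g__Flavobacterium; s__")
clost = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__"
unknown = "k__Bacteria; p__; c__; o__; f__; g__; s__"
resolution = [deepest_rank(x) for x in (flavo, clost, unknown)]
```

For these three OTUs the deepest named ranks are genus, order, and kingdom respectively, which matches the spread of resolution described on the slide.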

Gut microbiota Composed primarily of Firmicutes and Bacteroidetes. Balance between these populations has been linked to obesity in mice. Example using a public dataset, treated to simulate short 75 bp random reads. 3 simulations on the same dataset. [Figure labels: TestLib1, TestLib2, TestLib3]
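Simulating short random reads from longer sequences can be sketched as follows. This is a minimal illustration and not the procedure used for the TestLib simulations: reads are drawn from uniformly random start positions, the amplicon is a random placeholder rather than a real 16S sequence, and sequencing errors are not modeled.

```python
import random

def simulate_reads(amplicon, n_reads, read_len=75, seed=0):
    """Draw fixed-length reads from uniformly random start positions."""
    if len(amplicon) < read_len:
        raise ValueError("amplicon shorter than read length")
    rng = random.Random(seed)
    last_start = len(amplicon) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(amplicon[start:start + read_len])
    return reads

# Placeholder sequence, not a real 16S amplicon.
_rng = random.Random(1)
toy_amplicon = "".join(_rng.choice("ACGT") for _ in range(500))
reads = simulate_reads(toy_amplicon, n_reads=10)
```

Fixing the seed makes a simulation repeatable; changing it gives an independent draw from the same dataset.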

Initial trial run

... Hundreds of lines

Initial run results 4 mice run: 2 WT; 2 KO Phylum resolution N=7.5 Million [Figure panels: E1, E2, E3, E4]

Initial run results 4 mice run: 2 WT; 2 KO Class resolution N=7.5 Million

Initial run results 4 mice run: 2 WT; 2 KO Order resolution N=7.4 Million

Initial run results 4 mice run: 2 WT; 2 KO Family resolution N=2.2 Million

Initial run results 4 mice run: 2 WT; 2 KO Genus resolution N=1.1 Million (Species N=340K; More complex)
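The N values on these slides shrink as the rank gets more specific. One plausible reading, offered here as an assumption rather than a description of the actual analysis, is that a read is counted at a rank only when its OTU carries a name at that rank. The sketch below collapses OTU read counts to a chosen rank under that assumption; the read counts are invented for illustration, and the two lineages are taken from the OTU examples.

```python
from collections import Counter

RANK_PREFIXES = ["k", "p", "c", "o", "f", "g", "s"]

def collapse_to_rank(otu_counts, taxonomy, rank_prefix):
    """Sum read counts per taxon at one rank, skipping OTUs unnamed at that rank."""
    idx = RANK_PREFIXES.index(rank_prefix)
    totals = Counter()
    for otu_id, n_reads in otu_counts.items():
        fields = [f.strip() for f in taxonomy[otu_id].split(";")]
        name = fields[idx].partition("__")[2]
        if name:  # an empty name means the OTU is unresolved at this rank
            totals[name] += n_reads
    return totals

taxonomy = {
    "367523": ("k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; "
               "f__Flavobacteriaceae; g__Flavobacterium; s__"),
    "187144": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__",
}
otu_counts = {"367523": 100, "187144": 300}  # hypothetical read counts

by_phylum = collapse_to_rank(otu_counts, taxonomy, "p")
by_genus = collapse_to_rank(otu_counts, taxonomy, "g")
```

In this toy case all 400 reads are counted at phylum level, but only 100 survive to genus level because the Clostridiales OTU has no genus name.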

For more information: Website: mgl.nichd.nih.gov List serv: MGL-USERS-L Email: [email protected] Phone: 301-402-4563 Walk-in: Bldg 10/Rm 9D41

Typical Approach Use long reads from a whole, intact amplicon (a few thousand reads are typically used). Perform trimming, remove chimeric sequences, join overlaps in paired ends, etc. Compare sequences to a database through BLAST or against a prepared multiple sequence alignment. Compare / clean data. Assign and resolve taxonomy, and describe the distribution. Compare populations across conditions, etc. (statistical digging).

Alternative approach using short reads Amplify 16S or amplicon as normal. Randomly shear to construct a typical short-read library comprising random starts/ends. Generate millions of reads. Using short-read aligners, assign to OTUs only those reads that map unambiguously. Analyze normally from OTU populations. This approach is more wasteful, but in practice it performs just as well. It uses higher-throughput instruments instead of lower-capacity long-read platforms.
