REFERENCES
de Bruijn NG (1946). A Combinatorial Problem. Koninklijke Nederlandse Akademie v. Wetenschappen 49: 758–764.
Eddy SR (2009). A new generation of homology search tools based on probabilistic inference. Genome Informatics 23: 205–211. doi:10.1142/9781848165632_0019.
Hart PE, Nilsson NJ, Raphael B (1968). A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics SSC4 4(2): 100–107. doi:10.1109/TSSC.1968.300136.
Yen JY (1971). Finding the K Shortest Loopless Paths in a Network. Management Science Theory Series (July) 17(11): 712–716. Published by: INFORMS. http://www.jstor.org/stable/2629312
Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT (in press). Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. http://arxiv.org/abs/1112.4193.
Fig. 2: Combined Graph (CG) GLBRC DATASETS
Two miscanthus, two switchgrass samples
500M 100-bp reads in each sample
Assembled separately with Xander searching for nifH
Examined nifH group composition in the four samples
Xander: Gene-Targeted Metagenomic Assembly
1Center for Microbial Ecology; 2Department of Computer Science and Engineering; 3Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI 48824-4320
Contact: [email protected]
Jordan A. Fish1, Qiong Wang1, Yanni Sun2, C. Titus Brown2,3, James M. Tiedje1,3 and James R. Cole1
Very large metagenomes tax the abilities of current-generation short-read assemblers. In addition to space and time complexity issues, most assemblers are not designed to correctly treat reads from closely related populations of organisms. We are developing a gene-targeted approach for metagenome assembly.
In this approach, information about specific genes is used to guide assembly, and gene annotation occurs concomitantly with assembly. The approach combines a space-efficient De Bruijn graph representation of the reads with a protein profile Hidden Markov Model (HMM) for the gene(s) of interest. To limit the search, we use a heuristic to first identify nucleotide k-mers that translate to peptides found in a set of representatives of the target protein family. These k-mers, along with the positions of the peptides in the HMM representation, define a set of search start points.
Contigs are then assembled by applying graph path-finding algorithms in both directions on the combined De Bruijn-HMM graph structure. Using this technique, we have been able to extract complete nifH protein-coding regions from several 50G soil metagenomes, including metagenomes from an Iowa great prairie soil and soils planted with Miscanthus and Switchgrass, two potential biofuel crops. In addition, we have extracted complete but genes, which code for butyryl-CoA transferase, from human gut metagenomes.
Future work will focus on separating sequencing artifacts from low-coverage rare populations.
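The start-point heuristic described above can be sketched roughly as follows. This is a minimal illustration, not Xander's implementation: the function names are hypothetical, only the forward strand is scanned, and the position of a peptide within a reference protein stands in for its position in the HMM.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(kmer):
    """Translate a nucleotide k-mer (k divisible by 3) in frame 0."""
    return "".join(CODON_TABLE.get(kmer[i:i + 3], "X")
                   for i in range(0, len(kmer) - 2, 3))


def peptide_index(ref_proteins, pep_len):
    """Map each reference peptide of length pep_len to the first position seen.

    Assumption: this position is a proxy for the HMM model position.
    """
    index = {}
    for prot in ref_proteins:
        for pos in range(len(prot) - pep_len + 1):
            index.setdefault(prot[pos:pos + pep_len], pos)
    return index


def find_starts(reads, ref_proteins, k):
    """Return {nucleotide k-mer: peptide position} for k-mers whose
    translation occurs in at least one reference protein."""
    if k % 3 != 0:
        raise ValueError("k must be a multiple of 3")
    index = peptide_index(ref_proteins, k // 3)
    starts = {}
    for read in reads:
        # Sliding by one base covers all three forward reading frames.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            pep = translate(kmer)
            if pep in index:
                starts[kmer] = index[pep]
    return starts
```

For example, `find_starts(["CCATGGCTAAAGG"], ["MAK"], 9)` keeps only the k-mer `ATGGCTAAA`, which translates to `MAK`. A fuller version would also scan the reverse complement of each read.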
INTRODUCTION
METHODS
De Bruijn transitions
Combined graph transitions
HMM states
This work was funded in part by the DOE Great Lakes Bioenergy Research Center (DOE BER Office of Science DE-FC02-07ER64494, DOE OBP Office of Energy Efficiency and Renewable Energy DE-AC05-76RL01830), the Great Prairie Soil Metagenomes Project sponsored by DOE’s Joint Genome Institute (piloting for DOE’s Grand Challenge Program), and NIH/PHS Human Microbiome Project (The Role of Gut Microbiota in Ulcerative Colitis grant UH3-DK083993-02).
Xander is a De Bruijn graph assembler designed for gene-targeted metagenomic assembly. We use a space-efficient graph representation to enable scaling to large datasets. Xander is a local assembly tool: starting from a node in the graph, we walk in each direction, using a Hidden Markov Model as a guide, to assemble genes of interest. To explore population-level diversity, we have developed methods to find additional, sub-optimal paths.
PER-GENE PREPARATION
Select high-quality representative sequences
Build Forward and Reverse HMMs
Select reference set of known protein sequences
SUB-OPTIMAL PATHS
To capture the population-level diversity in metagenomic samples, we implemented a modified version of Yen's K-Shortest Path Algorithm. The original algorithm returns the K shortest paths even when those paths consist of the same set of nodes; we are instead interested in paths that contain new nodes. Once we have the K shortest paths, we extract the subgraph induced by the nodes they contain.
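The two post-processing steps described here, keeping paths that add new nodes and extracting the induced subgraph, might look like the sketch below. It assumes the ranked paths have already been produced by some K-shortest-path routine; the poster does not say exactly how Yen's algorithm was modified, so the filter is only one plausible reading, and both function names are hypothetical.

```python
def paths_with_new_nodes(ranked_paths):
    """Keep each path (best first) only if it visits a node not seen so far."""
    seen, kept = set(), []
    for path in ranked_paths:
        if any(node not in seen for node in path):
            kept.append(path)
            seen.update(path)
    return kept


def induced_subgraph(graph, paths):
    """Return the subgraph of `graph` induced by all nodes on `paths`.

    `graph` is an adjacency dict {node: [successor, ...]}; an edge is kept
    only when both of its endpoints lie on at least one path.
    """
    nodes = {node for path in paths for node in path}
    return {u: [v for v in graph.get(u, []) if v in nodes] for u in nodes}
```

Because the subgraph is induced by nodes and not by path edges, it can contain edges that none of the K paths traversed, as long as both endpoints were visited.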
SEARCHING
The De Bruijn Graph and HMM are combined on the fly to create a graph where nodes represent both a k-mer from the De Bruijn graph and HMM state (position in the model and match/insert/delete state).
The edges represent both transitions between k-mers in the De Bruijn graph and transitions between positions in the HMM. Edges are weighted with transition and emission probabilities from the HMM.
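A minimal sketch of how such a combined graph could be expanded on the fly is given below. Everything about the data layout is an assumption made for illustration: the k-mer set, the `hmm` dictionary with position-independent transition probabilities, a single flat insert emission probability, and the choice to advance three bases (one codon) per match or insert step. Edge weights are negative log probabilities, so a lower cost means a more probable step.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_extensions(kmer, kmers):
    """Return (new_kmer, amino_acid) for every 3-base walk that stays in
    the De Bruijn graph, represented here as a plain set of k-mers."""
    frontier = [(kmer, "")]
    for _ in range(3):
        frontier = [(km[1:] + base, codon + base)
                    for km, codon in frontier
                    for base in "ACGT"
                    if km[1:] + base in kmers]
    return [(km, CODON_TABLE[codon]) for km, codon in frontier]


def successors(node, kmers, hmm):
    """Expand a combined node (kmer, model position, state) into a list of
    ((kmer, position, state), cost) pairs, cost = -log(transition * emission)."""
    kmer, pos, state = node
    trans = hmm["trans"]
    steps = codon_extensions(kmer, kmers)
    out = []
    if pos < hmm["length"]:
        # Match: consume one codon and advance one model position.
        for new_kmer, aa in steps:
            p = trans.get((state, "M"), 0.0) * hmm["emit"][pos + 1].get(aa, 0.0)
            if p > 0:
                out.append(((new_kmer, pos + 1, "M"), -math.log(p)))
        # Delete: advance the model position without consuming sequence.
        p = trans.get((state, "D"), 0.0)
        if p > 0:
            out.append(((kmer, pos + 1, "D"), -math.log(p)))
    # Insert: consume one codon and stay at the same model position.
    for new_kmer, aa in steps:
        p = trans.get((state, "I"), 0.0) * (0.0 if aa == "*" else hmm["insert_emit"])
        if p > 0:
            out.append(((new_kmer, pos, "I"), -math.log(p)))
    return out
```

Since nodes are generated only when a search reaches them, the full product of De Bruijn graph and HMM never has to be stored.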
We find the best path from each starting node with the A* search algorithm, taking the probability of the most probable path from the current node as the heuristic value function.
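The search step can be illustrated with a generic A* routine over negative log probabilities. This is a sketch under stated assumptions, not Xander's code: `successors`, `is_goal`, and `heuristic` are caller-supplied hypothetical callbacks, and the closed set assumes the heuristic is consistent, which holds when it is the exact cost of the most probable remaining path.

```python
import heapq
import itertools
import math


def a_star(start, is_goal, successors, heuristic):
    """Return (path, cost) for the cheapest path from start to a goal node.

    successors(node) yields (next_node, edge_cost) with edge_cost >= 0;
    heuristic(node) is a lower bound on the remaining cost to a goal.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    frontier = [(heuristic(start), next(tie), 0.0, start, None)]
    best_cost = {start: 0.0}
    came_from = {}
    closed = set()
    while frontier:
        _, _, cost, node, parent = heapq.heappop(frontier)
        if node in closed:
            continue
        closed.add(node)
        came_from[node] = parent
        if is_goal(node):
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1], cost
        for nxt, edge_cost in successors(node):
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(nxt, math.inf):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), next(tie), new_cost, nxt, node),
                )
    return None, math.inf


# Toy example: edge weights are probabilities, converted to -log costs.
PROBS = {"S": {"A": 0.9, "B": 0.5}, "A": {"G": 0.1}, "B": {"G": 0.8}, "G": {}}
BEST_TO_GOAL = {"S": 0.4, "A": 0.1, "B": 0.8, "G": 1.0}

path, cost = a_star(
    "S",
    lambda n: n == "G",
    lambda n: [(m, -math.log(p)) for m, p in PROBS[n].items()],
    lambda n: -math.log(BEST_TO_GOAL[n]),
)
```

In the toy graph, the locally attractive edge S to A (0.9) leads to a poor overall path (0.09), and the heuristic steers the search to S, B, G (0.4) without expanding A.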
Fig. 3: nifH groups present in miscanthus and switchgrass samples
Total 2,780
Group 1
RESULTS
ACKNOWLEDGMENTS
Miscanthus #1
Miscanthus #2
Switchgrass #1
Switchgrass #2
Fig. 1: Xander Pipeline
GENE: but (butyryl-CoA transferase) Butyrate serves as the major energy source of colonocytes, has anti-inflammatory properties, and regulates gene expression, differentiation and apoptosis in host cells. In healthy individuals, the but pathway is the major pathway for butyrate production in the human gut.
RESULTS
Xander searched and assembled 56 unique protein sequences with length >100. Only two nearly identical sequences were full length. These were very similar (2 and 4 AA substitutions) to a but gene from the HMP reference genome sequence of Acidaminococcus sp. D21, isolated from a healthy human gut.
HMP DATASET
100M 101-bp reads, 15G metagenomic shotgun Human Gut data from an ulcerative colitis (UC) patient who underwent a colectomy followed by ileal pouch anal anastomosis. In this procedure, the entire colon is resected, the terminal ileum is fashioned into a pouch, connected to the anal canal and the intestinal flow is re-established.
Table: Substitutions found in the two full-length but sequences assembled by Xander

Sequence   Nucleotide Substitutions   AA Substitutions
1          4                          20 V I; 194 Q P
2          8                          20 V I; 139 V G; 141 A S; 194 Q P
MOCK DATASET
Gene: Azospirillum brasilense Sp245
Made mock reads using BioGrinder: 100-bp reads, simulated Illumina errors, targeted 10x coverage of the genome
Assembled with Xander searching for nifH
One k-mer selected for sub-optimal path extraction
Examined k-mer coverage of known nifH k-mers
>7 Occurrences
2-7 Occurrences
1 Occurrence
Fig. 4: Small portion of subgraph induced by top 100 paths from one start point
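The k-mer coverage check listed under the mock dataset can be sketched as a simple count-and-bin step. The function names are hypothetical, and the "0" bin for known k-mers that never appear in the reads is an addition for illustration; the other three bins follow the occurrence classes named on the poster.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_bins(known_kmers, counts):
    """Tally known k-mers by how often they occur in the reads."""
    bins = {">7": 0, "2-7": 0, "1": 0, "0": 0}
    for kmer in known_kmers:
        n = counts[kmer]  # Counter returns 0 for k-mers never seen
        if n > 7:
            bins[">7"] += 1
        elif n >= 2:
            bins["2-7"] += 1
        elif n == 1:
            bins["1"] += 1
        else:
            bins["0"] += 1
    return bins
```

A k-mer seen only once is hard to tell apart from a sequencing error, which is why the single-occurrence class is worth tracking separately.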