Metagenomics workflow
Fanny-Dhelia Pajuste
Supervisor: Balaji Rajashekar
12.12.2014
Metagenomics
● Studying genomic sequences directly from environmental samples
● Samples contain sequences of thousands of different organisms
● Can be used for:
○ personal medicine
○ environmental studies
○ agriculture etc
Metagenomics Tasks
Identifying:
● organisms (species, strains, ..)
● the abundance of organisms
● genes
● functions
Data
● Usually sequenced using next generation sequencing methods
● Contains reads from thousands of organisms
● Publicly available data from:
○ MG-RAST (over 140 000 metagenomes)
○ IMG
○ EBI
Data for This Project
● From MG-RAST
● Metagenome of the human oral cavity under healthy and diseased conditions
● Eight samples
● Different oral health status
Workflow and Methods
Data Preprocessing
Reads are filtered based on:
● quality
● length
● ambiguous bases
● (replication: duplicated reads)
The data can also be purified of reads from some species
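The filtering criteria above can be sketched in a few lines of Python. This is only an illustration: the `Read` type, the function name `filter_reads`, and all threshold values are assumptions made for this example, not the settings used by MG-RAST or in this project.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    qualities: list  # one Phred quality score per base


def filter_reads(
    reads: Iterable[Read],
    min_mean_quality: float = 20.0,  # assumed threshold
    min_length: int = 50,            # assumed threshold
    max_ambiguous: int = 0,          # assumed threshold
    drop_duplicates: bool = True,
) -> Iterator[Read]:
    """Yield only the reads that pass all four filters."""
    seen = set()
    for read in reads:
        seq = read.sequence.upper()
        # Length filter
        if len(seq) < min_length:
            continue
        # Quality filter: mean Phred score over the whole read
        if not read.qualities:
            continue
        if sum(read.qualities) / len(read.qualities) < min_mean_quality:
            continue
        # Ambiguous-base filter: count N characters
        if seq.count("N") > max_ambiguous:
            continue
        # Replication filter: keep only the first copy of identical reads
        if drop_duplicates:
            if seq in seen:
                continue
            seen.add(seq)
        yield read
```

Exact-match duplicate removal is the simplest possible choice here; real pipelines may use prefix matching or clustering instead.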
Assembly
● Assembling to larger DNA sequences (contigs and scaffolds)
● Uses de Bruijn or overlap graphs
● Depends on type of reads
● Might need manual inspection (errors)
http://genome.jgi-psf.org/help/scaffolds.html
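To make the de Bruijn graph idea concrete, here is a minimal sketch. It assumes error-free reads and only follows a path while each node has exactly one distinct successor; it does not check incoming branches, and it is not how any of the assemblers named in this talk are implemented. The function names are chosen for this example.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def extend_unambiguous_path(graph, start):
    """Grow a contig from `start` while the next node is unambiguous."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in visited:
            break  # cycle (for example a repeat): stop extending
        visited.add(node)
        contig += node[-1]
    return contig
```

With three overlapping toy reads, `extend_unambiguous_path(build_de_bruijn_graph(["ATGGCG", "GGCGTA", "CGTACC"], 4), "ATG")` reconstructs the single sequence they were drawn from.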
Assembly: Isolate vs Metagenomes
● Isolate assembly assumes a uniform coverage depth across a genome, which helps with:
○ Identifying repeat regions
○ Estimating the size of a genome
● In a metagenome, coverage depth differs between organisms (relative abundance)
● Repeat regions occur within a single genome vs between multiple genomes
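The uniform-coverage assumption can be illustrated with the basic relation coverage = (number of reads × read length) / genome size. The sketch below uses that relation for an isolate genome; the function names and the `factor` threshold for flagging repeats are assumptions made for this example.

```python
def expected_coverage(num_reads, read_length, genome_size):
    """Average depth if reads are spread uniformly over the genome."""
    return num_reads * read_length / genome_size


def estimate_genome_size(num_reads, read_length, coverage):
    """Invert the same relation to estimate the genome size."""
    return num_reads * read_length / coverage


def flag_possible_repeats(contig_depths, expected_depth, factor=1.8):
    """Return contigs whose depth is well above the expected depth.

    `factor` is an assumed cut-off: a collapsed two-copy repeat would
    show roughly twice the expected depth in an isolate genome.
    """
    return [
        name
        for name, depth in contig_depths.items()
        if depth >= factor * expected_depth
    ]
```

In a metagenome the same depth signal is ambiguous: a contig with doubled depth may be a repeat, or it may simply belong to an organism that is twice as abundant.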
Assembly: Isolate vs Metagenomes
● Sequencing errors
○ Introduce false overlaps
○ Disrupt true overlaps
Error correction using consensus sequence for isolate genomes
Assembly: Methods
● IDBA-UD
● MetaVelvet
● SOAPdenovo
● MetaSim
● Omega
Gene Calling
● Prediction of genes: identifying protein or RNA sequences coded on the DNA present in the sample
● Data used:
○ Initial reads
○ Assembled contigs
○ Both
Gene Calling: Approaches
● Evidence based:
○ The metagenome is searched for genes similar to already known ones - homology searches
● Ab initio:
○ Without previous knowledge
○ Relying on internal features of DNA
○ Can use genes found by the evidence-based approach as a training set
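As a very rough illustration of relying on internal features of DNA, the sketch below scans the three forward reading frames for open reading frames (ATG to a stop codon). This is far simpler than the statistical models used by the gene callers listed in this talk; the function name and the `min_codons` cut-off are assumptions, and the reverse strand is ignored to keep the example short.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=30):
    """Return (start, end, frame) for forward-strand ORFs.

    `start` is the index of the ATG, `end` is one past the stop codon,
    and an ORF is kept only if it has at least `min_codons` codons
    before the stop.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

On short metagenomic reads many genes are truncated at one or both ends, which is one reason dedicated metagenomic gene callers do not require a complete ATG-to-stop ORF.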
Gene Calling: Methods
● BLAST
● CRITICA
● Orpheus
● GLIMMER
● MetaGene
Function Calling
● Identifying the functions of the organisms in a sample
○ What enables the organisms to have certain effects
○ Identify the functional relations between samples
● We should know the coding and functional capacity of most of the species present in this sample
Function Calling: Approaches
● Homology based
○ Compare predicted query proteins to known sequence databases
○ The query might not be present in the database
○ Computationally hard
● Motif based
○ Same/similar function, but different sequences
● + Genomic neighborhood information
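The motif-based idea (same function, different sequences) can be sketched with a regular expression. The example uses the PROSITE pattern PS00017, the ATP/GTP-binding "P-loop" motif `[AG]-x(4)-G-K-[ST]`, translated by hand into Python regex syntax. The function name and the test protein are made up for this illustration; tools such as HMMER use profile models rather than plain patterns.

```python
import re

# PROSITE PS00017 (P-loop): [AG]-x(4)-G-K-[ST], written as a regex
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_motif(protein, pattern=P_LOOP):
    """Return (position, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]


# Hypothetical protein fragment used only to show the call
hits = find_motif("MKTGAPGSGKSTLL")
```

Two proteins can both match such a pattern while sharing little overall sequence similarity, which is the case a pure homology search can miss.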
Function Calling: Methods
● BLAST
● HMMER
Classification
● Also called binning
● Identifying the organisms present
● Approaches:
○ Assembly based
○ Marker genes
○ Supervised methods
○ Unsupervised methods
● Different accuracy - species, strains etc
Classification: Methods
● BLAST
● MEGAN
● PhymmPL
● Naive Bayes Classifier
● Kraken
MEGAN
● MEtaGenome ANalyser
● BLAST - to search reads against a database
● NCBI taxonomy - to assign a taxon ID for each sequence
● Each read is assigned to the LCA (lowest common ancestor) of the set of taxa it matches
● Bottleneck - comparison of sequences
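The LCA assignment can be sketched on a toy taxonomy stored as a child-to-parent dictionary. The tree below is a simplified excerpt written for this example (the entry "Prevotella sp. X" is hypothetical), and the function names are assumptions; MEGAN itself works on the full NCBI taxonomy and applies additional score thresholds not shown here.

```python
# Toy taxonomy: child -> parent ("root" has no parent)
TOY_PARENT = {
    "Bacteria": "root",
    "Bacteroidetes": "Bacteria",
    "Firmicutes": "Bacteria",
    "Prevotella": "Bacteroidetes",
    "Prevotella melaninogenica": "Prevotella",
    "Prevotella sp. X": "Prevotella",  # hypothetical species
}


def path_to_root(taxon, parent):
    """List the taxon followed by all of its ancestors up to the root."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest taxon that is an ancestor of (or equal to) all taxa."""
    taxa = list(taxa)
    common = path_to_root(taxa[0], parent)
    for taxon in taxa[1:]:
        ancestors = set(path_to_root(taxon, parent))
        # Keep only shared nodes; order stays deepest-first
        common = [node for node in common if node in ancestors]
    return common[0]
```

A read matching two species of the same genus is placed at the genus, while a read matching taxa from different phyla ends up as high as the domain, so the assignment becomes less specific as the hits become more diverse.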
Kraken
● Exact matching of k-mers to databases
● Each k-mer is mapped to the LCA of the genomes that contain it
● Classification tree - taxa and their ancestors + number of k-mers mapped to each taxon as weights
● Maximal root-to-leaf paths are calculated
● Leaf is used as the classification
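The classification steps listed above can be sketched as follows. The k-mer table and taxonomy are toy data invented for this example, the function name is an assumption, and ties between equally scoring paths are not handled specially here; this is an illustration of the idea, not Kraken's implementation.

```python
from collections import Counter

# Toy taxonomy: child -> parent ("root" has no parent)
TOY_PARENT = {
    "Bacteria": "root",
    "Bacteroidetes": "Bacteria",
    "Firmicutes": "Bacteria",
    "Prevotella": "Bacteroidetes",
    "Prevotella melaninogenica": "Prevotella",
}


def classify_sequence(sequence, k, kmer_to_taxon, parent):
    """Return the taxon at the end of the heaviest root-to-leaf path."""
    # Step 1: look up every k-mer and count hits per taxon (the weights)
    hits = Counter()
    for i in range(len(sequence) - k + 1):
        taxon = kmer_to_taxon.get(sequence[i:i + k])
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None  # unclassified: no k-mer was found in the database

    # Step 2: score a taxon by summing the weights on its path to the root
    def path_score(taxon):
        score = hits[taxon]
        while taxon in parent:
            taxon = parent[taxon]
            score += hits[taxon]
        return score

    # Step 3: the best-scoring hit taxon is always a leaf of the hit tree,
    # because every hit node adds a positive weight to paths through it
    return max(hits, key=path_score)
```

For the sequence "ACGTAC" with k = 3 and a table mapping ACG to Prevotella, CGT to Prevotella melaninogenica, GTA to Bacteria and TAC to Firmicutes, the path ending in Prevotella melaninogenica collects three k-mers against two for Firmicutes, so the read is assigned to the species.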
Kraken
● Standard database: 150 GB
● MiniKraken - 4 GB
● Takes fasta/fastq
● Classifies every sequence from the input file
Kraken: Output
● Output has five columns:
○ C/U - classified or unclassified
○ Sequence ID from input file header
○ Taxonomy ID for classification
○ Length of the sequence in bp
○ List of LCA mappings: "562:13 561:4 A:31 0:1 562:3"
Kraken: Output
U GF8803K01A0000 0 506 0:476
C GF8803K01A000R 553174 496 0:216 553174:1 0:83 553174:1 0:165
U GF8803K01A001D 0 458 0:428
C GF8803K01A001U 649638 533 0:257 95818:1 0:82 649638:1 0:1 649638:1 0:10 2:1 0:39 541000:1 0:109
U GF8803K01A0028 0 297 0:267
U GF8803K01A003Q 0 481 0:451
U GF8803K01A004I 0 134 0:104
C GF8803K01A004M 767031 485 0:39 767031:1 0:56 767031:1 0:25 767031:1 0:7 767031:1 0:19 767031:1 0:24 767031:1 0:67 767031:1 0:22 767031:1 0:76 767031:1 0:5 767031:1 0:52 767031:1 0:2 767031:1 0:18 767031:1 0:29 767031:1
U GF8803K01A0058 0 512 0:482
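The five-column lines shown above can be parsed with a few lines of Python. This is a sketch: the `KrakenRecord` type and function name are assumptions for this example, and the parser splits on any whitespace so that it works both on tab-separated files and on the lines as printed on the slide.

```python
from typing import NamedTuple


class KrakenRecord(NamedTuple):
    classified: bool
    sequence_id: str
    taxonomy_id: int
    length: int
    lca_mappings: list  # (taxon, k-mer count) pairs in read order


def parse_kraken_line(line):
    """Parse one line of per-sequence Kraken output."""
    fields = line.split()
    status, sequence_id, taxonomy_id, length = fields[:4]
    mappings = []
    for item in fields[4:]:
        taxon, count = item.split(":")
        # The taxon stays a string because it can be "A" (ambiguous k-mer)
        mappings.append((taxon, int(count)))
    return KrakenRecord(
        classified=(status == "C"),
        sequence_id=sequence_id,
        taxonomy_id=int(taxonomy_id),
        length=int(length),
        lca_mappings=mappings,
    )
```

In the sample lines the k-mer counts sum to the sequence length minus 30 (for example 476 for a 506 bp read), which is what one would expect for k = 31, since a read of length n contains n - k + 1 k-mers.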
Kraken: Report 1
% of reads  Reads in clade  Reads at taxon  Rank  Taxon ID  Name
71.90       244099          244099          U     0         unclassified
28.10       95404           10              -     1         root
28.08       95317           11              -     131567    cellular organisms
28.07       95296           509             D     2         Bacteria
12.65       42959           2               -     68336     Bacteroidetes/Chlorobi group
12.65       42951           29              P     976       Bacteroidetes
12.54       42563           0               C     200643    Bacteroidia
12.54       42563           291             O     171549    Bacteroidales
12.05       40898           0               F     171552    Prevotellaceae
12.05       40898           1402            G     838       Prevotella
6.62        22485           0               S     28132     Prevotella melaninogenica
Kraken: Report 2
d__Viruses 77
d__Bacteria 95296
d__Archaea 10
d__Bacteria|p__Cyanobacteria 20
d__Bacteria|p__Proteobacteria 1346
d__Bacteria|p__Firmicutes 39094
d__Bacteria|p__Deinococcus-Thermus 6
d__Bacteria|p__Candidatus_Saccharibacteria 54
d__Bacteria|p__Cloacimonetes 3
d__Bacteria|p__Fusobacteria 4686
d__Bacteria|p__Verrucomicrobia 3
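A report in this lineage-and-count layout is easy to post-process. The sketch below reads such lines into a dictionary and computes the share of each phylum among the bacterial reads; the function names are assumptions made for this example.

```python
def parse_lineage_report(lines):
    """Map each lineage, as a tuple of ranks, to its read count."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last whitespace-separated field
        lineage, count = line.rsplit(None, 1)
        counts[tuple(lineage.split("|"))] = int(count)
    return counts


def phylum_fractions(counts, domain="d__Bacteria"):
    """Return each phylum's fraction of the reads assigned to `domain`."""
    total = counts[(domain,)]
    return {
        lineage[1]: n / total
        for lineage, n in counts.items()
        if len(lineage) == 2 and lineage[0] == domain
    }
```

Applied to the lines above, Firmicutes account for 39094 of the 95296 bacterial reads, about 41%.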
Kraken: Results
Conclusion
● Metagenomic data contains fragments of DNA sequences of a great number of organisms in an environmental sample
● Metagenomics is an important field - can be used for medicine, environmental studies etc
● Can be used to identify organisms, genes or functions
Thank you!