Metagenomics workflow Fanny-Dhelia Pajuste Supervisor: Balaji Rajashekar 12.12.2014


  • Metagenomics workflow

    Fanny-Dhelia Pajuste
    Supervisor: Balaji Rajashekar

    12.12.2014

  • Metagenomics

    ● Studying genomic sequences directly from environmental samples

    ● Samples contain sequences of thousands of different organisms

    ● Can be used for:
      ○ personalized medicine
      ○ environmental studies
      ○ agriculture, etc.

  • Metagenomics Tasks

    Identifying:
    ● organisms (species, strains, ...)
    ● the abundance of organisms
    ● genes
    ● functions

  • Data

    ● Usually sequenced using next generation sequencing methods

    ● Contains reads from thousands of organisms
    ● Publicly available data from:
      ○ MG-RAST (over 140 000 metagenomes)
      ○ IMG
      ○ EBI

  • Data for This Project

    ● From MG-RAST
    ● Metagenome of the human oral cavity under healthy and diseased conditions
    ● Eight samples
    ● Different oral health status

  • Workflow and Methods

  • Data Preprocessing

    Reads are filtered based on:
    ● quality
    ● length
    ● ambiguous bases
    ● (replication)

    Reads from some species can also be removed from the data
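The filtering criteria above can be sketched in a few lines. This is only an illustration, not the pipeline used in the project: the thresholds (`min_len`, `min_mean_q`, `max_n`), the function names and the in-memory read representation are all assumptions, and "replication" is interpreted here as dropping exact duplicate sequences.

```python
from typing import Iterable, Iterator, List, Tuple

# A read is (read ID, bases, per-base Phred quality scores); this layout is an assumption.
Read = Tuple[str, str, List[int]]


def passes_filters(seq: str, quals: List[int],
                   min_len: int = 75, min_mean_q: float = 20.0,
                   max_n: int = 0) -> bool:
    """Return True if a read passes the length, ambiguity and quality checks."""
    if len(seq) < min_len or not quals:
        return False                      # too short (or no quality data)
    if seq.upper().count("N") > max_n:
        return False                      # too many ambiguous bases
    return sum(quals) / len(quals) >= min_mean_q


def filter_reads(reads: Iterable[Read], **thresholds) -> Iterator[Read]:
    """Yield reads that pass all filters, dropping exact duplicate sequences."""
    seen = set()
    for read_id, seq, quals in reads:
        if seq in seen:
            continue                      # duplicate of a read already kept
        if passes_filters(seq, quals, **thresholds):
            seen.add(seq)
            yield read_id, seq, quals
```

With `min_len=8`, a short read, a read containing `N`, a low-quality read and an exact duplicate would all be dropped, leaving only the first good copy.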

  • Assembly

    ● Assembling to larger DNA sequences (contigs and scaffolds)

    ● Uses de Bruijn or overlap graphs
    ● Depends on type of reads
    ● Might need manual inspection (errors)

    http://genome.jgi-psf.org/help/scaffolds.html
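To make the de Bruijn graph idea concrete, here is a toy sketch: each k-mer in the reads becomes an edge from its (k-1)-prefix to its (k-1)-suffix, and a contig is read off by following unbranched edges. The function names and the example reads are made up for illustration; real assemblers also handle reverse complements, sequencing errors and coverage, none of which is attempted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    edges: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges


def extend_contig(edges: Dict[str, Set[str]], start: str) -> str:
    """Follow edges from `start` while the path is unambiguous (one successor)."""
    contig, node, visited = start, start, {start}
    while len(edges.get(node, ())) == 1:
        (nxt,) = edges[node]
        if nxt in visited:                # stop on a cycle
            break
        contig += nxt[-1]                 # each step adds one base
        visited.add(nxt)
        node = nxt
    return contig


graph = build_de_bruijn(["ACGTTG", "GTTGCA"], k=4)
print(extend_contig(graph, "ACG"))        # the two overlapping reads merge into one contig
```

The walk stops at the first branch point, which is where repeats or sequence variants would need the manual inspection mentioned above.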


  • Assembly: Isolate vs Metagenomes

    ● Isolates: assuming a uniform coverage depth across a genome helps with
      ○ Identifying repeat regions
      ○ Estimating the size of a genome

    ● Metagenomes: different coverage depth (relative abundance)
    ● Repeat regions in a single genome vs between multiple genomes
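A small numeric sketch of why the uniform-coverage assumption matters. It uses the standard relation depth = (number of reads × read length) / genome size; the two organisms, their genome sizes and the read counts are invented for illustration only.

```python
def coverage_depth(n_reads: int, read_len: int, genome_size: int) -> float:
    """Average coverage depth: total sequenced bases divided by genome size."""
    return n_reads * read_len / genome_size


def estimate_genome_size(n_reads: int, read_len: int, depth: float) -> float:
    """Invert the same relation to estimate genome size from an observed depth."""
    return n_reads * read_len / depth


# Hypothetical sample: two organisms with equal 2 Mb genomes at 90% / 10% abundance,
# sequenced together as 100 000 reads of 400 bp.
depth_a = coverage_depth(90_000, 400, 2_000_000)   # abundant organism
depth_b = coverage_depth(10_000, 400, 2_000_000)   # rare organism
print(depth_a, depth_b)
```

For an isolate, a region at twice the average depth suggests a repeat. In the mixed sample the two organisms differ in depth simply because of abundance, so depth alone no longer identifies repeats or genome size.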

  • Assembly: Isolate vs Metagenomes

    ● Sequencing errors
      ○ Introduce false overlaps
      ○ Disrupt true overlaps

    ● Error correction using consensus sequence for isolate genomes

  • Assembly: Methods

    ● IDBA-UD
    ● MetaVelvet
    ● SOAPdenovo
    ● MetaSim
    ● Omega

  • Gene Calling

    ● Prediction of genes: identifying protein or RNA sequences coded on the DNA present in the sample
    ● Data used:
      ○ Initial reads
      ○ Assembled contigs
      ○ Both

  • Gene Calling: Approaches

    ● Evidence based:
      ○ The metagenome is searched for genes similar to ones that are already known (homology searches)
    ● Ab initio:
      ○ Without previous knowledge
      ○ Relying on intrinsic features of the DNA
      ○ Can use genes found by evidence-based methods as a training set
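As a heavily simplified illustration of the ab initio idea, the sketch below looks for open reading frames using only signals in the DNA itself: an ATG start codon followed in-frame by a stop codon. It scans the forward strand only and the `min_codons` cutoff is arbitrary; the gene callers listed in this deck rely on statistical models rather than this kind of bare scan.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int, str]]:
    """Return (start, end, sequence) for ATG...stop ORFs on the forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                              # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None                           # close it at the stop codon
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

On short metagenomic reads many genes are truncated at one or both ends, which is one reason assembled contigs are often used as input instead of raw reads.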

  • Gene Calling: Methods

    ● BLAST
    ● CRITICA
    ● Orpheus
    ● GLIMMER
    ● MetaGene

  • Function Calling

    ● Identifying the functions of the organisms in a sample
      ○ What enables the organisms to have certain effects
      ○ Identify the functional relations between samples
    ● We should know the coding and functional capacity of most of the species present in this sample

  • Function Calling: Approaches

    ● Homology based
      ○ Compare predicted query proteins to known sequence databases
      ○ Might not be present in database
      ○ Computationally hard
    ● Motif based
      ○ Same/similar function, but different sequences
    ● + Genomic neighborhood information
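The motif-based approach can be illustrated with a short pattern search. The example below is an assumption-laden sketch: it uses the well-known P-loop (Walker A) nucleotide-binding consensus, `[AG]-x(4)-G-K-[ST]`, which is not mentioned in the slides, and a made-up protein fragment. Tools such as HMMER use profile hidden Markov models, which are far more sensitive than a fixed regular expression.

```python
import re
from typing import List, Tuple

# P-loop / Walker A consensus written as a regular expression (illustrative choice).
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_motif(protein: str, pattern: "re.Pattern[str]" = P_LOOP) -> List[Tuple[int, str]]:
    """Return (0-based position, matched text) for each non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein.upper())]


# Hypothetical protein fragment, not taken from the project data.
print(find_motif("MSGPPGVGKSTLA"))
```

Two proteins can share this motif, and hence a likely function, while having too little overall similarity for a homology search to link them.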

  • Function Calling: Methods

    ● BLAST
    ● HMMER

  • Classification

    ● Also called binning
    ● Identifying the organisms present
    ● Approaches:
      ○ Assembly based
      ○ Marker genes
      ○ Supervised methods
      ○ Unsupervised methods
    ● Different levels of accuracy: species, strains, etc.

  • Classification: Methods

    ● BLAST
    ● MEGAN
    ● PhymmPL
    ● Naive Bayes Classifier
    ● Kraken

  • MEGAN

    ● MEtaGenome ANalyser
    ● BLAST - to search reads against database
    ● NCBI taxonomy - to assign a taxon ID for each sequence
    ● Each read is assigned to the LCA (lowest common ancestor) of the set of taxa it matches
    ● Bottleneck - comparison of sequences
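The LCA assignment step can be sketched on a toy taxonomy. The `parent` map below is a deliberately abbreviated, illustrative tree (intermediate ranks are skipped), and the function names are invented. The sketch shows only the lowest-common-ancestor idea, not MEGAN's actual implementation or its score thresholds.

```python
from typing import Dict, Iterable, List

# Abbreviated toy taxonomy: child taxon ID -> parent taxon ID (root = 1).
PARENT: Dict[int, int] = {
    131567: 1,     # cellular organisms
    2: 131567,     # Bacteria
    976: 2,        # Bacteroidetes
    838: 976,      # Prevotella (intermediate ranks omitted)
    28132: 838,    # Prevotella melaninogenica
    1239: 2,       # Firmicutes
}


def lineage(taxon: int, parent: Dict[int, int]) -> List[int]:
    """Return the path from a taxon up to the root, starting with the taxon itself."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa: Iterable[int], parent: Dict[int, int]) -> int:
    """Return the deepest taxon that is an ancestor of (or equal to) every input taxon."""
    taxa = list(taxa)
    common = set(lineage(taxa[0], parent))
    for t in taxa[1:]:
        common &= set(lineage(t, parent))
    # Walk up from the first taxon; the first shared node is the deepest one.
    return next(node for node in lineage(taxa[0], parent) if node in common)


# A read hitting both a Prevotella species and a Firmicutes sequence
# can only be placed at their shared ancestor.
print(lowest_common_ancestor([28132, 1239], PARENT))
```

A read whose hits all fall within one genus is placed at that genus or below, while a read from a widely conserved gene drifts up towards the root.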

  • MEGAN

  • Kraken

    ● Exact matching of k-mers to databases
    ● Each k-mer is mapped to the LCA of the genomes that contain it
    ● Classification tree - taxa and their ancestors, with the number of k-mers mapped to each as weights
    ● Maximal root-to-leaf paths are calculated
    ● The leaf of the maximal path is used as the classification
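The root-to-leaf scoring described above can be sketched as follows. This is a simplified reading of the approach, not Kraken's code: the `parent` map is a toy tree, taxon 9999 is hypothetical, and ties between equally scored paths are not resolved the way the real tool does.

```python
from typing import Dict, Optional


def classify(kmer_hits: Dict[int, int], parent: Dict[int, int]) -> int:
    """Return the leaf of the highest-scoring root-to-leaf path (0 = unclassified)."""

    def path_score(taxon: Optional[int]) -> int:
        # Sum the k-mer counts of the taxon and all of its ancestors.
        score = 0
        while taxon is not None:
            score += kmer_hits.get(taxon, 0)
            taxon = parent.get(taxon)
        return score

    hit_taxa = [t for t, n in kmer_hits.items() if n > 0]
    # Leaves of the classification tree: hit taxa that are not ancestors of other hits.
    ancestors = set()
    for t in hit_taxa:
        p = parent.get(t)
        while p is not None:
            ancestors.add(p)
            p = parent.get(p)
    leaves = [t for t in hit_taxa if t not in ancestors]
    if not leaves:
        return 0
    return max(leaves, key=path_score)


# Toy tree: 562 under 561 under the root (1); 9999 is a made-up competing taxon.
parent = {561: 1, 562: 561, 9999: 1}
hits = {562: 16, 561: 4, 9999: 5}
print(classify(hits, parent))
```

Here the path through 561 to 562 scores 20 k-mers against 5 for the competing branch, so the read is labelled with taxon 562.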

  • Kraken

  • Kraken

    ● Standard database: 150 GB
    ● MiniKraken - 4 GB
    ● Takes FASTA/FASTQ input
    ● Classifies every sequence from the input file

  • Kraken: Output

    ● Output has five columns:
      ○ C/U - classified or unclassified
      ○ Sequence ID from input file header
      ○ Taxonomy ID for classification
      ○ Length of the sequence in bp
      ○ List of LCA mappings: "562:13 561:4 A:31 0:1 562:3"
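A minimal parser for the five columns listed above might look like this. The record type and function name are illustrative. Kraken writes tab-separated output, but splitting on any whitespace, as done here, also copes with records that have had their tabs converted to spaces.

```python
from typing import List, NamedTuple, Tuple


class KrakenRecord(NamedTuple):
    classified: bool                      # "C" -> True, "U" -> False
    seq_id: str                           # sequence ID from the input header
    tax_id: int                           # taxonomy ID (0 if unclassified)
    length: int                           # sequence length in bp
    lca_mappings: List[Tuple[str, int]]   # (taxon ID or "A", number of k-mers)


def parse_kraken_line(line: str) -> KrakenRecord:
    """Parse one line of Kraken per-sequence output into a KrakenRecord."""
    status, seq_id, tax_id, length, mappings = line.split(None, 4)
    pairs = []
    for item in mappings.split():
        taxon, _, count = item.partition(":")
        pairs.append((taxon, int(count)))  # taxon stays a string because of "A"
    return KrakenRecord(status == "C", seq_id, int(tax_id), int(length), pairs)


rec = parse_kraken_line("C GF8803K01A000R 553174 496 0:216 553174:1 0:83 553174:1 0:165")
print(rec.tax_id, sum(n for _, n in rec.lca_mappings))
```

In the mapping list, "A" marks k-mers containing an ambiguous base and "0" marks k-mers that were not found in the database.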

  • Kraken: Output

    U GF8803K01A0000 0 506 0:476

    C GF8803K01A000R 553174 496 0:216 553174:1 0:83 553174:1 0:165

    U GF8803K01A001D 0 458 0:428

    C GF8803K01A001U 649638 533 0:257 95818:1 0:82 649638:1 0:1 649638:1 0:10 2:1 0:39 541000:1 0:109

    U GF8803K01A0028 0 297 0:267

    U GF8803K01A003Q 0 481 0:451

    U GF8803K01A004I 0 134 0:104

    C GF8803K01A004M 767031 485 0:39 767031:1 0:56 767031:1 0:25 767031:1 0:7 767031:1 0:19 767031:1 0:24 767031:1 0:67 767031:1 0:22 767031:1 0:76 767031:1 0:5 767031:1 0:52 767031:1 0:2 767031:1 0:18 767031:1 0:29 767031:1

    U GF8803K01A0058 0 512 0:482

  • Kraken: Report 1

    71.90 244099 244099 U 0 unclassified

    28.10 95404 10 - 1 root

    28.08 95317 11 - 131567 cellular organisms

    28.07 95296 509 D 2 Bacteria

    12.65 42959 2 - 68336 Bacteroidetes/Chlorobi group

    12.65 42951 29 P 976 Bacteroidetes

    12.54 42563 0 C 200643 Bacteroidia

    12.54 42563 291 O 171549 Bacteroidales

    12.05 40898 0 F 171552 Prevotellaceae

    12.05 40898 1402 G 838 Prevotella

    6.62 22485 0 S 28132 Prevotella melaninogenica
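The report rows above can be read into structured records with a short parser. The column meanings assumed here follow the Kraken manual's description of its report format (percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code, NCBI taxonomy ID, scientific name); the type and function names are illustrative.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float       # percentage of reads in the clade rooted at this taxon
    clade_reads: int     # reads in the clade
    direct_reads: int    # reads assigned directly to this taxon
    rank: str            # rank code, e.g. D, P, C, O, F, G, S, or "-"
    tax_id: int          # NCBI taxonomy ID
    name: str            # scientific name


def parse_report_line(line: str) -> ReportRow:
    """Parse one whitespace-separated row of a Kraken report."""
    pct, clade, direct, rank, tax_id, name = line.split(None, 5)
    return ReportRow(float(pct), int(clade), int(direct), rank, int(tax_id), name.strip())


row = parse_report_line("6.62 22485 0 S 28132 Prevotella melaninogenica")
print(row.rank, row.name, row.clade_reads)
```

Filtering the parsed rows on `rank == "G"` or `rank == "S"` gives a per-genus or per-species abundance table for a sample.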

  • Kraken: Report 2

    d__Viruses 77

    d__Bacteria 95296

    d__Archaea 10

    d__Bacteria|p__Cyanobacteria 20

    d__Bacteria|p__Proteobacteria 1346

    d__Bacteria|p__Firmicutes 39094

    d__Bacteria|p__Deinococcus-Thermus 6

    d__Bacteria|p__Candidatus_Saccharibacteria 54

    d__Bacteria|p__Cloacimonetes 3

    d__Bacteria|p__Fusobacteria 4686

    d__Bacteria|p__Verrucomicrobia 3

  • Kraken: Results

  • Conclusion

    ● Metagenomic data contains fragments of DNA sequences of a great number of organisms in an environmental sample

    ● Metagenomics is an important field - can be used for medicine, environmental studies etc

    ● Can be used to identify organisms, genes or functions

  • Thank you!