
Metagenomic Analysis Using MEGAN

Introduction

In metagenomics, the aim is to understand the composition and operation of complex microbial consortia in environmental samples through sequencing and analysis of their DNA.

Similarly, metatranscriptomics and metaproteomics target the RNA and proteins obtained from such samples.

Technological advances in next-generation sequencing methods are fueling a rapid increase in the number and scope of environmental sequencing projects. In consequence, there is a dramatic increase in the volume of sequence data to be analyzed.

http://ab.inf.uni-tuebingen.de/software/megan/welcome.html

Basic computational questions

The first three basic computational tasks for such data are taxonomic analysis, functional analysis and comparative analysis.

These are also known as the “who is out there?”, “what are they doing?” and “how do they compare?” questions. They pose an immense conceptual and computational challenge, and there is a great need for new bioinformatics tools and methods to address them.


History of MEGAN

MEGAN 1 (MEta Genome ANalyzer), 2007: first stand-alone analysis tool for metagenomic short-read data.¹

• Studying the taxonomic content of a single dataset.

MEGAN 2:

• Comparative taxonomic analysis of multiple datasets.

MEGAN 3:

• Provided a functional analysis of metagenome data, based on the Gene Ontology (GO).

MEGAN 4, 2011: GO analyzer replaced by two new functional analysis methods:²

• the SEED classification

• KEGG (Kyoto Encyclopedia of Genes and Genomes).

MEGAN 4 was written by D. H. Huson, with original design by D. H. Huson and S.C. Schuster and contributions from S. Mitra, D.C. Richter, P. Rupek, H.-J. Ruscheweyh and N. Weber.


Algorithms in Bioinformatics

¹ Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. MEGAN analysis of metagenomic data. Genome Res., published online January 25, 2007.

MEGAN was originally designed to analyze metagenomic and metatranscriptomic data. However, metaproteomic data can easily be analyzed as well. MEGAN can now also be used to analyze sequencing reads obtained with an approach targeted at 16S rRNA sequences, as shown here:


[Figure: metagenomic analysis and “targeted” 16S analysis]


Prepare a dataset for use with MEGAN:

First, compare the reads against a database of reference sequences, e.g. with a BLASTX search against the NCBI-NR database.

The reads file and the resulting BLAST file can be imported directly into MEGAN. Taxonomic and functional classification is then performed automatically; the functional classification uses the SEED classification, the KEGG classification, or both.

Multiple datasets can be opened simultaneously to provide comparative views.


Getting started

MEGAN can be used to interactively explore the dataset. The figure shows the assignment of reads to the NCBI taxonomy.

Each node is labeled with a taxon and the number of reads assigned to that taxon.

The size of a node is scaled logarithmically to represent the number of assigned reads.
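As a rough illustration of logarithmic node scaling, the following sketch maps a read count to a drawing radius. The function name `node_radius` and the parameters `min_radius` and `scale` are hypothetical and chosen only for this example; the exact formula MEGAN uses is not given here.

```python
import math


def node_radius(read_count, min_radius=2.0, scale=4.0):
    """Return a drawing radius that grows with the logarithm of the read count.

    min_radius and scale are illustrative values, not MEGAN settings.
    """
    if read_count < 0:
        raise ValueError("read_count must be non-negative")
    # log1p(x) = log(1 + x), so a node with zero reads gets the minimum radius.
    return min_radius + scale * math.log1p(read_count)


# Large differences in read counts become modest differences in node size.
for count in (0, 10, 1000, 100000):
    print(count, round(node_radius(count), 2))
```

With such a scaling, a taxon with 100 times more reads is drawn only a fixed amount larger, which keeps both rare and dominant taxa visible in the same tree.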

Tree display options allow you to interactively drill down to the individual BLAST hits and to export all reads.

One can select a set of taxa and then use MEGAN to generate different types of charts.


Taxonomic analysis

MEGAN attempts to map each read to a SEED functional role, using the highest-scoring BLAST protein match that has a known functional role.

The rooted SEED tree is “multi-labeled” because different leaves may represent the same functional role (if it occurs in different types of subsystems).
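The best-hit rule described above can be sketched as follows. This is a minimal illustration, not MEGAN's implementation: the function `assign_reads_to_roles`, the tuple layout of the hits, and all identifiers in the example data are hypothetical.

```python
def assign_reads_to_roles(hits, role_of_reference):
    """Map each read to the functional role of its best-scoring usable hit.

    hits: iterable of (read_id, reference_id, bit_score) tuples.
    role_of_reference: dict mapping a reference_id to a functional role.
    Hits to references without a known role are skipped.
    """
    best = {}  # read_id -> (bit_score, role)
    for read_id, ref_id, score in hits:
        role = role_of_reference.get(ref_id)
        if role is None:
            continue  # this reference has no known functional role
        if read_id not in best or score > best[read_id][0]:
            best[read_id] = (score, role)
    return {read_id: role for read_id, (_, role) in best.items()}


# Hypothetical example data.
example_hits = [
    ("read1", "refA", 80.0),   # best hit overall, but refA has no role
    ("read1", "refB", 65.0),
    ("read1", "refC", 50.0),
    ("read2", "refC", 42.0),
    ("read3", "refA", 90.0),   # only hit has no role: read stays unassigned
]
example_roles = {"refB": "role X", "refC": "role Y"}

print(assign_reads_to_roles(example_hits, example_roles))
```

Note that `read1` is assigned through its second-best hit, because its top hit has no known functional role.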

The current complete SEED tree has about 13,000 nodes.


Functional analysis using the SEED classification

¹ http://www.theseed.org/wiki/Main_Page

SEED¹ is a comparative genomics environment of curated genomic data. The following figure shows a part of the SEED analysis of a marine metagenome sample.

To perform a KEGG analysis, MEGAN attempts to match each read to a KEGG orthology (KO) accession number, using the best hit to a reference sequence.

Reads are then assigned to enzymes and pathways. The KEGG classification is represented by a rooted tree whose leaves represent pathways.
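Once each read has a KO accession, counting reads per pathway is a simple aggregation. The sketch below assumes that a read is counted once for every pathway that contains its KO; whether MEGAN counts this way is an assumption. The function name and all KO and pathway labels are placeholders, not real KEGG identifiers.

```python
from collections import Counter


def count_reads_per_pathway(read_to_ko, ko_to_pathways):
    """Count how many reads fall into each pathway.

    read_to_ko: dict mapping a read_id to its assigned KO label.
    ko_to_pathways: dict mapping a KO label to the pathways containing it.
    A KO may belong to several pathways; the read is counted in each.
    """
    counts = Counter()
    for ko in read_to_ko.values():
        for pathway in ko_to_pathways.get(ko, ()):
            counts[pathway] += 1
    return counts


# Placeholder labels only.
example_read_to_ko = {"read1": "KO_A", "read2": "KO_A", "read3": "KO_B"}
example_ko_to_pathways = {
    "KO_A": ["pathway 1", "pathway 2"],
    "KO_B": ["pathway 2"],
}

print(count_reads_per_pathway(example_read_to_ko, example_ko_to_pathways))
```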

Each pathway can also be inspected visually, for example the citric acid cycle (shown).

KEGG displays the different participating enzymes as numbered rectangles. MEGAN shades each such rectangle to indicate the number of reads assigned to the corresponding enzyme.


Functional analysis using the KEGG classification

http://www.genome.jp/kegg/pathway.html

MEGAN also supports the simultaneous analysis and comparison of the SEED functional content of multiple metagenomes.

A comparative view of assignments to a KEGG pathway is also possible.


MEGAN supports a number of different methods for calculating a distance matrix.

The resulting distances can be visualized either as a split network calculated using the neighbor-net algorithm, or as a multi-dimensional scaling plot.

The figure shows a comparison of eight marine datasets based on their taxonomic content, computed using Goodall’s index.
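To make the idea of a distance matrix over taxonomic profiles concrete, here is a small sketch. It deliberately uses the Bray-Curtis dissimilarity as a simple stand-in rather than Goodall’s index, which is more involved. The function names and the example profiles are hypothetical.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two taxon -> read-count profiles.

    Returns 0.0 for identical profiles and 1.0 for profiles sharing no taxa.
    """
    taxa = set(profile_a) | set(profile_b)
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    diff = sum(abs(profile_a.get(t, 0) - profile_b.get(t, 0)) for t in taxa)
    return diff / total


def distance_matrix(profiles):
    """Return (names, matrix) with pairwise distances for all datasets."""
    names = sorted(profiles)
    matrix = [[bray_curtis(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix


# Hypothetical read counts per taxon for three datasets.
example_profiles = {
    "sample1": {"Bacteria": 60, "Archaea": 40},
    "sample2": {"Bacteria": 40, "Archaea": 60},
    "sample3": {"Eukaryota": 100},
}

names, matrix = distance_matrix(example_profiles)
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

A matrix of this shape is what a neighbor-net or multi-dimensional scaling method takes as input.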


MEGAN’s analysis window compares multiple datasets. This makes it possible to create distance matrices for a collection of datasets using different ecological indices.

Computational comparison of metagenomes

MEGAN provides a comparison view that is based on a tree in which each node shows the number of reads assigned to it for each of the datasets.

The counts at each node can be displayed as a pie chart, a bar chart or a heat map.


Comparative visualization

Once all datasets have been opened individually, MEGAN provides a “compare” dialog.

The following figure shows the taxonomic comparison of all eight marine datasets.

Here, each node in the NCBI taxonomy is shown as a bar chart indicating the number of reads (normalized, if desired) from each dataset assigned to the node.
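Normalization matters when the datasets differ in size. One simple scheme, assumed here purely for illustration, rescales every dataset to the read total of the smallest one. The function `normalize_counts` and the example data are hypothetical, and MEGAN's own normalization may differ.

```python
def normalize_counts(datasets, target=None):
    """Rescale per-taxon read counts so that every dataset has the same total.

    datasets: dict mapping a dataset name to a taxon -> read-count dict.
    target: desired total per dataset; defaults to the smallest dataset total.
    """
    if not datasets:
        return {}
    totals = {name: sum(counts.values()) for name, counts in datasets.items()}
    if target is None:
        target = min(totals.values())
    normalized = {}
    for name, counts in datasets.items():
        # An empty dataset stays empty instead of causing a division by zero.
        factor = target / totals[name] if totals[name] else 0.0
        normalized[name] = {taxon: n * factor for taxon, n in counts.items()}
    return normalized


# Hypothetical datasets of different sizes (100 and 400 reads).
example_datasets = {
    "small": {"Bacteria": 50, "Archaea": 50},
    "large": {"Bacteria": 300, "Archaea": 100},
}

print(normalize_counts(example_datasets))
```

After rescaling, the bars for the two datasets can be compared directly, because both now sum to the same number of reads.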

How to use BLAST

My personal opinion about how to use BLAST

Preface

The opinions expressed here are solely my own, so please do not hold me responsible for any problems related to using these recommendations. Use them at your own risk. :-)

In the first version of this document, I will focus on describing the relevant parameters of NCBI BLAST, because this is, at least at the moment, the implementation of the BLAST algorithm we are using for our metagenomics analyses. WU-BLAST is another interesting variant, which I tend to use in more complicated scenarios like whole-genome similarity searches or ITS sequence comparison, because in my experience it can be tuned far better to these settings than NCBI BLAST. But if you do not need this level of fine-tuning, NCBI BLAST may be the better choice, since it is under active development and is steadily being enhanced by the NCBI programmers.

The settings discussed in the following paragraphs have been selected with very large databases in mind, like NT (blastn) or NR (blastx). With smaller databases, the expectation value threshold can and should be lowered.

BLAST Parameters for query sequence lengths >= 100 bp

With query sequence lengths of 100 bp or more, NCBI BLAST works quite well with standard settings.

In my experience, it is better to change BLAST's low-complexity filtering strategy to soft filtering. This can be done by appending -F "m D" for blastn, or -F "m S" for blastx. Please note the quotes: they are necessary because there is a blank between the soft masking operator and the character indicating the filter type (DUST for nucleotides, SEG for protein sequences). Without these parameters, some high-scoring pairs (HSPs) containing low-complexity regions may break apart, leading to several HSPs with smaller scores. This is undesirable, since MEGAN could then favor an HSP whose score is higher than that of any of the fragmented HSPs, but lower than that of the unbroken HSP.

When using blastx, one has to decide which translation table to use for the query sequence. In most cases, I prefer code table 11 (option: -Q 11) for bacterial sequences, as it provides some alternative start codons, but this depends heavily on the metagenomics sample. For example, if you are largely dealing with Mollicutes, code table 4 should be preferred. For a description of the codes and their differences to the standard code, please read the NCBI documentation.

There are several options for decreasing the output size, namely changing the expectation value threshold with -e or reducing the maximum number of reported HSPs with -v or -b. Personally, I do not recommend using these parameters, since their impact on run time is marginal or zero. Any filtering can be done later on, and no one wants to redo all BLAST runs because the estimated settings were too low.

BLAST Parameters for short query sequences

For searching sequence similarities within very short fragments, BLAST may not be the best choice. If you want to tackle this anyhow, the word size should be reduced to the minimum, and the expectation value should be adjusted as well. Minimal settings for word size are -W 7 for blastn, and -W 2 for blastx in conjunction with reducing the neighborhood word threshold score to -f 8 or below (this is only necessary for blastx). The expectation value should be -e 100. Yes, that's no joke. When comparing against large databases like NT or NR, such high numbers of expected random hits have to be accepted. A lower E-value threshold could be used when only nearly exact matches are desired.

Additionally, when using blastx, the two-hit algorithm may be disabled by using -P 1, but I cannot say how much influence this parameter has, since I have not used it myself so far.

Please send me any comments, hints, feedback or criticism, so that I can improve this short document. ;-)

Alexander Auch
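The recommendations above can be collected into a small helper that assembles a legacy NCBI `blastall` command line. This is only a sketch: the helper `blastall_args`, its parameters, and the file and database names are hypothetical. The options -p, -d, -i and -o (program, database, query file, output file) are standard `blastall` options not discussed in the text, and the expectation value is passed with -e, the option named above for that threshold.

```python
import shlex


def blastall_args(program, database, query, output,
                  short_reads=False, genetic_code=11):
    """Build an argument list for legacy blastall following the advice above."""
    if program not in ("blastn", "blastx"):
        raise ValueError("only blastn and blastx are covered here")
    args = ["blastall", "-p", program, "-d", database,
            "-i", query, "-o", output]
    # Soft low-complexity filtering: DUST for nucleotides, SEG for proteins.
    args += ["-F", "m D" if program == "blastn" else "m S"]
    if program == "blastx":
        # Translation table for the query, e.g. 11 for bacterial sequences.
        args += ["-Q", str(genetic_code)]
    if short_reads:
        # Permissive E-value threshold and minimal word size for short queries.
        args += ["-e", "100"]
        if program == "blastn":
            args += ["-W", "7"]
        else:
            args += ["-W", "2", "-f", "8"]
    return args


# File and database names are placeholders.
print(shlex.join(blastall_args("blastx", "nr", "reads.fasta", "reads.blastx")))
print(shlex.join(blastall_args("blastn", "nt", "reads.fasta", "reads.blastn",
                               short_reads=True)))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids the quoting issue around the blank in "m D" and "m S"; `shlex.join` is used here only to print a shell-safe version of the command.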



http://ab.inf.uni-tuebingen.de/people/auch/welcome.html

http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c



MEG2DIST

MEG2DIST is a command-line Java program for computing different ecological indices from the taxonomic profiles of each pair of datasets in a collection of metagenome datasets. The indices can be used for visualizing the relationships between multiple metagenomes. The 'Readme' file provides further information about system requirements, the installation process, executing the program, and the input and output file formats.

