Folker Meyer: Metagenomic Data Annotation

Folker Meyer, Argonne National Laboratory and University of Chicago

June 14th, 1st EMP meeting, Shenzhen, China

Metagenome Annotation


Metagenomics needs the magic wand...

• == “shotgun genomics applied directly to various environments” “shotgun metagenomics”

• != sequencing of BAC clones with env. DNA “functional metagenomics”

• != sequencing single genes (16S rDNA) “gene surveys”

Who are they? What are they doing?

Portals help with computational analysis

• MG-RAST and IMG/M and CAMERA for metagenomes
  – Provide complete project support including metadata input
  – Systems allow upload of sequence runs and provide QC, feature identification, feature annotation, views and comparison
  – Systems provide lots of public samples to compare to
    • MG-RAST: 4,000+ public samples (June 2011)
  – Google will reveal URLs
• QIIME for amplicon studies
  – Provides support for amplicon analysis
  – Large number of public amplicon samples
  – Advanced visualization capabilities with rich metadata
  – Integration with other tools including MG-RAST

2010 state of metagenomics

• 8492 metagenomes from > 500 groups
• Over 20 GB per week (rapid growth)
• Many centers produce data
• This was a few weeks ago

2011: many small-scale projects


• ~25,000 data sets, hundreds of groups
• ~4000 public, with metadata, 45 GBp
• >> 1 terabase (10^12 base pairs)

Even data upload is hard! Jumploader

Thanks to Rob Knight’s team for pointing us there

Part of an emerging digital biology

• Users (dots) sharing pre-publication metagenomes (edges)

Source: MG-RAST, 800+ shared metagenomes

Computing costs dominate

Source: Rob Knight, UColorado

“Living on the log scale” (Guy Cochrane, EBI, UK)

From: Wilkening et al., IEEE Cluster09, 2009

Figure labels: computing, sequencing

Challenges during shotgun metagenome analysis

1. Quality control
2. Finding features
3. Characterizing features
4. Presentation

Quality control for de-novo sequencing

• Question is simple: How trustworthy is my data?
  – “rare biosphere debate” de-noising for amplicon runs
  – No such tool for shotgun data
• Existing QC approaches rely on:
  – Using reference sequences
  – Using vendor specific scores
    • Includes e.g. phred scores
• None of those are suitable for what we are doing
• EMP needs novel quality control to ensure comparisons work

Approaches utilizing artifacts of sequencing and library prep processes show promising results

Tell me if my data set is of type A or B

A) Lots of errors: ~10% at 70 bp

Real data sets from MG-RAST

B) Errors only at the tail

K. Keegan, in preparation

The percentage of duplicates also varies
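The share of duplicate reads in a data set can be estimated with a few lines of code. The following is a minimal sketch, not the MG-RAST QC method (which the slides attribute to work in preparation). It assumes that reads sharing their first `prefix_len` bases count as duplicates; the function name and the default prefix length of 50 are illustrative choices, not taken from the talk.

```python
from collections import Counter


def duplicate_fraction(reads, prefix_len=50):
    """Return the fraction of reads that repeat an already-seen prefix.

    Assumption: two reads are duplicates if their first `prefix_len`
    bases are identical. This is a simplification for illustration.
    """
    if not reads:
        return 0.0
    counts = Counter(read[:prefix_len].upper() for read in reads)
    # Every occurrence of a prefix beyond the first one is a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(reads)


reads = ["ACGTACGT", "ACGTACGT", "ACGTTTTT", "GGGGCCCC"]
print(duplicate_fraction(reads, prefix_len=8))  # 0.25
```

Comparing this fraction across samples gives a rough first indication of how strongly library preparation artifacts differ between data sets.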

Finding features

• Protein coding features
  – Statistics based approaches:
    • Using e.g. codon usage trained on existing genomes
    • MGA, Metagene, FragGeneScan, Prodigal, MetageneMarkHMM
    • Limitation: novel proteins are harder to find, as are genomic islands and transferred genes
  – Similarity based approaches
    • Blastx search against
    • Limitation: runtime, and novel proteins will never be found
• Running more specialized tools, e.g. RFAM, is often not feasible for large scale data sets
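The idea behind the statistics based approaches can be illustrated with a toy codon usage model. This sketch is an assumption-laden simplification and does not reproduce how MGA, FragGeneScan, Prodigal or the other tools work internally: it estimates codon probabilities from known coding sequences and scores each reading frame of a fragment by its log-likelihood ratio against uniform codon usage. All function names and the pseudocount are hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codons(seq, frame=0):
    """Split `seq` into complete codons starting at offset `frame`."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def train_codon_usage(coding_seqs, pseudocount=1.0):
    """Estimate codon probabilities from known coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        counts.update(c for c in codons(seq) if c in CODON_SET)
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def frame_score(seq, usage, frame=0):
    """Log-likelihood ratio of one frame versus uniform usage (1/64)."""
    return sum(math.log(usage[c] * 64)
               for c in codons(seq, frame) if c in usage)


def best_frame(seq, usage):
    """Return the forward frame (0, 1 or 2) with the highest score."""
    return max(range(3), key=lambda f: frame_score(seq, usage, f))


usage = train_codon_usage(["GCT" * 6])
print(best_frame("A" + "GCT" * 4, usage))  # 1
```

A model trained this way can only recognize coding regions that resemble its training genomes, which is the limitation noted in the list above.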

EMP will enable systematic search for novel proteins (think of CRISPRs from AMD)

Performance Analysis on simulated data sets w/ errors

W. Trimble, in preparation

Characterizing features

• Describe sequences by comparison to existing databases
  – GenBank, GO, KEGG, COGs, SEED, STRING, ..
  – Use sequence similarity to define
    • Function: function string(s), EC number, GO number, …
    • Taxonomic origin
• Algorithms (not exhaustive)
  – BLAST (default, sensitive, too expensive)
  – BLAT (well tested, not parallel, a bit less sensitive)
  – Suffix array based (fast, limited mismatches)
  – HMM based (HMMER 3.0 is as fast as BLAST)
  – We haven’t tested RAPsearch2
• Similarity search cost is high, repeat searches are required
  – Think of Nikos’ MEP (next talk)
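Once a similarity search has run, assigning a function or taxonomic origin usually starts from the best hit per read. The sketch below assumes the common 12-column tabular format (as written by BLAST+ with `-outfmt 6` or by BLAT with `-out=blast8`). The e-value cutoff of 1e-5 and the function name are assumptions for illustration, not settings taken from the talk.

```python
import csv
import io

# Column order of the 12-column tabular similarity format.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the top-scoring hit.

    Hits with an e-value above `max_evalue` are ignored (assumed cutoff).
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue, bitscore = float(hit["evalue"]), float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        query = hit["qseqid"]
        if query not in best or bitscore > best[query][2]:
            best[query] = (hit["sseqid"], evalue, bitscore)
    return best


example = io.StringIO(
    "read1\tprotA\t95.0\t50\t2\t0\t1\t150\t10\t59\t1e-20\t100.0\n"
    "read1\tprotB\t99.0\t50\t0\t0\t1\t150\t10\t59\t1e-25\t120.0\n"
    "read2\tprotC\t40.0\t30\t18\t0\t1\t90\t5\t34\t0.5\t25.0\n"
)
hits = best_hits(example)
print(hits["read1"][0])  # protB
```

Storing the raw similarity table and deriving best hits afterwards means the cutoff can be changed without repeating the expensive search.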

Presentation layer

MG-RAST v3 workflow (simplified)

• Upload: SFF, fastq and fasta data; metadata
• QC / normalization: find emPCR and BridgePCR artifacts
• Feature prediction (FGS): find coding regions/peptides using FragGeneScan (Ye, NAR 2010)
• Similarities (Parallel BLAT): many databases integrated (GSC’s M5nr)
• Abundance profiles
• Metabolic reconstruction: metabolic model
• Community reconstruction

The future

Source: Rob Knight, UColorado

Driving force

• “Living on the log scale” (Guy Cochrane, EBI, UK)
• “Data bonanza” (Dawn Field, Oxford, UK)
• “Metadata are essential for turning data into knowledge” (Rob Knight, U Colorado, USA)

600 GBp / run

60 GBp / run

Future 1: World is not clonal, study strain/species variation

• “Pangenome view” allows definition of strains

new strain?

Future 2: Expand metadata (1): MIMS/MIMARKS


• Genomic Standards Consortium (GSC) provides
  – Extensible metadata standards
  – Environmental packages allow domain specific extension
• Groups starting to build environmental packages
• MG-RAST v3 supports GSC metadata standards

Use metadata
Select data sets to compare to based on:
- Biome, location, sampling procedure, …

Capture metadata
Extensive metadata questionnaire supporting offline editors // input
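Selecting data sets to compare against by metadata amounts to a simple filter once the metadata are captured in structured form. The sketch below is illustrative only: the records and the field names `biome` and `country` are hypothetical and are not actual MIMS/MIMARKS field names.

```python
# Hypothetical sample records; field names are assumptions for illustration.
samples = [
    {"id": "s1", "biome": "soil", "country": "China"},
    {"id": "s2", "biome": "marine", "country": "USA"},
    {"id": "s3", "biome": "soil", "country": "USA"},
]


def select_samples(samples, **criteria):
    """Return the samples whose metadata match every key/value pair."""
    return [s for s in samples
            if all(s.get(key) == value for key, value in criteria.items())]


print([s["id"] for s in select_samples(samples, biome="soil")])
# ['s1', 's3']
```

Samples that lack a requested field never match, which is one practical reason to capture metadata at upload time.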

Expand metadata support (2): Capture metadata early


Imagine adding metadata to the plot below:

Very hard after the fact!

capture metadata early

Aanensen et al., PLoS ONE, 2009

Many current challenges and pitfalls


• Assembly (state of the art: hard)
  – Several groups are working actively on metagenome assemblers
  – Quotes from Mihai Pop (UMaryland)
    • “metagenomes can’t be assembled” and “all assemblers are equal”
• Rare k-mer filtering (state of the art: DO NOT)
  – C. Titus Brown (MSU): “Friends don’t let friends filter rare k-mers”
• Binning (state of the art: use k-mers)
  – Traditional binning does not work for short reads (Alice C. McHardy)
  – K-mer based binning can produce organism sized bins (Titus’ work)
• Sequence quality
  – Quality really matters and vendors lie all the time
• Metadata
  – challenge for the next few years is to add metadata
• Cloud computing
  – does not change the cost structure
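To make the phrase "use k-mers" for binning concrete, here is a generic composition-based sketch. It is not the approach credited to Titus Brown above, whose details the slides do not give. It assumes that sequences from the same organism have similar tetranucleotide frequencies and compares profiles by Euclidean distance; the function names and k=4 are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return the normalized frequency vector of all 4**k DNA k-mers."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other characters (e.g. N) are not counted.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def distance(p, q):
    """Euclidean distance between two k-mer profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


at_rich_1 = kmer_profile("ATATATATATAT")
at_rich_2 = kmer_profile("TATATATATATA")
gc_rich = kmer_profile("GCGCGCGCGCGC")
print(distance(at_rich_1, at_rich_2) < distance(at_rich_1, gc_rich))  # True
```

Profiles like these are typically fed into a clustering step; very short reads give noisy profiles because they contain few k-mers.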

Metagenome transport format (MTF)

• Input sequences (“from the machine”)
  ▫ FASTA, FASTQ, SFF (maybe Archive BAM)
• Transformed sequences (“after QC”)
  ▫ FASTA
• Feature coordinates (“after genefinding”)
  ▫ GFF3/GTF
• Similarities (“the BIG computation”)
  ▫ Blast/BLAT/.. results
• Metadata (context, “the important stuff”)
  ▫ GSC compliant MIMS format
• Workflow description (“provenance”)
  ▫ What did we do? (not in shell script!)
  ▫ What version of code // databases did we use
  ▫ Who computed where
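One way to picture such a transport format is a machine-readable manifest with one section per component listed above. The slides do not define a schema, so everything in this sketch is hypothetical: the section names, file names, and the JSON encoding are assumptions, and the version and provenance values are left as placeholders.

```python
import json

# Hypothetical section names mirroring the components in the list above.
REQUIRED_SECTIONS = (
    "input_sequences", "transformed_sequences", "feature_coordinates",
    "similarities", "metadata", "workflow",
)


def make_manifest(**sections):
    """Serialize a manifest to JSON, refusing to omit any section."""
    missing = [name for name in REQUIRED_SECTIONS if name not in sections]
    if missing:
        raise ValueError("missing sections: " + ", ".join(missing))
    return json.dumps(sections, indent=2, sort_keys=True)


manifest = make_manifest(
    input_sequences=["sample.fastq"],
    transformed_sequences=["sample.qc.fasta"],
    feature_coordinates=["sample.genes.gff3"],
    similarities=["sample.similarities.tsv"],
    metadata={"standard": "MIMS", "file": "sample.metadata.json"},
    # Provenance as data rather than a shell script: what, which version,
    # who, and where. Placeholder values are intentionally not filled in.
    workflow=[
        {"step": "feature prediction", "tool": "FragGeneScan",
         "version": "<version>", "run_by": "<who>", "run_at": "<where>"},
    ],
)
print(manifest)
```

Rejecting incomplete manifests makes missing metadata or provenance visible at exchange time rather than after the fact.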

Acknowledgements

MG-RAST team
• Daniela Bartels
• Narayan Desai
• Mark d’Souza
• Elizabeth M. Glass
• Travis Harrison
• Kevin Keegan
• Tobias Paczian
• William Trimble
• Andreas Wilke
• Jared Wilkening

Metadata:
• Dawn Field, Oxford
• Renzo Kottmann, MPI Bremen
  o and all of GSC

M5/QC collaboration with
• Nikos Kyrpides, JGI
• Kostas Konstantinidis, JGI

M5 standards
• Sarah Hunter, EBI

CLOVR
• Sam Anguielo, Owen White (HMP DACC)

QIIME
• Rob Knight (Colorado)

INSDC submission/archiving
• Guy Cochrane/EBI

Thank you for your attention
