Folker Meyer, Argonne National Laboratory and University of Chicago. June 14th, 1st EMP meeting, Shenzhen, China. Metagenome Annotation

Folker Meyer: Metagenomic Data Annotation

Folker Meyer's talk from the 1st Earth Microbiome Project meeting in Shenzhen.

Page 1: Folker Meyer: Metagenomic Data Annotation

Folker Meyer, Argonne National Laboratory and University of Chicago

June 14th, 1st EMP meeting, Shenzhen, China

Metagenome Annotation

Page 2: Folker Meyer: Metagenomic Data Annotation

[Figure label: data]

Metagenomics needs the magic wand

• == “shotgun genomics applied directly to various environments” “shotgun metagenomics”

• != sequencing of BAC clones with env. DNA “functional metagenomics”

• != sequencing single genes (16S rDNA) “gene surveys”

Who are they? What are they doing?

Page 3: Folker Meyer: Metagenomic Data Annotation

Portals help with computational analysis

• MG-RAST, IMG/M and CAMERA for metagenomes
– Provide complete project support including metadata input
– Systems allow upload of sequence runs and provide QC, feature identification, feature annotation, views and comparison
– Systems provide lots of public samples to compare to
  • MG-RAST: 4,000+ public samples (June 2011)
– Google will reveal URLs
• QIIME for amplicon studies
– Provides support for amplicon analysis
– Large number of public amplicon samples
– Advanced visualization capabilities with rich metadata
– Integration with other tools including MG-RAST

Page 4: Folker Meyer: Metagenomic Data Annotation

2010 state of metagenomics

• 8492 metagenomes from > 500 groups
• Over 20 GB per week (rapid growth)
• Many centers produce data
• This was a few weeks ago

Page 5: Folker Meyer: Metagenomic Data Annotation

2011: many small scale projects

• ~25,000 data sets, hundreds of groups
• ~4000 public, with metadata, 45 GBp
• >> 1 Terabase (10^12 base pairs)

Page 6: Folker Meyer: Metagenomic Data Annotation

Even data upload is hard! Jumploader

Thanks to Rob Knight’s team for pointing us there

Page 7: Folker Meyer: Metagenomic Data Annotation

Part of an emerging digital biology

• Users (dots) sharing pre-publication metagenomes (edges)

Source: MG-RAST, 800+ shared metagenomes

Page 8: Folker Meyer: Metagenomic Data Annotation

Computing costs dominate

Source: Rob Knight, UColorado

“Living on the log scale” (Guy Cochrane, EBI, UK)

From: Wilkening et al., IEEE Cluster09, 2009

[Figure labels: computing, sequencing]

Page 9: Folker Meyer: Metagenomic Data Annotation

Challenges during shotgun metagenome analysis

1. Quality Control
2. Finding features
3. Characterizing features
4. Presentation

Page 10: Folker Meyer: Metagenomic Data Annotation

Quality control for de-novo sequencing

• Question is simple: How trustworthy is my data?
– “rare biosphere debate”: de-noising for amplicon runs
– No such tool for shotgun data
• Existing QC approaches rely on:
– Using reference sequences
– Using vendor specific scores
  • Includes e.g. phred scores
• None of those are suitable for what we are doing
• EMP needs novel quality control to ensure comparisons work

Approaches utilizing artifacts of sequencing and library prep processes show promising results
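
For reference, a phred score Q encodes a per-base error probability P through Q = -10 log10(P). The sketch below is a minimal illustration of what such vendor scores claim about a read, assuming FASTQ quality strings with an ASCII offset of 33. The function names are illustrative and do not come from any tool named in this talk.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def expected_errors(quality_string: str, offset: int = 33) -> float:
    """Sum the per-base error probabilities of one FASTQ quality string.

    Assumes an ASCII offset of 33 (Sanger-style encoding); older Illumina
    files that use offset 64 need offset=64.
    """
    return sum(phred_to_error_prob(ord(ch) - offset) for ch in quality_string)


if __name__ == "__main__":
    # Under offset 33, 'I' encodes Q40 and '+' encodes Q10.
    print(expected_errors("IIII++"))
```

These numbers are only as trustworthy as the vendor's calibration, which is the concern raised on this slide.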

Page 11: Folker Meyer: Metagenomic Data Annotation

Tell me if my data set is of type A or B

A) Lots of errors: ~10% at 70 bp

Real data sets from MG-RAST

B) Errors only at tail

K. Keegan, in preparation

% duplicates varies also
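
One simple way to put a number on "% duplicates" is to count reads that share an identical prefix. This is a rough sketch under that assumption; it is a plain heuristic for illustration, not the procedure used in MG-RAST or in the work cited above, and `prefix_len` is an arbitrary illustrative parameter.

```python
from collections import Counter
from typing import Iterable


def duplicate_fraction(reads: Iterable[str], prefix_len: int = 50) -> float:
    """Return the fraction of reads that repeat an already-seen prefix.

    Reads sharing the same first `prefix_len` bases are treated as
    duplicates of each other (a heuristic chosen for this sketch).
    """
    counts = Counter(read[:prefix_len].upper() for read in reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence beyond the first of each prefix counts as a duplicate.
    duplicates = total - len(counts)
    return duplicates / total


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "ACGT", "TTTT"]
    print(duplicate_fraction(reads, prefix_len=4))
```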

Page 12: Folker Meyer: Metagenomic Data Annotation

Finding features

• Protein coding features
– Statistics based approaches:
  • Using e.g. codon usage trained on existing genomes
  • MGA, Metagene, FragGeneScan, Prodigal, MetageneMarkHMM
  • Limitation: novel proteins are harder, as are genomic islands and transferred genes
– Similarity based approaches
  • Blastx search against
  • Limitation: Runtime + Novel proteins will never be found….
• Running more specialized tools e.g. RFAM is often not feasible for large scale data sets

EMP will enable systematic search for novel proteins (think of CRISPRs from AMD)
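
To make the "codon usage" signal concrete, here is a minimal sketch that tabulates relative codon frequencies for a known coding sequence. It shows only the kind of statistic such gene finders can be trained on; it is not the model used by any of the tools listed above, and the function name is illustrative.

```python
from collections import Counter
from typing import Dict


def codon_usage(cds: str) -> Dict[str, float]:
    """Return relative codon frequencies for an in-frame coding sequence.

    Trailing bases that do not form a full codon, and codons containing
    characters other than A/C/G/T, are ignored.
    """
    seq = cds.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    valid = [c for c in codons if set(c) <= set("ACGT")]
    counts = Counter(valid)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


if __name__ == "__main__":
    print(codon_usage("ATGAAAAAATAA"))
```

A statistical gene finder compares such frequencies in a candidate reading frame against those learned from reference genomes, which is why genes with atypical composition are harder to detect.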

Page 13: Folker Meyer: Metagenomic Data Annotation

Performance Analysis on simulated data sets w/ errors

W. Trimble, in preparation

Page 14: Folker Meyer: Metagenomic Data Annotation

Characterizing features

• Describe sequences by comparison to existing databases
– GenBank, GO, KEGG, COGs, SEED, STRING, ..
– Use sequence similarity to define:
  • Function: function string(s), EC number, GO number, …
  • Taxonomic origin
• Algorithms (not exhaustive)
– BLAST (default, sensitive, too expensive)
– BLAT (well tested, no parallel, a bit less sensitive)
– Suffix array based (fast, limited mis-matches)
– HMM based (HMMer 3.0 is as fast as BLAST)
– We haven’t tested RAPsearch2
• Similarity search cost is high, repeat searches are required
– Think of Nikos’ MEP (next talk)
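
Whatever search tool is used, the result usually has to be reduced to one annotation per sequence. The following sketch assumes hits are stored in the common 12-column tab-separated BLAST tabular layout (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score) and keeps the top-scoring hit per query. The e-value cutoff is an arbitrary illustrative default, not a recommendation from the talk.

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def best_hits(handle: TextIO, max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Map each query ID to (subject ID, bit score) of its top-scoring hit.

    Expects 12-column tab-separated BLAST tabular output; hits with an
    e-value above `max_evalue` are discarded.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip comment lines and malformed rows
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


if __name__ == "__main__":
    demo = "read1\tprotA\t90\t50\t5\t0\t1\t50\t1\t50\t1e-10\t80.0\n"
    print(best_hits(io.StringIO(demo)))
```

Taking only the best hit is the simplest policy; a real pipeline may prefer to keep several hits per read.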

Page 15: Folker Meyer: Metagenomic Data Annotation

Presentation layer

Page 16: Folker Meyer: Metagenomic Data Annotation

MG-RAST v3 workflow (simplified)

Upload

QC / normalization

Similarities (Parallel Blat)

Metabolic reconstruction

Community reconstruction

SFF, fastq and fasta data

find emPCR and BridgePCR artifacts

Metadata

Feature prediction (FGS)

find coding regions/peptides using FragGeneScan (Ye, NAR 2010)

Abundance profiles

Metabolic model

Many databases integrated: GSC’s M5nr
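
As a rough illustration of the "Abundance profiles" step, the sketch below turns per-read annotations into relative abundances. It assumes a mapping from read ID to a single label (a function or a taxon) is already available from the similarity step. The names and the one-label-per-read simplification are assumptions made for this example, not a description of the MG-RAST implementation.

```python
from collections import Counter
from typing import Dict, Mapping


def abundance_profile(read_to_label: Mapping[str, str]) -> Dict[str, float]:
    """Return the relative abundance of each label among annotated reads."""
    counts = Counter(read_to_label.values())
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Hypothetical annotations for four reads.
    annotations = {"r1": "kinase", "r2": "kinase", "r3": "transporter", "r4": "protease"}
    print(abundance_profile(annotations))
```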

Page 17: Folker Meyer: Metagenomic Data Annotation

The future

Source: Rob Knight, UColorado

Driving force

• “Living on the log scale” (Guy Cochrane, EBI, UK)
• “Data bonanza” (Dawn Field, Oxford UK)
• “Metadata are essential for turning data into knowledge” (Rob Knight, U Colorado, USA)

600 GBp / run

60 GBp / run

Page 18: Folker Meyer: Metagenomic Data Annotation

Future 1: World is not clonal, study strain/species variation

• “Pangenome view” allows definition of strains

new strain?

Page 19: Folker Meyer: Metagenomic Data Annotation

Future 2: Expand metadata (1): MIMS/MIMARKS

• Genomic Standards Consortium (GSC) provides
– Extensible metadata standards
– Environmental packages allow domain specific extension
• Groups starting to build environmental packages
• MG-RAST v3 supports GSC metadata standards

Use metadata
Select data sets to compare to based on:
- Biome, location, sampling procedure, …

Capture metadata
Extensive metadata questionnaire supporting offline editors // input
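
Selecting comparison data sets by metadata can be sketched as a simple filter. The example assumes each data set is described by a flat dictionary; the keys used here (`id`, `biome`, `location`) are hypothetical and are not the official GSC field names.

```python
from typing import Dict, Iterable, List

Sample = Dict[str, str]


def select_samples(samples: Iterable[Sample], **criteria: str) -> List[Sample]:
    """Return samples whose metadata match every given key/value pair.

    Matching is case-insensitive; samples missing a requested key are excluded.
    """
    wanted = {key: value.lower() for key, value in criteria.items()}
    return [
        s for s in samples
        if all(key in s and s[key].lower() == value for key, value in wanted.items())
    ]


if __name__ == "__main__":
    # Hypothetical records; the field names are illustrative only.
    samples = [
        {"id": "mgA", "biome": "Soil", "location": "China"},
        {"id": "mgB", "biome": "marine", "location": "USA"},
    ]
    print(select_samples(samples, biome="soil"))
```

The filter only works if the metadata were captured consistently in the first place, which is the point of agreeing on a standard.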

Page 20: Folker Meyer: Metagenomic Data Annotation

Expand metadata support (2): Capture metadata early

Imagine adding metadata to the plot below:

Very hard after the fact!

capture metadata early

Aanensen et al., PLoS ONE, 2009

Page 21: Folker Meyer: Metagenomic Data Annotation

Many current challenges and pitfalls

• Assembly (state of the art: hard)
– Several groups are working actively on metagenome assemblers
– Quotes from Mihai Pop (UMaryland):
  • “metagenomes can’t be assembled” and “all assemblers are equal”
• Rare k-mer filtering (state of the art: DO NOT)
– C. Titus Brown (MSU): “Friends don’t let friends filter rare k-mers”
• Binning (state of the art: use k-mers)
– Traditional binning does not work for short reads (Alice C. McHardy)
– K-mer based binning can produce organism sized bins (Titus’ work)
• Sequence quality
– Quality really matters and vendors lie all the time
• Metadata
– Challenge for the next few years is to add metadata
• Cloud computing
– Does not change the cost structure
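
Both the filtering and the binning points above rest on k-mer counts. As a minimal sketch, the code below counts k-mers across reads and summarizes how many distinct k-mers occur once, twice, and so on. It ignores reverse complements and memory limits, so it illustrates the idea only and is not any of the methods cited on this slide.

```python
from collections import Counter
from typing import Dict, Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer (overlapping window of length k) across all reads."""
    counts: Counter = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def abundance_spectrum(kmer_counts: Counter) -> Dict[int, int]:
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return dict(Counter(kmer_counts.values()))


if __name__ == "__main__":
    counts = count_kmers(["ACGTACGT"], k=4)
    print(counts)
    print(abundance_spectrum(counts))
```

In a metagenome, k-mers that occur only once may come from low-abundance organisms as well as from sequencing errors, which is why discarding them can remove real signal.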

Page 22: Folker Meyer: Metagenomic Data Annotation

Metagenome transport format (MTF)

• Input sequences (“from the machine”)
    ▫ FASTA, FASTQ, SFF (maybe Archive BAM)
• Transformed sequences (“after QC”)
    ▫ FASTA
• Feature coordinates (“after genefinding”)
    ▫ GFF3/GTF
• Similarities (“the BIG computation”)
    ▫ Blast/BLAT/.. results
• Metadata (context, “the important stuff”)
    ▫ GSC compliant MIMS format
• Workflow description (“provenance”)
    ▫ What did we do? (not in shell script!)
    ▫ What version of code // databases did we use
    ▫ Who computed where
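
One way to picture such a transport format is as a manifest that lists the files for each component together with a machine-readable provenance record. The sketch below is purely hypothetical: the class names, field names and JSON layout are assumptions made for illustration and do not represent a published MTF specification.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class WorkflowStep:
    """One processing step: what was run, with which versions, by whom, where."""
    name: str
    tool: str
    tool_version: str
    database_versions: Dict[str, str] = field(default_factory=dict)
    computed_by: str = ""
    computed_at: str = ""


@dataclass
class TransportManifest:
    """Hypothetical manifest tying together the files of one data set."""
    input_sequences: List[str]
    transformed_sequences: List[str]
    feature_coordinates: List[str]
    similarities: List[str]
    metadata: str
    workflow: List[WorkflowStep] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the manifest, including nested workflow steps, to JSON."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    # All file names and version strings below are placeholders.
    manifest = TransportManifest(
        input_sequences=["sample.fastq"],
        transformed_sequences=["sample.qc.fasta"],
        feature_coordinates=["sample.gff3"],
        similarities=["sample.blat.tsv"],
        metadata="sample.mims.txt",
        workflow=[WorkflowStep(name="gene calling", tool="FragGeneScan", tool_version="x.y")],
    )
    print(manifest.to_json())
```

Recording the workflow as data rather than as a shell script is what lets another site check which code and database versions produced a result.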

Page 23: Folker Meyer: Metagenomic Data Annotation

Acknowledgements

MG-RAST team
• Daniela Bartels
• Narayan Desai
• Mark d’Souza
• Elizabeth M. Glass
• Travis Harrison
• Kevin Keegan
• Tobias Paczian
• William Trimble
• Andreas Wilke
• Jared Wilkening

Metadata:
• Dawn Field, Oxford
• Renzo Kottmann, MPI Bremen
• and all of GSC

M5/QC collaboration with
• Nikos Kyrpides, JGI
• Kostas Konstantinidis, JGI

M5 standards
• Sarah Hunter, EBI

CLOVR
• Sam Angiuoli, Owen White (HMP DACC)

QIIME
• Rob Knight (Colorado)

INSDC submission/archiving
• Guy Cochrane/EBI

Page 24: Folker Meyer: Metagenomic Data Annotation

Thank you for your attention
