DESIGNING THE MICROBIAL RESEARCH COMMONS: AN INTERNATIONAL SYMPOSIUM NATIONAL ACADEMY OF SCIENCES, WASHINGTON, DC, 8-9 OCTOBER 2009 Paul Gilna, B.Sc., Ph.D. California Institute for Telecommunications & Information Technology (Calit2) University of California, San Diego Large-Scale Microbial Ecology Cyberinfrastructure (CAMERA)

DESIGNING THE MICROBIAL RESEARCH COMMONS: AN INTERNATIONAL SYMPOSIUM

NATIONAL ACADEMY OF SCIENCES, WASHINGTON, DC, 8-9 OCTOBER 2009

Paul Gilna, B.Sc., Ph.D.

California Institute for Telecommunications & Information Technology (Calit2)

University of California, San Diego

Large-Scale Microbial Ecology Cyberinfrastructure (CAMERA)

Global Scientific Research Cyber-Community

• 3100 users
• 70 countries

CAMERA 2.0 Objectives

• CAMERA serves as one representation of a specific research community’s need for a system to:
- Provide a metadata-rich family of scalable databases and make them available to the community
- Collect and reference increasing metadata relevant to environmental metagenome datasets
- Exploit the power of querying on metadata across multiple geospatial locations
- Provide a facility that allows a diversity of software tools to be easily integrated into the system (and sufficient compute resources to support these analyses)

The Semantically Aware DB Schema

• Some key features of the semantically aware DB schema:
- Environmental parameters: Modeled more generally, to accommodate any environment and any parameter within an environment.
- Sequence: Separate “registries” for DNA, rRNA, mRNA, viral segments, reference genomes, etc. Sequence annotations are independently searchable.
- Workflow connection: Every computed property is associated with the workflow instance that created it.
- Associated data: Data not produced in CAMERA but often used for analysis and comparison.
- Ontologies: All metadata and all measured and observed parameters are connected to ontologies whenever possible.
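The schema features listed above can be pictured with a minimal sketch. It assumes a generic parameter record (one row per sample, parameter name, and value) and a workflow run ID stored on every computed property. All class names, field names, and sample values are hypothetical and are not taken from CAMERA's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class EnvParameter:
    """One measured or observed parameter for a sample (hypothetical layout)."""
    sample_id: str
    name: str                            # any parameter, e.g. "temperature"
    value: float
    unit: str
    ontology_term: Optional[str] = None  # filled in only when a term exists


@dataclass(frozen=True)
class ComputedProperty:
    """A derived value that always records the workflow run that produced it."""
    sequence_id: str
    name: str
    value: str
    workflow_run_id: str


@dataclass
class Registry:
    params: List[EnvParameter] = field(default_factory=list)
    computed: List[ComputedProperty] = field(default_factory=list)

    def samples_where(self, name, low, high):
        """Return sorted sample IDs whose parameter `name` lies in [low, high]."""
        return sorted({p.sample_id for p in self.params
                       if p.name == name and low <= p.value <= high})

    def produced_by(self, workflow_run_id):
        """Return every computed property created by one workflow run."""
        return [c for c in self.computed if c.workflow_run_id == workflow_run_id]


# Made-up example data.
registry = Registry()
registry.params += [
    EnvParameter("sample-A", "temperature", 18.5, "degC"),
    EnvParameter("sample-B", "temperature", 26.0, "degC"),
    EnvParameter("sample-B", "salinity", 35.1, "psu"),
]
registry.computed.append(
    ComputedProperty("seq-1", "protein_family", "example-family", "run-42")
)

warm_samples = registry.samples_where("temperature", 20.0, 30.0)
from_run_42 = registry.produced_by("run-42")
```

Because parameters are rows rather than fixed columns, a new environment or a new kind of measurement needs no schema change in this sketch, which is one plausible reading of "modeled more generally".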

Integration of External Data

• Warehousing
- Reference genomes
- Homologs, COG clusters
- Raster data from slow/complex servers

• Remote data
- KEGG pathways
- NASA MODIS data
- World Ocean Atlas
- Other data that come as “data sets” that do not conform to the schema

NASA Aqua-MODIS satellite data

Metadata: beyond data collected at sampling site

[Figure: MODIS images of sea surface temperature and chlorophyll covering GOS sites #8-12, mid-November 2003]

Integration of Enhanced Metadata

Integrate and browse additional sources of microbial data

CAMERA 2.0 (Data Submission)

Growing the CAMERA Community and Resource…

GBMF Data Acquisition Pipeline: A New Data Submission Paradigm - Metadata First!

1. Investigator submits proposal to GBMF
2. Investigator submits metadata to CAMERA
3. CAMERA sends acknowledgement to Investigator, Seq. Group, GBMF
4. Seq. Group sends barcoded sample “kit” to investigators
5. Seq. Group uploads data to CAMERA (& Investigator)
6. Data & metadata released in six months

• Metadata now collected before sequence data: GSC-compliant
• Project-ID serves as acceptance-proof
• Sample is received and sequenced

Webb Miller and Stephan C. Schuster, and Roche / 454 Genome Sequencer
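The "metadata first" ordering on this slide can be sketched as a small state machine that refuses a data upload until the earlier steps are complete. This is only an illustration: the stage names, the `Submission` class, and the project ID are hypothetical, and the six-month release is assumed to count from the upload date and is approximated as 182 days.

```python
from datetime import date, timedelta
from enum import Enum, auto


class Stage(Enum):
    PROPOSED = auto()
    METADATA_SUBMITTED = auto()
    ACKNOWLEDGED = auto()
    KIT_SENT = auto()
    DATA_UPLOADED = auto()


# Allowed transitions, in the order the slide lists the steps.
NEXT_STAGE = {
    Stage.PROPOSED: Stage.METADATA_SUBMITTED,
    Stage.METADATA_SUBMITTED: Stage.ACKNOWLEDGED,
    Stage.ACKNOWLEDGED: Stage.KIT_SENT,
    Stage.KIT_SENT: Stage.DATA_UPLOADED,
}


class Submission:
    def __init__(self, project_id):
        self.project_id = project_id
        self.stage = Stage.PROPOSED
        self.uploaded_on = None

    def advance(self, to, today=None):
        """Move to the next stage; any out-of-order step raises ValueError."""
        if NEXT_STAGE.get(self.stage) is not to:
            raise ValueError(f"cannot go from {self.stage.name} to {to.name}")
        self.stage = to
        if to is Stage.DATA_UPLOADED:
            self.uploaded_on = today or date.today()

    def release_date(self):
        """Assumed rule: release 182 days after upload, or None before upload."""
        if self.uploaded_on is None:
            return None
        return self.uploaded_on + timedelta(days=182)


sub = Submission("project-0001")

# Uploading sequence data before metadata is rejected.
try:
    sub.advance(Stage.DATA_UPLOADED)
    rejected = None
except ValueError as err:
    rejected = str(err)

# The steps succeed when taken in order.
sub.advance(Stage.METADATA_SUBMITTED)
sub.advance(Stage.ACKNOWLEDGED)
sub.advance(Stage.KIT_SENT)
sub.advance(Stage.DATA_UPLOADED, today=date(2009, 10, 8))
```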

Data Standards

• Minimum Information about a (Meta)Genome Sequence: MIGS/MIMS
• A metadata standard developed by the Genomic Standards Consortium
- Controlled vocabularies, e.g. EnvO, PATO
- Common language: GCDML
• Submissions shall comply with a MIMS/MIGS core, but any metadata can be entered via keywords and free text
• Different metadata submission forms for different habitats (water, soil, air, hosts)
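The "required core plus free-form extras" rule and the habitat-specific forms could be checked along the following lines. This is a sketch under stated assumptions: the field names in `CORE_FIELDS` and `HABITAT_FIELDS` are invented for illustration and are not the actual MIGS/MIMS checklist items.

```python
# Hypothetical required fields; the real MIGS/MIMS core differs.
CORE_FIELDS = {"project_name", "collection_date", "latitude", "longitude", "habitat"}

# Hypothetical extra requirements per habitat-specific submission form.
HABITAT_FIELDS = {
    "water": {"depth_m"},
    "soil": {"depth_m"},
    "air": {"altitude_m"},
    "host": {"host_taxon"},
}


def missing_fields(record):
    """Return the sorted required fields that are absent or empty.

    Keys outside the required set (keywords, free text, anything else)
    are accepted as-is and never reported.
    """
    required = CORE_FIELDS | HABITAT_FIELDS.get(record.get("habitat"), set())
    return sorted(f for f in required if record.get(f) in (None, ""))


record = {
    "project_name": "demo project",
    "collection_date": "2009-10-08",
    "latitude": 32.88,
    "longitude": -117.23,
    "habitat": "water",
    "keywords": ["coastal", "surface"],   # free-form extras are allowed
    "notes": "collected at low tide",
}

before = missing_fields(record)   # the water form also asks for a depth
record["depth_m"] = 2.0
after = missing_fields(record)
```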

User Friendly Compute Environment

CAMERA 2.0 (Computation)

From simple job submission to community-developed and published workflows…

RAMMCAP – Rapid clustering and functional annotation for metagenomic sequences

• RNA finding/filtering
• DNA clustering
- Unique sequences
- Taxonomy / population analysis
• ORF clustering
- ORF calling
- Unique sequences
- Protein families
• ORF and cluster annotation
- Pfam, Tigrfam, COG, etc.
• Features
- Very fast (10-100x) compared to BLAST-based methods
- Effective tools: CD-HIT, HMMERHEAD, meta_RNA, and RPS-BLAST
- Focused functional annotation via curated protein families

[Figure: RAMMCAP pipeline diagram. Labels: metagenomic raw reads; tRNA scan, rRNA scan, meta_RNA (tRNAs, rRNAs); CD-HIT-EST, 95% (DNA clusters, unique DNA sequences); ORF_finder, Metagene (ORFs); CD-HIT, 90-95% (non-redundant ORFs); CD-HIT, 60 or 30% (protein clusters, representative sequences); ORF annotation and cluster annotation against COG, Pfam, Tigrfam using HMMER, HMMERHEAD, RPS-BLAST; more in-depth analysis and further annotation]
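The clustering steps in this pipeline rely on CD-HIT, which is known for a greedy incremental strategy: sequences are processed from longest to shortest, and each one either joins the first existing representative it matches above a threshold or becomes a new representative. The sketch below illustrates only that strategy. The `identity` function is a toy ungapped comparison chosen for brevity, not CD-HIT's actual alignment or word-filter logic, and the reads are made up.

```python
def identity(short, long_):
    """Best ungapped identity of `short` slid along `long_` (toy stand-in).

    Assumes 0 < len(short) <= len(long_).
    """
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(a == b for a, b in zip(short, long_[offset:]))
        best = max(best, matches)
    return best / len(short)


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering: longest first, first matching cluster wins."""
    clusters = {}  # representative sequence -> list of member sequences
    for seq in sorted(seqs, key=len, reverse=True):
        if not seq:
            continue  # skip empty sequences
        for rep in clusters:
            # Representatives are never shorter than `seq` because of the sort.
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no match: seq starts a new cluster
    return clusters


reads = ["ACGTACGTAC", "ACGTACGTAA", "CGTACG", "TTTTGGGGCC"]
clusters = greedy_cluster(reads, threshold=0.9)
```

With these reads, the second read differs from the first at one of ten positions and the short read is an exact substring of the first, so both join its cluster, while the fourth read starts a cluster of its own.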

Annotation workflow

A green box is called an ‘actor’, which performs a task.

This special actor represents an annotation component, such as a BLAST search.

Workflow parameters, which can be specified by users in the portal, are passed to workflow components.

Data flow is divided.

Provenance of Workflow Related Data

• Provenance: a concept from art history and library science
- Inputs, outputs, intermediate results, workflow design, workflow run

• Collected information
- Can be used in a number of ways: validation, reproducibility, fault tolerance, etc.
- Linked to the semantic database
- Viewable and searchable from CAMERA 2.0
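As a rough illustration of what collecting provenance for a workflow run might involve, the sketch below runs a list of steps and records the design name, parameters, and a fingerprint of the inputs, each intermediate result, and the final output. The function names, record layout, and toy steps are assumptions made for this example and do not describe how CAMERA 2.0 stores provenance.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(obj):
    """Short, stable hash of any JSON-serialisable value."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def run_with_provenance(design, steps, inputs, params):
    """Run `steps` in order and return (result, provenance_record)."""
    record = {
        "workflow_design": design,
        "started": datetime.now(timezone.utc).isoformat(),
        "parameters": params,
        "inputs": fingerprint(inputs),
        "intermediate": [],
    }
    data = inputs
    for step in steps:
        data = step(data, params)
        record["intermediate"].append(
            {"step": step.__name__, "output": fingerprint(data)}
        )
    record["outputs"] = fingerprint(data)
    return data, record


# Two toy steps standing in for real workflow components.
def drop_short(reads, params):
    return [r for r in reads if len(r) >= params["min_length"]]


def uppercase(reads, params):
    return [r.upper() for r in reads]


steps = [drop_short, uppercase]
reads = ["acgt", "ac", "ggcc"]
result, prov = run_with_provenance("toy-annotation-v1", steps, reads, {"min_length": 3})

# Re-running with the same inputs and parameters gives the same output
# fingerprint, which is the kind of check that supports reproducibility.
_, prov_again = run_with_provenance("toy-annotation-v1", steps, reads, {"min_length": 3})
```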

http://camera.calit2.net