The CAMERA Project
Metagenomics 2006
Oct 3-5, 2006
Paul Gilna, Calit2, UCSD
The CAMERA Partnership
Community Cyberinfrastructure for
Advanced Marine Microbial Ecology
Research and Analysis
Genomic Data Is Growing Rapidly, But Metagenomics Will Vastly Increase The Scale…
GenBank (www.ncbi.nlm.nih.gov/Genbank): 100 Billion Bases!
Protein Data Bank (www.rcsb.org/pdb/holdings.html): 35,000 Structures
Total Data < 1TB
The Sargasso Sea Experiment: The Power of Environmental Metagenomics
• Yielded a Total of Over 1 Billion Base Pairs of Non-Redundant Sequence
• Displayed the Gene Content, Diversity, & Relative Abundance of the Organisms
• Sequences from at Least 1800 Genomic Species, Including 148 Previously Unknown
• Identified over 1.2 Million Unknown Genes
MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003
J. Craig Venter, et al., Science, 2 April 2004, Vol. 304, pp. 66-74
Full Genome Sequencing is Exploding: Most Sequenced Genomes are Bacterial
[Figure: completed and ongoing genomes, broken down as Archaeal, Bacterial, Eukaryal. Source: www.genomesonline.org]
• Completed Genomes: Total 422
• Ongoing Genomes: Total 1665
• Metagenomes: 55
• First Genome 1995
• 6 Genomes/Year 2000
• Moore 155 In Here
Moore Microbial Genome Sequencing Project: Selected Microbes Throughout the World’s Oceans
www.moore.org/microgenome/worldmap.asp
Microbes Nominated by Leading Ocean Microbial
Biologists
Moore Microbial Genome Sequencing Project: Cyanobacteria Being Sequenced by Venter Institute
Marine Genome Sequencing Project: Measuring the Genetic Diversity of Ocean Microbes
Metagenomics Will Couple to Earth Observations Which Add Several TBs/Day
[Figure: bar chart of archive holdings by instrument. X-axis: Calendar Year (2001-2014). Y-axis: Cumulative TeraBytes (0-8,000).]
Series: Other EOS, HIRDLS, MLS, TES, OMI, AMSR-E, AIRS-is, GMAO, MOPITT, ASTER, MISR, V0 Holdings, MODIS-T, MODIS-A
Other EOS = ACRIMSAT, Meteor 3M, Midori II, ICESat, SORCE
Terra EOM: Dec 2005; Aqua EOM: May 2008; Aura EOM: Jul 2010
NOTE: Data remains in the archive pending transition to LTA
Source: Glenn Iona, EOSDIS Element Evolution Technical Working Group January 6-7, 2005
Driven by User Needs
• CAMERA serves as one representation of a specific research community’s need for a system to:
– Collect and reference increasing metadata relevant to environmental metagenome datasets
– Exploit the power of querying on metadata across multiple geospatial locations
– Have access to a diverse and customizable set of easy-to-use tools to analyze their data
– Have the ability to add, update and propagate improvements to annotations
– Have a pre-publication, pre-submission collaborative workspace
– Serve diverse levels of informatics literacy
Services Provided
• Data and Application Services
• Tools and Workflows
• Computational Data, Visualization and
Collaborative environment
• Outreach and Training in Environmental Genomics
Data and Application Services
• Primary Data
– Sargasso Sea and Sorcerer II expedition data
– JGI marine & terrestrial environmental datasets
– Moore Microbial Genomes
– JGI and other relevant whole genomes
– Research community submitted datasets
– Submitted 454-based metagenomic datasets
– Publicly available NR protein and DNA sequence datasets
• Derived Data
– Annotations of datasets
– Assemblies
– Alignments
– Pre-computed clusters
Sample Metadata from GOS
• Site Metadata
– Location (lat/long, water depth)
– Site characterization (finite list of types plus “other”)
– Site description (free text)
– Country
• Sampling Metadata
– Sample collection date/time
– Sampling depth
– Conditions at time of sampling (e.g., stormy, surface temperature)
– Sample physical/chemical measurements (T (°C), S (ppt), chl a (mg m⁻³), etc.)
– “author”
• Experimental Parameters
– Filter size
– Insert size
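The metadata fields above could be modelled as a simple record type, which also makes the "querying on metadata across multiple geospatial locations" requirement concrete. The following is only a minimal sketch: the class name `SampleRecord`, the field names, the units (metres, micrometres, kilobases), the helper `in_bounding_box`, and the two example records are all hypothetical illustrations, not the CAMERA schema and not real GOS data.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class SampleRecord:
    """Hypothetical record mirroring the metadata fields listed on the slide."""
    # Site metadata
    latitude: float            # decimal degrees, north positive (assumed convention)
    longitude: float           # decimal degrees, east positive (assumed convention)
    water_depth_m: float
    site_type: str             # one of a finite list of types, or "other"
    site_description: str      # free text
    country: str
    # Sampling metadata
    collected_at: datetime
    sampling_depth_m: float
    conditions: str = ""       # e.g. "stormy"
    measurements: Dict[str, float] = field(default_factory=dict)  # e.g. T, S, chl a
    author: str = ""
    # Experimental parameters (units are assumptions)
    filter_size_um: Optional[float] = None
    insert_size_kb: Optional[float] = None


def in_bounding_box(records: List[SampleRecord], south: float, north: float,
                    west: float, east: float) -> List[SampleRecord]:
    """Return records whose site lies inside the box.

    Simplification: the box must not cross the antimeridian.
    """
    return [r for r in records
            if south <= r.latitude <= north and west <= r.longitude <= east]


# Made-up example records, for illustration only.
records = [
    SampleRecord(31.5, -64.0, 4200.0, "open ocean", "illustrative site A", "Bermuda",
                 datetime(2003, 2, 26, 10, 0), 5.0,
                 measurements={"T_degC": 20.0, "S_ppt": 36.6}, filter_size_um=0.1),
    SampleRecord(9.0, -79.5, 30.0, "coastal", "illustrative site B", "Panama",
                 datetime(2004, 1, 15, 9, 30), 2.0),
]

hits = in_bounding_box(records, south=20.0, north=40.0, west=-70.0, east=-60.0)
print([r.site_description for r in hits])
```

Keeping measurements in a dictionary rather than fixed fields is one way to accommodate the open-ended "etc." in the slide's list; a production schema would likely use controlled vocabularies and explicit units instead.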
Tools and Workflows
• Initial set
– BLAST Server
– Clustering
– HMM/Profile
– Neighborhood analysis
– Multiple sequence alignments
– Assembly
• Proposed New Tools
– Multiple Auto Annotation pipelines
– Fast Sequence lookup
– Customized Assembly
– Phylogenetic Analysis
– Clustering Tools
Guiding Philosophy for Development
• Sprint Q4 2006
– Propagate JCVI toolkit and data ASAP
– Mechanism for publication of Sorcerer II data
– Enabler for community
– Defined deliverables, project management approach
• Marathon Q4 2006 onward
– Additional Datasets
– Additional tools
– Community drives prioritization for ongoing releases
– Advisory Board, Community Outreach
• Keys to success: Tight integration of science, bioinformatics, software, and IT, matched to community needs
The Future Home of the Moore Foundation Funded Marine Microbial Ecology Metagenomics Complex
First Implementation of the CAMERA Complex
Photo Courtesy Joe Keefe, Calit2
Major Buildout of Calit2 Server Room Underway
http://calit2-1101-1.ucsd.edu/
Moore CAMERA Production Environment
• Creation of Initial Production Environment – September 2006
– Hardware:
– Compute Nodes: ~200 4 CPU Nodes = ~800 Processing Cores
– Storage Servers: 10 systems = ¼ Petabyte raw storage
– Database Servers: Larger 20-40TB; Smaller 5-10TB
– Network Management: Force10 E1200 Router w/12 10GigE Interfaces to Each System Ports
• User Access to Compute Cycles
– Bulk of free cycles available to external users
– Proposal mechanism in process
Source: Greg Hidley, Calit2; Phil Papadopoulos, SDSC, Calit2
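The hardware figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one processing core per CPU (as the slide's "~200 4 CPU Nodes = ~800 Processing Cores" implies), decimal units (1 PB = 1000 TB), and storage spread evenly across the 10 systems, none of which the slide states explicitly.

```python
# Figures quoted on the slide (all approximate).
NODES = 200
CPUS_PER_NODE = 4          # assumed: one processing core per CPU
STORAGE_SYSTEMS = 10
RAW_STORAGE_PB = 0.25

# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000

total_cores = NODES * CPUS_PER_NODE
raw_storage_tb = RAW_STORAGE_PB * TB_PER_PB
# Assumption: raw storage is divided evenly across the storage systems.
tb_per_storage_system = raw_storage_tb / STORAGE_SYSTEMS

print(f"Processing cores: ~{total_cores}")
print(f"Raw storage: ~{raw_storage_tb:.0f} TB total, "
      f"~{tb_per_storage_system:.0f} TB per storage system")
```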
www.glif.is
Created in Reykjavik, Iceland 2003
Countries are Aggressively Creating Gigabit Services: Interactive Access to CAMERA and LOOKING Systems
Visualization courtesy of Bob Patterson, NCSA.
CAMERA Outreach Modes
• Scientific Advisory Board
– Early Adopters
– OptIPortal End Points
• Targeted Workshops
– User Forums
– User Software Testing
– Viz Tool Brainstorming
• Presentations at Scientific Meetings
– Talks, posters, eventually demonstration booths
• Partnerships With Metagenomics Projects
– E.g. DoE’s Joint Genome Institute (JGI)
• Training and User Services Team
A Near Future Metagenomics Fiber Optic-Enabled Data Generator
Source: John Delaney, UWash