DOE Genomics: GTL ProgramDOE Genomics: GTL ProgramIT Infrastructure Needs for IT Infrastructure Needs for
Systems BiologySystems Biology
David G. ThomassenOffice of Biological and Environmental Research
DOE Office of ScienceMarch 22, 2004
Experimental:•Complete datasets•Quantitative measurements•Comprehensive physical characterization:
Protein expression and interactions Spatial distributions Process kinetics
Computational:•Automated data analysis and validation•Automated integration of diverse data sets•Human and computer-accessible databases•Molecular, Pathway and cell-level
simulations
The goals require a new synergy
between computing
and biology.
Ultimate Goal is to Provide Ultimate Goal is to Provide Predictive Models of MicrobesPredictive Models of Microbes
This goal drives data collection and computing strategy.
GTL Experiment TemplateGTL Experiment TemplateGenerating Petascale Data SetsGenerating Petascale Data Sets
Experiment templates for a single microbe
class of experiment
time points treatments conditions
genetic variants
biological replication
total biological samples
Proteomics data volume in TB
Metabolite data in TB
Transcription data in TB
simple (scratching the surface) 10 1 3 1 3 90 18.0 13.5 0.018moderate 25 3 5 1 3 1125 225.0 168.8 0.225upper mid 50 3 5 5 3 11250 2250.0 1687.5 2.25complex 20 5 5 20 3 30000 6000.0 4500.0 6real interesting 20 5 5 50 3 75000 15000.0 11250.0 15
Profiling methodProteomics Looking at a possible 6000 proteins per microbe assuming ~200 GB per sample Metabolites Looking a panel of 500-1000 different molecules assuming ~150GB per sampleTranscription 6000 genes & 2 arrays per sample ~100 MB
Typically a single significant scientific question takes the multidimensional analysis of at least 1000 biological samples
While this example does not account for data processing and compression it illustrates how even simple raw data storage will quickly become a bottleneck for biologists.
ATCGTAGCAATCGACCGT...CGGCTATAGCCGTTACCG…TTATGCTATCCATAATCGA...GGCTTAATCGCATACGAC...
Capacity: e.g., High-throughput protein structure predictions, data analysis, sequence comparison
Thread ontotemplates
Bestmatch
Capability: e.g., Large scale biophysical simulations, stochastic regulatory simulations:
Large size and timescale classical simulations
Highly accurate quantum mechanical simulations
GTL Science will Require High GTL Science will Require High Performance Computing for Both Performance Computing for Both CapacityCapacity and and Capability Capability ProblemsProblems
Petascale Petascale CapacityCapacity Problems in BiologyProblems in Biology
Microbial and Community Genome Annotation
Analyze and annotate 20 microbial genomes - (720,000 processor hours)
Now
In 5 years
Assemble, analyze and annotate community of 200 microbes and phage (10,000,000 processor hours)
Compare genome sequences (200 megabases)to previous genomes (4 gigabases) (5,000,000 processor hours)
Petascale Petascale CapabilityCapability Problems in Problems in BiologyBiology
Membrane channel simulation
Simulate non-flexible protein ion channelK+ flow using quantum methods (2,200,000) processor hours for 4 second simulation
Now
In 5 years
Simulate flexible protein ion pumpfor producing ATP from K+ gradient(15,000,000 processor hours for 200nanosecond simulation
2. Data Capture and Archiving
4. Modeling and Simulation
3. Data Analysis / Reduction
1. LIMS & Workflow Management
5. The Community Data Resource
Computing Capabilities for GTL Facilities and Projects
6. I
nfr
astr
uct
ure
CollaborativeProjects
Facilities
High-Performance Computing High-Performance Computing Roadmap for the Roadmap for the Genomics: GTL ProgramGenomics: GTL Program
Biological Complexity
ComparativeGenomics
Constraint-BasedFlexible Docking
1000 TF
100 TF
10 TF
1 TF*
Constrained rigid
docking
Genome-scale protein threading
Community metabolic regulatory, signaling simulations
Molecular machine classical simulation
Protein machineInteractions
Cell, pathway, and network
simulation
Molecule-basedcell simulation
*Teraflops
Current U.S. Computing
Genomics: GTL – A Vision of Genomics: GTL – A Vision of Systems Biology ResearchSystems Biology Research
In 10-15 years we would like to be able to start with a microbe or microbial community of interest and in a matter of days or weeks:
• Generate an annotated DNA sequence
• Produce proteins and molecular tags for most/all proteins
• Identify the majority of multi protein complexes
• Generate a working regulatory network model
• Identify the biochemical capabilities
• Design reengineering or control strategies in silico
Capabilities Needed:
• Map experimental strategies to distributed resources and instrument protocols
• Coordinate experimental process management across cyber collaboratories
• Track the process - sample tracking metadata
• Dynamically optimize experiment workflow
• Process and controls documentation / QA
• Localize problems with data production quality
• Share process data across facilities or projects
• Make production-scale collaborative science possible
1. LIMS and Workflow Management
Track and capture metadata
R & D Challenges and Technologies
Approaches to coordinated process design, optimization, protocol mapping for a large distributed enterprise
Explore LIMS and workflow management systems technology including commercial systems – modify?
Explore approaches to process documentation and control, QA/QC, and process metadata representation – make data reproducible
Develop Collaborative tools, electronic notebooks, web servers for shared access to laboratory data
1. LIMS and Workflow Management
Capabilities Needed:
•Capture bulk data and metadata from many different measurements and instruments in shared large-scale data archives
•Represent Complex Non-standard Data types: mass spectrometry, light microscopy, cryo EM, expression, biophysical & biochemical characterization data…
•Capture and represent data quality, statistical reliability measures, process metadata
•Support deposition, access, transfer and retrieval for archives of multi-petabyte size
Raw data sets
Swimming in Data
2. Data Capture and Archiving
Developing representations and models for data and metadata from many different measurements and assays;
confocal images, video, mass spec, 3D Cryo-EM, . . .
Developing data exchange and format standards for facilities and the community
Hardware infrastructure for rapid and flexible access to very large (petabyte) data volumes. Research new data storage technologies.
Research approaches to design, query and retrieval efficiency in large datasets and with non-standard data types
R & D Challenges and Technologies
2. Data Capture and Archiving
Raw data sets
Capabilities Needed:
Process data from instruments such as mass spectrometers, microscopes, NMR, etc., to reduce and analyze data; e.g.;
•Automatically identify interacting protein events in FRET confocal microscopy
•Identify peptides, proteins, PTMs of interest in mass spectrometry data
•Quantitate changes in / cluster expression data from arrays or mass spectrometry
•Compare metabolite levels under different cell conditions
3. Data Analysis and Reduction
R & D Challenges and technologies
Many types of data, each with algorithm research and development challenges for analyzing data, basic algorithm research needed! e.g.;
- Automated processing of images and video about protein cell localization to achieve analysis high-throughput
- New mass spectrometry algorithms to identify post-translational modifications, cross-linked peptides, and new proteins (De Novo MS), and to automate quantitiation
- Analysis of NMR, Scattering, AFMs . .
Analysis throughput likely to be an issue; Research on Grid analysis approaches and codes for large clusters and MPP environments
Approaches to Tools Libraries and Repositories
Develop and adopt software engineering principles and practices for GTL software development; modular, open source
3. Data Analysis and Reduction
Capabilities Needed:
Build models of biology that capture our knowledge, based on a combination of experimental data types, and validate these models, use them to predict. e.g.;
•Build regulatory network topology from observations of protein expression based on conditions
•Build a protein-protein interaction network from protein interaction data of several types
•Build a model for the organization of a protein complex from homology modeling, geometry constraints from mass spec, and cryo-EM images
•Build cell models that combine regulation, metabolism and protein interactions
4. Modeling and Simulation
R & D Challenges
& TechnologiesSynthesis; How to infer or reconstruct systems
from data – build “optimal” model
Metabolic pathways from metabolic data &
genome
Regulatory networks from expression data
Protein interaction networks from binary interaction data
How to integrate different types of data into models
Integration of different imaging modalities
Integration of metabolism, regulation, and protein interactions into cell models
How to derive best interaction networks from raw binary interaction data, cell interaction images, predicted interactions, and co-expression data . . .
4. Modeling and Simulation
•Capture human modesof integrationto automate
R & D Challenges (cont’d)
How to mathematically represent biology – pathways, networks, communities
What’s the right calculus to describe regulation / metabolism / protein interaction networks / signaling / that allows quantitative prediction?
Differential equations?Stochastic or deterministic?Control theory or Ad hoc mathematical networks?Binary or discrete value networks?Chaos theory?“Need for new abstractions”
In what regimes do they work and where they fail?
How do we deal with missing data, incomplete knowledge, or errors?
Are there organizing principles or theory that could make us successful with incomplete knowledge?
How to get to longer compute times for physics based simulations (millisecs and beyond)- steer and sample
4. Modeling and Simulation
Capabilities Needed:
• Provide community access to data, models, simulations, and protocols for GTL. Allow users to query and visualize data, use models, run simulations.
• Community resources for multiple types of data - machines, interactions, process models, expression, regulation, genome annotation, metabolism, regulation,…
• Access to:• data• protocols and methods• analysis tools and user environments • models and simulations
• Access to multiple levels of data - raw data, processed results, dynamic models
• Integrated view of the biology represented
• Guide experimental design strategy for next microbe
“The GTL Knowledge Base”
5. Community Data Resource
R & D Challenges and Technologies
Design and Integration of the major databases
Huge data volumes, great schema complexity - need for new types of databases (hardware and software)
Database technologies – object-relational, graph DBs, …Data standards, representations, ontologies for very complex objects
User Access Systems for browsing, query, visualization, and to run analysis or simulations
Supporting Simulation from DBs - how to allow users to utilize models and run simulations; how to link simulations to underlying data
Integration - Provide integrated view of the biology - With data from other community sources.
Community access to compute power to run long time-scale simulations
IP issues and reward system
How to represent incomplete, sparse, conflicting data
5. Community Data Resource
Objective: Provide hardware and software environments to support analysis, data storage, modeling and simulation activities required in GTL
Examples of Infrastructure:
• Hardware, network and operative system environments for peta-scale data storage and retrieval.
• Grid computing environments to support distributed large-scale data analysis operations.
• Massively parallel architectures for systems simulation.
• Discrete mathematics libraries
6. Infrastructure
http://DOEGenomesToLife.orghttp://DOEGenomesToLife.org