DOE Genomics: GTL Program
IT Infrastructure Needs for Systems Biology

David G. Thomassen
Office of Biological and Environmental Research
DOE Office of Science
March 22, 2004

Experimental:
• Complete datasets
• Quantitative measurements
• Comprehensive physical characterization: protein expression and interactions, spatial distributions, process kinetics

Computational:
• Automated data analysis and validation
• Automated integration of diverse data sets
• Human and computer-accessible databases
• Molecular, pathway and cell-level simulations

The goals require a new synergy between computing and biology.

Ultimate Goal is to Provide Predictive Models of Microbes

This goal drives data collection and computing strategy.

GTL Experiment Template: Generating Petascale Data Sets

Experiment templates for a single microbe

Class of experiment              Time points  Treatments  Conditions  Genetic variants  Biological replication  Total biological samples  Proteomics data (TB)  Metabolite data (TB)  Transcription data (TB)
simple (scratching the surface)  10           1           3           1                 3                       90                        18.0                  13.5                  0.018
moderate                         25           3           5           1                 3                       1125                      225.0                 168.8                 0.225
upper mid                        50           3           5           5                 3                       11250                     2250.0                1687.5                2.25
complex                          20           5           5           20                3                       30000                     6000.0                4500.0                6
real interesting                 20           5           5           50                3                       75000                     15000.0               11250.0               15

Profiling method:
Proteomics: looking at a possible 6000 proteins per microbe, assuming ~200 GB per sample
Metabolites: looking at a panel of 500-1000 different molecules, assuming ~150 GB per sample
Transcription: 6000 genes and 2 arrays per sample, ~100 MB per array

Typically, answering a single significant scientific question takes the multidimensional analysis of at least 1000 biological samples.

While this example does not account for data processing and compression, it illustrates how even simple raw data storage will quickly become a bottleneck for biologists.
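The arithmetic behind the experiment template can be sketched in a few lines of Python. This is only an illustration: the per-sample sizes come from the profiling-method notes, the transcription figure assumes 2 arrays of ~100 MB each (0.2 GB per sample, which is what the table's totals imply), and the class and function names are made up for this sketch.

```python
from dataclasses import dataclass

# Raw data per biological sample, in GB (from the profiling-method notes).
# Transcription assumes 2 arrays x ~100 MB = 0.2 GB per sample.
GB_PER_SAMPLE = {"proteomics": 200.0, "metabolites": 150.0, "transcription": 0.2}


@dataclass(frozen=True)
class ExperimentTemplate:
    name: str
    time_points: int
    treatments: int
    conditions: int
    genetic_variants: int
    replicates: int

    @property
    def total_samples(self) -> int:
        # Every combination of the experimental dimensions is one sample.
        return (self.time_points * self.treatments * self.conditions
                * self.genetic_variants * self.replicates)

    def volume_tb(self, method: str) -> float:
        # Decimal units: 1 TB = 1000 GB.
        return self.total_samples * GB_PER_SAMPLE[method] / 1000.0


TEMPLATES = [
    ExperimentTemplate("simple", 10, 1, 3, 1, 3),
    ExperimentTemplate("moderate", 25, 3, 5, 1, 3),
    ExperimentTemplate("upper mid", 50, 3, 5, 5, 3),
    ExperimentTemplate("complex", 20, 5, 5, 20, 3),
    ExperimentTemplate("real interesting", 20, 5, 5, 50, 3),
]

for t in TEMPLATES:
    volumes = ", ".join(f"{m}: {t.volume_tb(m):g} TB" for m in GB_PER_SAMPLE)
    print(f"{t.name}: {t.total_samples} samples -> {volumes}")
```

Because the sample count is a product of five dimensions, adding genetic variants or time points multiplies the storage requirement rather than adding to it.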

ATCGTAGCAATCGACCGT...CGGCTATAGCCGTTACCG…TTATGCTATCCATAATCGA...GGCTTAATCGCATACGAC...

Capacity: e.g., High-throughput protein structure predictions, data analysis, sequence comparison

Thread onto templates

Best match

Capability: e.g., Large scale biophysical simulations, stochastic regulatory simulations:

Large size and timescale classical simulations

Highly accurate quantum mechanical simulations

GTL Science will Require High Performance Computing for Both Capacity and Capability Problems

Petascale Capacity Problems in Biology

Microbial and Community Genome Annotation

Analyze and annotate 20 microbial genomes - (720,000 processor hours)

Now

In 5 years

Assemble, analyze and annotate community of 200 microbes and phage (10,000,000 processor hours)

Compare genome sequences (200 megabases) to previous genomes (4 gigabases) (5,000,000 processor hours)

Petascale Capability Problems in Biology

Membrane channel simulation

Simulate non-flexible protein ion channel K+ flow using quantum methods (2,200,000 processor hours for 4 second simulation)

Now

In 5 years

Simulate flexible protein ion pump for producing ATP from K+ gradient (15,000,000 processor hours for 200 nanosecond simulation)
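To put the processor-hour figures from the capacity and capability slides in perspective, they can be converted to wall-clock time. The sketch below assumes ideal (or uniformly derated) parallel scaling, and the processor counts are hypothetical machine sizes chosen for illustration, not figures from this talk.

```python
def wall_clock_days(processor_hours: float, processors: int,
                    efficiency: float = 1.0) -> float:
    """Wall-clock days for a job, assuming it scales across `processors`
    with the given parallel efficiency (1.0 = perfect scaling)."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return processor_hours / (processors * efficiency) / 24.0


# Processor-hour estimates quoted on the slides.
WORKLOADS = {
    "annotate 20 microbial genomes": 720_000,
    "annotate community of 200 microbes and phage": 10_000_000,
    "compare 200 Mb against 4 Gb of genomes": 5_000_000,
    "rigid ion channel, quantum methods": 2_200_000,
    "flexible ion pump": 15_000_000,
}

# Hypothetical machine sizes (assumption, for illustration only).
for processors in (1_000, 10_000):
    print(f"--- {processors} processors, perfect scaling ---")
    for name, hours in WORKLOADS.items():
        print(f"{name}: {wall_clock_days(hours, processors):.1f} days")
```

Capacity problems split into many independent jobs, so the perfect-scaling assumption is more realistic for them than for tightly coupled capability simulations, where an `efficiency` well below 1.0 would be a fairer setting.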

Computing Capabilities for GTL Facilities and Projects

1. LIMS & Workflow Management
2. Data Capture and Archiving
3. Data Analysis / Reduction
4. Modeling and Simulation
5. The Community Data Resource
6. Infrastructure

Collaborative Projects

Facilities

High-Performance Computing Roadmap for the Genomics: GTL Program

[Figure: computing power required (1 TF* to 1000 TF) versus biological complexity, with current U.S. computing marked. Problems shown: comparative genomics; constraint-based flexible docking; constrained rigid docking; genome-scale protein threading; community metabolic, regulatory, signaling simulations; molecular machine classical simulation; protein machine interactions; cell, pathway, and network simulation; molecule-based cell simulation.]

*Teraflops

Genomics: GTL – A Vision of Systems Biology Research

In 10-15 years we would like to be able to start with a microbe or microbial community of interest and in a matter of days or weeks:

• Generate an annotated DNA sequence

• Produce proteins and molecular tags for most/all proteins

• Identify the majority of multi protein complexes

• Generate a working regulatory network model

• Identify the biochemical capabilities

• Design reengineering or control strategies in silico

Capabilities Needed:

• Map experimental strategies to distributed resources and instrument protocols

• Coordinate experimental process management across cyber collaboratories

• Track the process - sample tracking metadata

• Dynamically optimize experiment workflow

• Process and controls documentation / QA

• Localize problems with data production quality

• Share process data across facilities or projects

• Make production-scale collaborative science possible

1. LIMS and Workflow Management

Track and capture metadata
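As a minimal sketch of what "sample tracking metadata" could look like in practice, the record below logs each process step a sample passes through, so that a quality problem can be localized to a facility and protocol. All class names, field names, and example values are hypothetical; this is not a description of any actual GTL LIMS.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ProcessStep:
    facility: str
    protocol: str
    operator: str
    timestamp: datetime
    qa_passed: bool
    notes: str = ""


@dataclass
class SampleRecord:
    sample_id: str
    organism: str
    condition: str
    steps: list[ProcessStep] = field(default_factory=list)

    def log_step(self, facility: str, protocol: str, operator: str,
                 qa_passed: bool, notes: str = "") -> None:
        # Append-only history: steps are never edited after the fact.
        self.steps.append(ProcessStep(facility, protocol, operator,
                                      datetime.now(timezone.utc),
                                      qa_passed, notes))

    def first_failure(self) -> Optional[ProcessStep]:
        # Localize a data-quality problem to the earliest failing step.
        return next((s for s in self.steps if not s.qa_passed), None)


# Hypothetical example values.
sample = SampleRecord("S-0001", "microbe-A", "low-oxygen, t=3h")
sample.log_step("facility-1", "cell-culture-v2", "operator-a", qa_passed=True)
sample.log_step("facility-2", "protein-digest-v1", "operator-b",
                qa_passed=False, notes="incomplete digestion")
sample.log_step("facility-2", "ms-acquisition-v4", "operator-b", qa_passed=True)

failure = sample.first_failure()
if failure is not None:
    print(f"{sample.sample_id}: first QA failure at "
          f"{failure.facility}/{failure.protocol} ({failure.notes})")
```

Keeping the step history with the sample, rather than in each instrument's own log, is one way to share process data across facilities.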

R & D Challenges and Technologies

Approaches to coordinated process design, optimization, protocol mapping for a large distributed enterprise

Explore LIMS and workflow management systems technology including commercial systems – modify?

Explore approaches to process documentation and control, QA/QC, and process metadata representation – make data reproducible

Develop collaborative tools, electronic notebooks, and web servers for shared access to laboratory data

1. LIMS and Workflow Management


• Capture bulk data and metadata from many different measurements and instruments in shared large-scale data archives

• Represent complex non-standard data types: mass spectrometry, light microscopy, cryo-EM, expression, biophysical & biochemical characterization data…

• Capture and represent data quality, statistical reliability measures, process metadata

• Support deposition, access, transfer and retrieval for archives of multi-petabyte size
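One way to picture the capture requirements above is an archive catalog entry that travels with each raw data set: it records the data type, quality measures, and process metadata, plus a checksum so that transfer and retrieval can be verified. The schema and field names below are assumptions made for illustration, not a GTL standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ArchiveEntry:
    dataset_id: str
    data_type: str           # e.g. "mass_spectrometry", "cryo_em"
    size_bytes: int
    sha256: str              # integrity check for transfer and retrieval
    quality: dict            # data quality / statistical reliability measures
    process_metadata: dict   # instrument, protocol, sample id, ...


def make_entry(dataset_id: str, data_type: str, payload: bytes,
               quality: dict, process_metadata: dict) -> ArchiveEntry:
    return ArchiveEntry(dataset_id, data_type, len(payload),
                        hashlib.sha256(payload).hexdigest(),
                        quality, process_metadata)


def verify(entry: ArchiveEntry, payload: bytes) -> bool:
    """True if the retrieved bytes match what was deposited."""
    return (len(payload) == entry.size_bytes
            and hashlib.sha256(payload).hexdigest() == entry.sha256)


# Hypothetical deposit: a tiny stand-in for a raw instrument file.
raw = b"m/z,intensity\n445.12,1032\n446.12,211\n"
entry = make_entry("DS-000001", "mass_spectrometry", raw,
                   quality={"signal_to_noise": 41.5, "replicate": 2},
                   process_metadata={"instrument": "ms-01",
                                     "sample_id": "S-0001"})

print(json.dumps(asdict(entry), indent=2))
print("intact after retrieval:", verify(entry, raw))
```

For multi-petabyte archives the checksum would be computed in a streaming fashion rather than over an in-memory `bytes` object; the in-memory form is used here only to keep the sketch short.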

Raw data sets

Swimming in Data


Developing representations and models for data and metadata from many different measurements and assays: confocal images, video, mass spec, 3D cryo-EM, ...

Developing data exchange and format standards for facilities and the community

Hardware infrastructure for rapid and flexible access to very large (petabyte) data volumes. Research new data storage technologies.

Research approaches to design, query and retrieval efficiency in large datasets and with non-standard data types





Process data from instruments such as mass spectrometers, microscopes, NMR, etc., to reduce and analyze data, e.g.:

• Automatically identify interacting protein events in FRET confocal microscopy

• Identify peptides, proteins, PTMs of interest in mass spectrometry data

• Quantitate changes in / cluster expression data from arrays or mass spectrometry

• Compare metabolite levels under different cell conditions
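As a small illustration of the last item, comparing metabolite levels between two cell conditions often starts with a log2 fold change of replicate means. The sketch below uses that common convention; the metabolite names and measurements are invented, and a real analysis would add normalization and a statistical test.

```python
import math
from statistics import mean


def log2_fold_change(control: list[float], treated: list[float]) -> float:
    """log2(mean(treated) / mean(control)) across biological replicates."""
    c, t = mean(control), mean(treated)
    if c <= 0 or t <= 0:
        raise ValueError("mean levels must be positive for a log ratio")
    return math.log2(t / c)


# Hypothetical replicate measurements (arbitrary units).
levels = {
    "metabolite_A": {"control": [9.0, 10.0, 11.0], "treated": [38.0, 40.0, 42.0]},
    "metabolite_B": {"control": [20.0, 22.0, 24.0], "treated": [21.0, 22.0, 23.0]},
    "metabolite_C": {"control": [16.0, 16.0, 16.0], "treated": [4.0, 4.0, 4.0]},
}

changes = {name: log2_fold_change(v["control"], v["treated"])
           for name, v in levels.items()}

# Report the largest changes first, in either direction.
for name, lfc in sorted(changes.items(), key=lambda kv: -abs(kv[1])):
    direction = "up" if lfc > 0 else "down" if lfc < 0 else "unchanged"
    print(f"{name}: log2 fold change {lfc:+.2f} ({direction})")
```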

3. Data Analysis and Reduction

R & D Challenges and technologies

Many types of data, each with its own algorithm research and development challenges for analysis; basic algorithm research is needed, e.g.:

- Automated processing of images and video about protein cell localization to achieve high-throughput analysis

- New mass spectrometry algorithms to identify post-translational modifications, cross-linked peptides, and new proteins (de novo MS), and to automate quantitation

- Analysis of NMR, scattering, AFM, ...

Analysis throughput likely to be an issue; Research on Grid analysis approaches and codes for large clusters and MPP environments

Approaches to Tools Libraries and Repositories

Develop and adopt software engineering principles and practices for GTL software development; modular, open source

3. Data Analysis and Reduction


Build models of biology that capture our knowledge, based on a combination of experimental data types; validate these models and use them to predict, e.g.:

• Build regulatory network topology from observations of protein expression based on conditions

• Build a protein-protein interaction network from protein interaction data of several types

• Build a model for the organization of a protein complex from homology modeling, geometry constraints from mass spec, and cryo-EM images

• Build cell models that combine regulation, metabolism and protein interactions
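To make the second item concrete, here is one simple way to combine interaction evidence of several types into a single network: a noisy-OR rule, where each observation independently supports an edge with a reliability that depends on its evidence type. The noisy-OR choice, the reliability values, and the protein names are all assumptions for this sketch; the slides do not prescribe a method.

```python
from collections import defaultdict

# Hypothetical probability that one observation of each type reflects
# a true interaction (illustrative values only).
RELIABILITY = {
    "two_hybrid": 0.3,
    "pull_down_ms": 0.5,
    "co_expression": 0.1,
}


def build_network(observations, threshold=0.5):
    """Combine evidence per protein pair with a noisy-OR rule.

    observations: iterable of (protein_a, protein_b, evidence_type).
    Returns {frozenset({a, b}): confidence} for edges at or above threshold.
    """
    # Probability that every observation for a pair is a false positive.
    all_wrong = defaultdict(lambda: 1.0)
    for a, b, kind in observations:
        edge = frozenset((a, b))   # undirected: (a, b) == (b, a)
        all_wrong[edge] *= 1.0 - RELIABILITY[kind]
    return {edge: 1.0 - p for edge, p in all_wrong.items()
            if 1.0 - p >= threshold}


observations = [
    ("P1", "P2", "two_hybrid"),
    ("P2", "P1", "pull_down_ms"),   # same pair, independent evidence
    ("P1", "P3", "co_expression"),  # weak evidence only
    ("P2", "P4", "pull_down_ms"),
]

network = build_network(observations)
for edge, confidence in sorted(network.items(), key=lambda kv: -kv[1]):
    print(f"{' - '.join(sorted(edge))}: confidence {confidence:.2f}")
```

The independence assumption behind noisy-OR is rarely exact for real assays, which is part of why deriving the "best" network from mixed evidence is listed as a research challenge.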


R & D Challenges & Technologies

Synthesis: how to infer or reconstruct systems from data – build “optimal” model

Metabolic pathways from metabolic data & genome

Regulatory networks from expression data

Protein interaction networks from binary interaction data

How to integrate different types of data into models

Integration of different imaging modalities

Integration of metabolism, regulation, and protein interactions into cell models

How to derive best interaction networks from raw binary interaction data, cell interaction images, predicted interactions, and co-expression data . . .


• Capture human modes of integration to automate

R & D Challenges (cont’d)

How to mathematically represent biology – pathways, networks, communities

What’s the right calculus to describe regulation / metabolism / protein interaction networks / signaling / that allows quantitative prediction?

Differential equations?
Stochastic or deterministic?
Control theory or ad hoc mathematical networks?
Binary or discrete value networks?
Chaos theory?
“Need for new abstractions”

In what regimes do they work, and where do they fail?

How do we deal with missing data, incomplete knowledge, or errors?

Are there organizing principles or theory that could make us successful with incomplete knowledge?

How to reach longer timescales in physics-based simulations (milliseconds and beyond): steer and sample
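The "stochastic or deterministic?" question above can be illustrated on the smallest possible model: a molecule produced at constant rate k and degraded at rate g per molecule. The sketch below compares the deterministic rate equation dx/dt = k - g*x with a stochastic simulation using Gillespie's direct method. The model and its parameter values are assumptions chosen for illustration, not taken from the talk.

```python
import math
import random


def deterministic(k: float, g: float, x0: float, t: float) -> float:
    """Closed-form solution of dx/dt = k - g*x with x(0) = x0."""
    return k / g + (x0 - k / g) * math.exp(-g * t)


def gillespie(k: float, g: float, x0: int, t_end: float, rng: random.Random):
    """Gillespie direct method for production (rate k) and
    degradation (rate g * x). Returns event times and copy numbers."""
    t, x = 0.0, x0
    times, counts = [0.0], [x0]
    while True:
        rate_birth, rate_death = k, g * x
        total = rate_birth + rate_death
        t += rng.expovariate(total)        # waiting time to next event
        if t >= t_end:
            break
        if rng.random() * total < rate_birth:
            x += 1                         # production event
        else:
            x -= 1                         # degradation event
        times.append(t)
        counts.append(x)
    return times, counts


def time_average(times, counts, t_end: float) -> float:
    """Time-weighted mean copy number over [0, t_end]."""
    total = 0.0
    for i, x in enumerate(counts):
        t_next = times[i + 1] if i + 1 < len(times) else t_end
        total += x * (t_next - times[i])
    return total / t_end


k, g, t_end = 10.0, 1.0, 200.0             # illustrative parameters
times, counts = gillespie(k, g, 0, t_end, random.Random(1))
print("deterministic steady state:", deterministic(k, g, 0.0, t_end))
print("stochastic time average:   ", time_average(times, counts, t_end))
print("stochastic min/max copies: ", min(counts), max(counts))
```

The two descriptions agree on the average but only the stochastic one shows the fluctuations, which become important when copy numbers are small, as is common for regulatory molecules in a single cell.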



• Provide community access to data, models, simulations, and protocols for GTL. Allow users to query and visualize data, use models, run simulations.

• Community resources for multiple types of data - machines, interactions, process models, expression, regulation, genome annotation, metabolism,…

• Access to:
  • data
  • protocols and methods
  • analysis tools and user environments
  • models and simulations

• Access to multiple levels of data - raw data, processed results, dynamic models

• Integrated view of the biology represented

• Guide experimental design strategy for next microbe

“The GTL Knowledge Base”

5. Community Data Resource


Design and Integration of the major databases

Huge data volumes, great schema complexity - need for new types of databases (hardware and software)

Database technologies – object-relational, graph DBs, …

Data standards, representations, ontologies for very complex objects

User Access Systems for browsing, query, visualization, and to run analysis or simulations

Supporting Simulation from DBs - how to allow users to utilize models and run simulations; how to link simulations to underlying data

Integration - provide an integrated view of the biology, together with data from other community sources

Community access to compute power to run long time-scale simulations

IP issues and reward system

How to represent incomplete, sparse, conflicting data

5. Community Data Resource

Objective: Provide hardware and software environments to support analysis, data storage, modeling and simulation activities required in GTL

Examples of Infrastructure:

• Hardware, network and operating system environments for petascale data storage and retrieval.

• Grid computing environments to support distributed large-scale data analysis operations.

• Massively parallel architectures for systems simulation.

• Discrete mathematics libraries

6. Infrastructure

http://DOEGenomesToLife.org