Ian Foster Computation Institute Argonne National Lab & University of Chicago New computing platforms for data-intensive science

Aaas Data Intensive Science And Grid

DESCRIPTION

These slides were presented in a session that we organized at the American Association for the Advancement of Science (AAAS) meeting in Chicago, February 2009.

Abstract: New laboratory devices, sensor networks, high-throughput instruments, and numerical simulation systems are producing data at rates that are both without precedent and rapidly growing. The resulting increases in the size, number, and variety of data are revolutionizing scientific practice. These changes demand new computing infrastructures and tools. Until recently, most laboratories and collaborations managed their own data, operated their own computers, and used remote high-performance computers only when required. We are moving to a paradigm in which data will primarily be located and managed on remote clusters, grids, and data centers. In this symposium, we will examine the computing infrastructure designed to serve this emerging era of data-intensive computing from three perspectives: (1) that of grid computing, which enables the creation of virtual organizations that can share remote and distributed resources over the Internet; (2) that of data centers, which are transitioning to providers of integrated storage, data, compute, and collaboration services (the offering of one or more of these integrated services over the Internet is beginning to be called cloud computing); and (3) that of e-science, in which grids, Web 2.0 technologies, and new collaboration and analysis services are merging and changing the way science is conducted. Each speaker will focus on one perspective but also compare and contrast with the others.

Citation preview

Page 1: Aaas Data Intensive Science And Grid

Ian Foster

Computation Institute

Argonne National Lab & University of Chicago

New computing platforms for data-intensive science

Page 2: Aaas Data Intensive Science And Grid

Page 3: Aaas Data Intensive Science And Grid

Growth of GenBank (1982-2005)

Broad Institute

Page 4: Aaas Data Intensive Science And Grid

Proteomics, Genomics, Transcriptomics, Protein sequence prediction, Phenotypic studies, Phylogeny, Sequence analysis, Protein structure prediction, Protein-protein interaction, Metabolomics, Model organism collections, Systems biology, Health epidemiology, Organisms, Disease, …

1070 molecular bio databases, Nucleic Acids Research, Jan 2008 (96 in Jan 2001)

Slide: Carole Goble

Page 5: Aaas Data Intensive Science And Grid

New problem solving methodologies

<0 1700 1950 1990

Empirical

Data

Theory

Simulation

“Applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework and exploratory apparatus for other sciences”

– G. Djorgovski

Page 6: Aaas Data Intensive Science And Grid

Page 7: Aaas Data Intensive Science And Grid

More data does not always mean more knowledge

Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.

Page 8: Aaas Data Intensive Science And Grid

Data is enormous
Infrastructure: storage & computing; economies of scale

Data is distributed
Aggregation: data & software; people & disciplines

Data is noisy
Algorithms: scalable, probabilistic; errors & ambiguity

Cloud

Grid

Page 9: Aaas Data Intensive Science And Grid

Data

An incomplete list of process steps

Discover

Access

Integrate

Analyze

Mine

Publish

Annotate

Validate

Curate

Share

Artisanal

Industrial

Data

Analyses

Models

Experiments

Literature

Page 10: Aaas Data Intensive Science And Grid

SOA as an integrating framework?

We expose data and software as services …

which others discover, decide to use, …

and compose to create new functions ...

which they publish as new services.
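The publish, discover, compose, republish cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Registry`, `compose`, the service names, and the toy sequence record are all hypothetical and stand in for real service infrastructure, which the slides do not specify.

```python
from typing import Callable, Dict, List, Set

# A "service" here is just a one-argument callable (an assumption for brevity).
Service = Callable[[object], object]


class Registry:
    """Hypothetical in-memory stand-in for a service registry."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}
        self._tags: Dict[str, Set[str]] = {}

    def publish(self, name: str, func: Service, tags: List[str]) -> None:
        """Expose a function as a named, tagged service."""
        self._services[name] = func
        self._tags[name] = set(tags)

    def discover(self, tag: str) -> List[str]:
        """Return the names of all services carrying the given tag."""
        return sorted(n for n, t in self._tags.items() if tag in t)

    def get(self, name: str) -> Service:
        return self._services[name]


def compose(*stages: Service) -> Service:
    """Chain services so that each stage consumes the previous output."""
    def pipeline(value: object) -> object:
        for stage in stages:
            value = stage(value)
        return value
    return pipeline


registry = Registry()

# Expose data and software as services (toy examples with made-up data).
registry.publish("sequence-lookup",
                 lambda key: {"id": key, "seq": "ACGTAC"}, ["data"])
registry.publish("gc-content",
                 lambda rec: sum(b in "GC" for b in rec["seq"]) / len(rec["seq"]),
                 ["analysis"])

# Others discover the services and decide to use them ...
lookup = registry.get(registry.discover("data")[0])
gc_content = registry.get(registry.discover("analysis")[0])

# ... compose them into a new function, and publish that as a new service.
registry.publish("gc-of-id", compose(lookup, gc_content),
                 ["analysis", "composite"])

result = registry.get("gc-of-id")("demo-1")
```

The composite is registered through the same `publish` call as the primitive services, so it can be discovered and composed again in turn.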

Technical …
• Complexity
• Semantics
• Distribution
• Scale

… and socio-technical challenges
• Incentives
• Policy, trust
• Reproducibility
• Life cycle

“Service-oriented science”, Science, 2005

Page 11: Aaas Data Intensive Science And Grid

Grid technology

Page 12: Aaas Data Intensive Science And Grid

NAE Grand Challenges

Page 13: Aaas Data Intensive Science And Grid

The future of multi-site data integration: An example

fMRI

Are positive symptom schizophrenics associated with more severe superior temporal gyrus dysfunction?

Receptor Density

ERP

Web

PubMed, Expasy, Brain Map, etc.

Structure

Clinical

Portal

[Chart: bar labels 0.15, 0.18, 0.14, 0.11; axis ticks from -0.14 to 0.30; x-axis: Treatment Group (ARIP - 20MG, ARIP - 30MG, RISP - 06MG, PLACEBO)]

Page 14: Aaas Data Intensive Science And Grid

caBIG: sharing of infrastructure, applications, and data.

Aggregation in cancer biology

Globus

Page 15: Aaas Data Intensive Science And Grid

As of Feb 16, 2009

123 participants

104 services: 65 data, 39 analytical

Page 16: Aaas Data Intensive Science And Grid

Microarray clustering in caBIG

1. Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub

2. Normalize microarray data using GenePattern analytical service: node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService

3. Hierarchical clustering using geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage

Workflow in/output

caGrid services

“Shim” services

others

Wei Tan (Taverna workflow)
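The three workflow steps (retrieve, normalize, cluster) can be mimicked locally to show how data flows between them. The sketch below does not call the caGrid services or Taverna. Each function is a hypothetical local stand-in, the expression values are made up, and z-scoring and average linkage are assumptions chosen for illustration, since the slide does not say which normalization or linkage the real services use.

```python
from math import sqrt
from typing import Dict, List, Tuple

Profile = List[float]
Cluster = Tuple[str, ...]


def retrieve_microarray() -> Dict[str, Profile]:
    """Step 1 stand-in: return made-up expression values per gene."""
    return {
        "geneA": [1.0, 2.0, 3.0],
        "geneB": [2.0, 4.0, 6.0],
        "geneC": [9.0, 1.0, 5.0],
        "geneD": [8.0, 2.0, 6.0],
    }


def normalize(data: Dict[str, Profile]) -> Dict[str, Profile]:
    """Step 2 stand-in: z-score each gene's profile (assumed method)."""
    out = {}
    for gene, values in data.items():
        mean = sum(values) / len(values)
        sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        out[gene] = [(v - mean) / sd if sd else 0.0 for v in values]
    return out


def distance(a: Profile, b: Profile) -> float:
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clustering(
    data: Dict[str, Profile],
) -> List[Tuple[Cluster, Cluster]]:
    """Step 3 stand-in: average-linkage agglomerative clustering.

    Returns the merge history, closest pair of clusters first.
    """
    clusters: List[Cluster] = [(gene,) for gene in sorted(data)]
    merges: List[Tuple[Cluster, Cluster]] = []

    def linkage(c1: Cluster, c2: Cluster) -> float:
        # Average distance over all cross-cluster gene pairs.
        total = sum(distance(data[g1], data[g2]) for g1 in c1 for g2 in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Chain the three steps, as the workflow does with remote services.
merges = hierarchical_clustering(normalize(retrieve_microarray()))
```

With this toy data, normalization makes `geneA` and `geneB` nearly identical, so they merge first, followed by `geneC` and `geneD`.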

Page 17: Aaas Data Intensive Science And Grid

Children’s Oncology Group clinical imaging trials (Erberich)

Page 18: Aaas Data Intensive Science And Grid

Wide-area medical interface service

Converts local medical workflow actions into wide area operations
• Image workflow, EHR, …

Transparently manages federation of
• Security
• Data replication and recovery
• Data discovery

Enterprise/Grid Interface Service

DICOM Protocols

Grid Protocols (Web services)

DICOM

XDS

HL7

Vendor Specific

Wide Area Service Actor

Plug-in Adapters
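One way to picture the plug-in adapter idea is a table of per-protocol converters that turn a local action into a common wide-area operation. The sketch below is an assumption-laden illustration, not the service's actual design: the `adapter` decorator, the dictionary-shaped operations, and field names such as `study_id` and `patient_id` are hypothetical, and no real DICOM, HL7, or grid message formats are modeled.

```python
from typing import Callable, Dict

# Hypothetical wide-area operation: a plain dict standing in for a
# Web-services request. This is not a real grid or DICOM message format.
WideAreaOp = Dict[str, str]
Adapter = Callable[[Dict[str, str]], WideAreaOp]

# Registry of plug-in adapters, keyed by local protocol name.
ADAPTERS: Dict[str, Adapter] = {}


def adapter(protocol: str) -> Callable[[Adapter], Adapter]:
    """Decorator that plugs a converter in under a protocol name."""
    def register(func: Adapter) -> Adapter:
        ADAPTERS[protocol] = func
        return func
    return register


@adapter("DICOM")
def from_dicom(action: Dict[str, str]) -> WideAreaOp:
    # Hypothetical mapping of a local imaging action to a wide-area operation.
    return {"operation": "store_image", "resource": action["study_id"]}


@adapter("HL7")
def from_hl7(action: Dict[str, str]) -> WideAreaOp:
    # Hypothetical mapping of a local record update to a wide-area operation.
    return {"operation": "update_record", "resource": action["patient_id"]}


def to_wide_area(protocol: str, action: Dict[str, str]) -> WideAreaOp:
    """Convert a local workflow action using the adapter for its protocol."""
    try:
        convert = ADAPTERS[protocol]
    except KeyError:
        raise ValueError(f"no adapter registered for {protocol!r}") from None
    return convert(action)


op = to_wide_area("DICOM", {"study_id": "study-42"})
```

Supporting another local protocol, such as XDS or a vendor-specific one, would then mean registering one more converter rather than changing the wide-area side.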

Page 19: Aaas Data Intensive Science And Grid

Main ESG Portal
• 198 TB of data at four locations
• 1,150 datasets
• 1,032,000 files
• Includes the past 6 years of joint DOE/NSF climate modeling experiments
• 8,000 registered users
• Downloads to date: 49 TB, 176,000 files

CMIP3 (IPCC AR4) ESG Portal
• 35 TB of data at one location
• 74,700 files
• Generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change
• Data from 13 countries, representing 25 models
• 1,900 registered projects
• Downloads to date: 387 TB, 1,300,000 files, 500 GB/day (average)
• 400 scientific papers published to date based on analysis of CMIP3 (IPCC AR4) data

Earth System Grid

ESG usage: over 500 sites worldwide

ESG monthly download volumes

Globus

www.earthsystemgrid.org

Page 20: Aaas Data Intensive Science And Grid

Understanding interactions between human and natural systems

IPCC Emissions scenarios

Numerical Simulations

IPCC 4th Assessment

2007

IPCC process: Bill Collins, LBNL

Mitigation

Adaptation

Page 21: Aaas Data Intensive Science And Grid

A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH)

Dynamics, foresight, uncertainty, resolution, …

Agriculture, transport, taxation, …

Data (global, local, …)

(Super)computers

CIM-EARTH Framework

Community process

Open code, data

www.cim-earth.org

Page 22: Aaas Data Intensive Science And Grid

Alleviating Poverty in Thailand: Modeling Entrepreneurship

Consider only wealth, access to capital

Consider also distance to 6 major cities

Rob Townsend, Tibi Stef-Praun, Victor Zhorin

Match

High

Low

Page 23: Aaas Data Intensive Science And Grid

Data is enormous
Infrastructure: storage & computing; economies of scale

Data is distributed
Aggregation: data & software; people & disciplines

Data is noisy
Algorithms: scalable, probabilistic; errors & ambiguity

Cloud

Grid

Page 24: Aaas Data Intensive Science And Grid

Computation Institute

www.ci.uchicago.edu

Thank you!