Accelerating data-intensive science by outsourcing the mundane

DESCRIPTION

Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!) Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking about them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.

Page 1: Accelerating data-intensive science by outsourcing the mundane

www.ci.anl.gov www.ci.uchicago.edu

Accelerating data-intensive science by outsourcing the mundane

Ian Foster

Alfred North Whitehead (1911)

"Civilization advances by extending the number of important operations which we can perform without thinking about them"

J.C.R. Licklider reflects on thinking (1960)

About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know

For example … (Licklider again)

At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.

Research hasn’t changed much in 300 years

• Pose question
• Design experiment
• Collect data
• Analyze data
• Identify patterns
• Hypothesize explanation
• Test hypotheses
• Publish results

Discovery 1960: Data collection dominates

Janet Rowley: chromosome translocations and cancer

Discovery 2010: Data overflows

800,000,000,000 bases/day

30,000,000,000,000 bases/year

Meanwhile, we drown in administrivia

42%!!

The Federal Demonstration Partnership’s faculty burden survey

You can run a company from a coffee shop

Varieties of “* as a Service” (*aaS)

• SaaS (Software): Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
• PaaS (Platform)
• IaaS (Infrastructure)

Varieties of * as a service (*aaS)

• SaaS (Software): Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
• PaaS (Platform)
• IaaS (Infrastructure): Amazon, GoGrid, Microsoft, Flexiscale, …

Varieties of * as a service (*aaS)

• SaaS (Software): Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
• PaaS (Platform): Google, Microsoft, Amazon, …
• IaaS (Infrastructure): Amazon, GoGrid, Microsoft, Flexiscale, …

Perform important tasks without thinking

• Web presence
• Email (hosted Exchange)
• Calendar
• Telephony (hosted VOIP)
• Human resources and payroll
• Accounting
• Customer relationship mgmt
• Data analytics
• Content distribution

IaaS

Perform important tasks without thinking

• Web presence
• Email (hosted Exchange)
• Calendar
• Telephony (hosted VOIP)
• Human resources and payroll
• Accounting
• Customer relationship mgmt
• Data analytics
• Content distribution

SaaS

IaaS

What about small and medium labs?

Research IT is a growing burden

Big projects can build sophisticated solutions to IT problems

Small labs and collaborations have problems with both

They need solutions, not toolkits—ideally outsourced solutions

Medium science: Dark Energy Survey

• Every night, they receive 100,000 files in Illinois

• They transmit these files to Texas for analysis (35 msec latency)

• Then move the results back to Illinois

• This whole process must run reliably & routinely
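A rough sense of why the 35 msec latency matters for 100,000 files: the sketch below estimates the time spent purely waiting on the network, ignoring bandwidth. It assumes (the slide does not say this) that each file costs at least one 35 msec delay before its data flows, and the function name and parameters are hypothetical.

```python
def latency_overhead_seconds(num_files: int, latency_s: float,
                             round_trips_per_file: int = 1,
                             concurrency: int = 1) -> float:
    """Seconds spent waiting on latency alone, ignoring bandwidth.

    Assumes every file pays `round_trips_per_file` delays of `latency_s`,
    and that `concurrency` files are in flight at any one time.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return num_files * round_trips_per_file * latency_s / concurrency


# Figures from the slide: 100,000 files per night, 35 msec latency.
serial = latency_overhead_seconds(100_000, 0.035)
parallel = latency_overhead_seconds(100_000, 0.035, concurrency=16)
print(f"one at a time: {serial / 60:.1f} min, 16 in flight: {parallel / 60:.1f} min")
```

Under these assumptions, sending the files strictly one at a time wastes close to an hour per night on latency alone, which is one reason a transfer service would keep many files in flight at once.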

Image credit: Roger Smith/NOAO/AURA/NSF

Blanco 4m on Cerro Tololo

Open transfer sockets vs. time

[Image: Don Petravick, NCSA]

A new approach to research IT

Goal: Accelerate discovery and innovation worldwide by providing research IT as a service

Leverage software-as-a-service (SaaS) to

• provide millions of researchers with unprecedented access to powerful research tools, and

• enable a massive shortening of cycle times in time-consuming research processes

Time-consuming tasks in science

• Run experiments
• Collect data
• Manage data
• Move data
• Acquire computers
• Analyze data
• Run simulations
• Compare experiment with simulation
• Search the literature
• Communicate with colleagues
• Publish papers
• Find, configure, install relevant software
• Find, access, analyze relevant data
• Order supplies
• Write proposals
• Write reports
• …

Data movement can be surprisingly difficult

From A to B:

Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, …
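To make one item from that list concrete, "detect and respond to failures", here is a minimal sketch of retrying a transfer with exponential backoff. It is an illustration only, not how Globus Online is implemented: the `transfer` callable, the choice of `OSError` as the failure signal, and the delay schedule are all assumptions.

```python
import time
from typing import Callable


def transfer_with_retries(transfer: Callable[[], None],
                          max_attempts: int = 5,
                          base_delay_s: float = 1.0,
                          sleep: Callable[[float], None] = time.sleep) -> int:
    """Call transfer() until it succeeds; return the number of attempts used.

    A failure is assumed to surface as OSError. After each failure the wait
    doubles (1 s, 2 s, 4 s, ...). The final failure is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            transfer()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Demo with a hypothetical transfer that fails twice, then succeeds.
_state = {"calls": 0}


def _flaky_transfer() -> None:
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise OSError("simulated network failure")


waits = []  # record the delays instead of actually sleeping
print(transfer_with_retries(_flaky_transfer, sleep=waits.append), waits)
```

Even this one task needs decisions about what counts as a failure and how long to wait, which hints at why the full list is a burden for a small lab.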

Grid (aka federation) as a service

Globus Toolkit: Build the Grid

Components for building custom grid solutions

globustoolkit.org

Globus Online: Use the Grid

Cloud-hosted file transfer service

globusonline.org

Globus Online’s Web 2.0 architecture

• Fire-and-forget data movement
• Many files and lots of data
• Credential management
• Performance optimization
• Expert operations and monitoring

Web interface

HTTP REST interface
    POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>

Command line interface
    ls alcf#dtn:/
    scp alcf#dtn:/myfile \
        nersc#dtn:/myfile

GridFTP servers

FTP servers

High-performance data transfer nodes

Globus Connect on local computers
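The command line examples on this slide name files as `alcf#dtn:/myfile`. As a minimal sketch of how such a string breaks down, the parser below splits it at `#` and `:`. The field names `owner`, `endpoint`, and `path` are my own labels for the three parts, not terms taken from the slide, and the validation rules are assumptions.

```python
from typing import NamedTuple


class EndpointPath(NamedTuple):
    owner: str     # text before '#', e.g. "alcf" (label is an assumption)
    endpoint: str  # text between '#' and ':', e.g. "dtn"
    path: str      # text after ':', e.g. "/myfile"


def parse_endpoint_path(spec: str) -> EndpointPath:
    """Split a string of the form 'owner#endpoint:/path' into its parts."""
    location, colon, path = spec.partition(":")
    owner, hash_mark, endpoint = location.partition("#")
    if not (colon and hash_mark and owner and endpoint and path.startswith("/")):
        raise ValueError(f"expected owner#endpoint:/path, got {spec!r}")
    return EndpointPath(owner, endpoint, path)


# The source and destination from the slide's scp example.
src = parse_endpoint_path("alcf#dtn:/myfile")
dst = parse_endpoint_path("nersc#dtn:/myfile")
print(src, dst)
```

The point of the notation is that a user names two endpoints and a path, and the service works out protocols, credentials, and tuning behind that name.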

Globus Connect to/from your laptop

Almost always faster than other methods

[Figure: transfer rate in bytes/sec (log scale, 1E+03 to 1E+09) versus megabytes per file (0.001 to 1000) between Argonne and NERSC; series: go, guc, scp, tuned guc]

Monitoring provides deep visibility

Globus Online runs on the cloud

Data movers scale well on Amazon

SaaS facilitates troubleshooting

11 x 125 files, 200 MB each

11 users, 12 sites

Moving 586 Terabytes in two weeks
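For scale, the headline figure can be turned into an average sustained rate. This is a back-of-the-envelope sketch that assumes decimal terabytes (10^12 bytes) and exactly 14 days of continuous transfer; the slide states neither, and the helper name is hypothetical.

```python
def average_rate_bytes_per_s(total_bytes: float, days: float) -> float:
    """Average transfer rate if total_bytes moved steadily over `days` days."""
    return total_bytes / (days * 24 * 60 * 60)


TB = 10**12  # decimal terabyte; an assumption, the slide does not specify

rate = average_rate_bytes_per_s(586 * TB, 14)
print(f"{rate / 1e6:.0f} MB/s, or {rate * 8 / 1e9:.2f} Gbit/s")
```

Under these assumptions the average works out to a little under 4 Gbit/s, sustained around the clock for two weeks.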

NSF XSEDE architecture incorporates Globus Toolkit and Globus Online

Next steps: Outsource additional activities

• Pose question
• Design experiment
• Collect data
• Analyze data
• Identify patterns
• Hypothesize explanation
• Test hypotheses
• Publish results

A use case for the next steps

• Medical image data is acquired at multiple sites
• Uploaded to a commercial cloud
• Quality control algorithms applied
• Anonymization procedures applied
• Metadata extracted and stored
• Access granted to clinical trial team
• Interactive access and analysis
• More metadata generated and stored
• Access granted to subset of data for education
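The middle of this use case (quality control, anonymization, metadata extraction) is an ordered pipeline applied to each uploaded image. The sketch below shows that shape only. Every stage body, field name, and the `history` record is hypothetical, invented for illustration; none of it describes an actual Globus Online feature.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Stage = Tuple[str, Callable[[Record], Record]]


def quality_control(rec: Record) -> Record:
    # Hypothetical check: reject images with no pixel payload.
    if not rec.get("pixels"):
        raise ValueError("empty image")
    return rec


def anonymize(rec: Record) -> Record:
    # Drop fields that identify the patient (field names are assumptions).
    return {k: v for k, v in rec.items() if k not in {"patient_name", "birth_date"}}


def extract_metadata(rec: Record) -> Record:
    # Record where the image came from and how large it is.
    return {**rec, "metadata": {"site": rec.get("site"), "size": len(rec["pixels"])}}


PIPELINE: List[Stage] = [
    ("quality control", quality_control),
    ("anonymization", anonymize),
    ("metadata extraction", extract_metadata),
]


def run_pipeline(rec: Record, stages: List[Stage] = PIPELINE) -> Record:
    """Apply each stage in order, noting which stages have run."""
    for name, stage in stages:
        rec = stage(rec)
        rec = {**rec, "history": [*rec.get("history", []), name]}
    return rec


image = {"site": "site-1", "patient_name": "example", "pixels": b"\x00\x01\x02"}
print(run_pipeline(image))
```

Keeping the stages as an ordered list makes it easy to add later steps, such as access grants, without touching the existing ones.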

Required building blocks

• Group management for data sharing
  – Scheduled September, 2011, for BIRN biomedical

• Metadata management
  – Create, update, query a hosted metadata catalog

• Data publication workflows
  – Data movement, naming, metadata operations, etc.

• Cloud storage access
  – And HTTP, WebDAV, SRM, iRODS, …

• Computation on shared data
  – E.g., via Galaxy workflow system

www.globusonline.org

Summary

• To accelerate discovery, automate the mundane

• Data-intensive computing is particularly full of mundane tasks

• Outsourcing complexity to SaaS providers is a promising route to automation

• Globus Online is an early experiment in SaaS for science

For more information

• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.

• Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Preprint CI-PP-05-0611, Computation Institute, 2011.

Thank you!
