Ian Foster Accelerating data-driven discovery in energy science Distinguished Fellow

Accelerating Data-driven Discovery in Energy Science


Page 1: Accelerating Data-driven Discovery in Energy Science

Ian Foster

Accelerating data-driven discovery in energy science

Distinguished Fellow

Page 2: Accelerating Data-driven Discovery in Energy Science

Life Sciences and Biology

Advanced Materials

Condensed Matter Physics

Chemistry and Catalysis

Soft Materials

Environmental and Geo Sciences

Can we determine pathways that lead to novel states and nonequilibrium assemblies?

Can we observe – and control – nanoscale chemical transformations in macroscopic systems?

Can we create new materials with extraordinary properties – by engineering defects at the atomic scale?

Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems?

Can we unravel the secrets of biological function – across length scales?

Can we understand physical and chemical processes in the most extreme environments?


New tools are needed to answer the most pressing scientific questions

Page 3: Accelerating Data-driven Discovery in Energy Science

The resulting data deluge

Spans biology, climate, cosmology, materials, physics, urban sciences, …

Simulation data: Petascale to exascale simulations; simulation datasets as laboratories; high-throughput characterization; etc.

Experimental data: Light sources, genome sequencing, next-gen ARM radar, sky surveys, high-throughput experiments, etc.

New research methods that depend on coupling
1) Of computation and experiment
2) Across data sources and types
- inverse problems, computer control
- knowledge integration, analysis

Page 4: Accelerating Data-driven Discovery in Energy Science

Scientific progress requires collaborative discovery engines

informatics analysis

high-throughput experiments

problem specification

modeling and simulation

analysis & visualization

experimental design

analysis & visualization

Integrated databases

Rick Stevens

Page 5: Accelerating Data-driven Discovery in Energy Science

Example: A discovery engine for disordered structures

Diffuse scattering images from Ray Osborn et al., Argonne

Sample

Experimental scattering

Material composition: La 60%, Sr 40%

Simulated structure

Simulated scattering

Detect errors (secs to mins)

Knowledge base: past experiments; simulations; literature; expert knowledge

Select experiments (mins to hours)

Contribute to knowledge base

Simulations driven by experiments (mins to days)

Knowledge-driven decision making

Evolutionary optimization

Page 6: Accelerating Data-driven Discovery in Energy Science

Accelerating data-driven discovery in energy science

(1) Eliminate data friction

Page 7: Accelerating Data-driven Discovery in Energy Science

Eliminating data friction is essential to modern science

Civilization advances by extending the number of important operations which we can perform without thinking about them (Whitehead, 1912)

Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports, 2005-2015)

Page 8: Accelerating Data-driven Discovery in Energy Science

Software as a service (SaaS) as lubricant

Customer relationship management (CRM):
• A knowledge-intensive process
• Historically, handled manually or via expensive, inflexible on-premise software

SaaS has revolutionized how CRM is consumed
• Outsource to provider who runs software on cloud
• Access via simple interfaces

[Comparison chart, SaaS vs. on-premise: ease of use, cost, flexibility, complexity]

Page 9: Accelerating Data-driven Discovery in Energy Science

Globus: Research data management as a service

Essential research data management services
• File transfer
• Data sharing
• Data publication
• Identity and groups

Builds on 15 years of DOE research

Outsourced and automated
• High availability, reliability, performance, scalability

Convenient for
• Casual users: Web interfaces
• Power users: APIs
• Administrators: Install, manage

globus.org

Page 10: Accelerating Data-driven Discovery in Energy Science


“I need to easily, quickly, & reliably move data to other locations.”

Research Computing HPC Cluster

Lab Server

Campus Home Filesystem

Desktop Workstation

Personal Laptop

DOE supercomputer

Public Cloud

Page 11: Accelerating Data-driven Discovery in Energy Science


“I need to get data from a scientific instrument to my analysis system.”

Next-Gen Sequencer

Light Sheet Microscope

MRI

Advanced Light Source

Page 12: Accelerating Data-driven Discovery in Energy Science


“I need to easily and securely share my data with my colleagues.”

Page 13: Accelerating Data-driven Discovery in Energy Science


Globus and the research data lifecycle

1: Researcher initiates transfer request; or requested automatically by script, science gateway

2: Globus transfers files reliably, securely

3: Researcher selects files to share, selects user or group, and sets access permissions

4: Globus controls access to shared files on existing storage; no need to move files to cloud storage!

5: Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus

6: Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)

7: Curator reviews and approves; data set published on campus or other system

8: Peers, collaborators search and discover datasets; transfer and share using Globus

Locations: Instrument, Compute Facility, Personal Computer, Publication Repository

Stages: Transfer, Share, Publish, Discover

• SaaS: only a web browser required

• Use storage system of your choice

• Access using your campus credentials
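Steps 3 to 5 of the lifecycle (set permissions, control access in place, collaborator reads the shared files) can be pictured with a small model. This is only a sketch under stated assumptions: `SharedFolder`, `grant`, `allows`, the `"r"`/`"rw"` permission strings, and the example names are hypothetical and are not the Globus API; the point is simply that access is decided from a permission list while the files stay on their existing storage.

```python
from dataclasses import dataclass, field


@dataclass
class SharedFolder:
    """Hypothetical model of a folder shared in place on existing storage."""
    path: str
    acl: dict[str, str] = field(default_factory=dict)  # principal -> "r" or "rw"

    def grant(self, principal: str, permission: str) -> None:
        # Step 3: the researcher picks a user or group and sets its permission.
        if permission not in ("r", "rw"):
            raise ValueError("permission must be 'r' or 'rw'")
        self.acl[principal] = permission

    def allows(self, user: str, groups: set[str], action: str) -> bool:
        # Step 4: access is decided from the ACL; the files themselves never move.
        needed = "w" if action == "write" else "r"
        principals = {user} | groups
        return any(needed in self.acl.get(p, "") for p in principals)


# Illustrative names only.
share = SharedFolder("/data/run42")
share.grant("group:beamline-team", "r")
share.grant("alice", "rw")
```

A collaborator who belongs to `group:beamline-team` can then read but not write, while a user with no matching entry is refused.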

Page 14: Accelerating Data-driven Discovery in Energy Science

Globus at a glance

4 major services

13 national labs use Globus services

100 PB (petabytes) transferred

8,000 active endpoints

20 billion files processed

>300 users are active daily

25,000 registered users

99.95% uptime over the past two years

>30 subscribers

The biggest transfer to date is 1 petabyte

The longest-running transfer to date took 3 months

We’re eager to learn what you want to do with Globus services

Page 15: Accelerating Data-driven Discovery in Energy Science


One APS node connects to 125 locations through mid-2014

Page 16: Accelerating Data-driven Discovery in Energy Science

Same node (1 Gbps link)

Page 17: Accelerating Data-driven Discovery in Energy Science

Globus and DOE: Terabytes per month

Page 18: Accelerating Data-driven Discovery in Energy Science

Globus and DOE: Running total terabytes

Page 19: Accelerating Data-driven Discovery in Energy Science

Globus and DOE: Active users per month

Page 20: Accelerating Data-driven Discovery in Energy Science

Response has been gratifying

"Really great software." - Benjamin Mayer, Research Associate, Climate Change Science Institute, Oak Ridge National Laboratory

"Whoa! Transfer from NERSC to BNOC (data transfer node) using Globus is screaming!" - Gary Bates, Professional Research Assistant, NOAA

"…Now my users have a fast, easy way to get their data wherever it needs to go, and the setup process was trivial." - Brock Palen, Associate Director, University of Michigan Advanced Research Computing

"... we just had a 153TB transfer that got 20Gb/s and another with 144TB at 25Gb/s! That's pretty insane!" - Jason Alt, Systems Management and Development Lead at National Center for Supercomputing Applications

"We were thrilled by how well Globus worked. We've never seen such high transfer rates, and the service was trivial to install and use." - Dale Land, IT Chief Engineer, Los Alamos National Laboratory

"The system is reliable and secure - and also amazingly easy to use. …It just works." - David Skinner, NERSC user

"I moved 400 GB of files and didn’t even have to think about it." - Jeff Porter, STAR Experiment, Lawrence Berkeley National Lab

"We have been extremely impressed with Globus and how easy it is to use." - Pete Eby, Linux System Administrator, Oak Ridge National Laboratory

"Drag and drop archiving is an incredibly useful feature." - Shreyas Cholia, NERSC user

"The time before Globus now seems like the dark ages!" - Galen Arnold, Systems Engineer, NCSA and Blue Waters PRAC support team, NCSA

 

 

Page 21: Accelerating Data-driven Discovery in Energy Science


Globus service APIs serve as a science platform

Identity, Group, and Profile Management

… Globus Toolkit

Globus APIs

Globus Connect

Data Publication & Discovery

File Sharing

File Transfer & Replication

Page 22: Accelerating Data-driven Discovery in Energy Science

Globus platform services enable new application capabilities

Page 23: Accelerating Data-driven Discovery in Energy Science

Publication as service for ACME

Page 24: Accelerating Data-driven Discovery in Energy Science

Globus platform accelerates development of new services

Page 25: Accelerating Data-driven Discovery in Energy Science

Operating a sustainable service

Globus is a not-for-profit service for researchers

We adopt a subscription-supported freemium model

Subscribers get extra features, rapid support

We’re engaged in crossing the chasm

Support from DOE will contribute to long-term success

Page 26: Accelerating Data-driven Discovery in Energy Science

Accelerating data-driven discovery in energy science

(2) Liberate scientific data

Page 27: Accelerating Data-driven Discovery in Energy Science

Q: What is the biggest obstacle to data sharing in science?

A: The vast majority of data is lost, or not online; if online, not described; if described, not indexed
• Not accessible
• Not discoverable
• Not used

Contrast with common practice for consumer photos (iPhoto)
• Automated capture
• Publish then curate
• Processing to add value
• Outsourced storage

Page 28: Accelerating Data-driven Discovery in Energy Science

We must automate the capture, linking, and indexing of all data

Globus publication service encodes and automates data publication pipelines

Example application: Materials Data Facility for materials simulation and experiment data

Proposed distributed virtual collections index, organize, tag, & manage distributed data

Think iPhoto on steroids – backed by domain knowledge and supercomputing power

Page 29: Accelerating Data-driven Discovery in Energy Science

We must automate the capture, linking, and indexing of all data

chiDB: Human-computer collaboration to extract Flory-Huggins (χ) parameters from polymers literature. R. Tchoua et al.

Plenario: Spatially and temporally integrated, linked, and searchable database of urban data. C. Catlett, B. Goldstein, T. Malik et al.

Page 30: Accelerating Data-driven Discovery in Energy Science


“I need to publish my data so that others can find it and use it.”

Scholarly Publication

Reference Dataset

Research Community Collaboration

Page 31: Accelerating Data-driven Discovery in Energy Science

Publish dashboard


Page 32: Accelerating Data-driven Discovery in Energy Science

Start a new submission


Page 33: Accelerating Data-driven Discovery in Energy Science


Describe submission: 1) Dublin Core

Page 34: Accelerating Data-driven Discovery in Energy Science


Describe submission: 2) Science metadata

Page 35: Accelerating Data-driven Discovery in Energy Science

Assemble the dataset


Page 36: Accelerating Data-driven Discovery in Energy Science


Transfer files to submission endpoint

Page 37: Accelerating Data-driven Discovery in Energy Science


Check dataset is assembled correctly

Page 38: Accelerating Data-driven Discovery in Energy Science

Submission now in curation workflow


Page 39: Accelerating Data-driven Discovery in Energy Science

Search published datasets


Page 40: Accelerating Data-driven Discovery in Energy Science

Search across collections

Page 41: Accelerating Data-driven Discovery in Energy Science

Discover a published dataset


Page 42: Accelerating Data-driven Discovery in Energy Science

Select a published dataset


Page 43: Accelerating Data-driven Discovery in Energy Science

View downloaded dataset


Page 44: Accelerating Data-driven Discovery in Energy Science

Configuring a publication pipeline: Publication “facets”

identifier: URL, Handle, DOI

description: none, standard, custom, domain-specific

curation: none, acceptance, machine-validated, human-validated

access: anonymous, public, collaborators, embargoed

preservation: transient, project lifetime, “forever”, archive
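One way to picture a publication pipeline configured from these facets is as a record with one choice per facet, checked against the options the slide lists. The sketch below is an assumption made for illustration: `FACET_OPTIONS` and `PublicationPipeline` are hypothetical names, not the Globus publication service's configuration format, and the option spellings are simply taken from the slide.

```python
from dataclasses import dataclass

# Facet options as listed on the slide (spellings are the slide's own).
FACET_OPTIONS = {
    "identifier": {"URL", "Handle", "DOI"},
    "description": {"none", "standard", "custom", "domain-specific"},
    "curation": {"none", "acceptance", "machine-validated", "human-validated"},
    "access": {"anonymous", "public", "collaborators", "embargoed"},
    "preservation": {"transient", "project lifetime", "forever", "archive"},
}


@dataclass(frozen=True)
class PublicationPipeline:
    """Hypothetical pipeline description: one choice per facet."""
    identifier: str
    description: str
    curation: str
    access: str
    preservation: str

    def __post_init__(self) -> None:
        # Reject any choice that is not among the listed options for its facet.
        for facet, allowed in FACET_OPTIONS.items():
            value = getattr(self, facet)
            if value not in allowed:
                raise ValueError(f"{facet}={value!r} is not one of {sorted(allowed)}")


# Example: a curated, embargoed collection with DOIs and archival storage.
pipeline = PublicationPipeline(
    identifier="DOI",
    description="domain-specific",
    curation="human-validated",
    access="embargoed",
    preservation="archive",
)
```

Treating each facet as an independent choice is what lets one service cover both lightweight project sharing and formally curated reference datasets.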

Page 45: Accelerating Data-driven Discovery in Energy Science

Accelerating data-driven discovery in energy science

(3) Create discovery engines at DOE facilities

Page 46: Accelerating Data-driven Discovery in Energy Science

Recall: A discovery engine for disordered structures

Diffuse scattering images from Ray Osborn et al., Argonne

Sample

Experimental scattering

Material composition: La 60%, Sr 40%

Simulated structure

Simulated scattering

Detect errors (secs to mins)

Knowledge base: past experiments; simulations; literature; expert knowledge

Select experiments (mins to hours)

Contribute to knowledge base

Simulations driven by experiments (mins to days)

Knowledge-driven decision making

Evolutionary optimization

Page 47: Accelerating Data-driven Discovery in Energy Science

Simulation: characterize, predict, assimilate, steer data acquisition

Data analysis: reconstruct, detect features, auto-correlate, particle distributions, …

Integration: optimize, fit, …

Configure, check, guide

Science automation services: scripting, security, storage, cataloging, transfer

Data flows: ~0.001-0.5 GB/s per flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows (Today: x10 in 5 yrs)

Computing scale (PFlops): 0.001, 1, 100+; batch and immediate

Tasks: precompute material database; reconstruct image; auto-correlation; feature detection

Scientific opportunities
• Probe material structure and function at unprecedented scales

Technical challenges
• Many experimental modalities
• Data rates and computation needs vary widely; increasing
• Knowledge management, integration, synthesis

Towards discovery engines for energy science (Argonne LDRD)
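The data-flow figures on this slide (~200 TB/month, ~2 GB/s total burst) can be put on a common footing with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and a 30-day month; both are assumptions, since the slide does not state them.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month
TB = 1e12                           # assumption: decimal terabyte
GB = 1e9                            # assumption: decimal gigabyte


def average_rate_gb_per_s(tb_per_month: float) -> float:
    """Convert a monthly data volume into a sustained average rate in GB/s."""
    return tb_per_month * TB / SECONDS_PER_MONTH / GB


avg = average_rate_gb_per_s(200)   # sustained average implied by ~200 TB/month
burst_to_avg = 2.0 / avg           # how far the ~2 GB/s burst exceeds that average
in_five_years = 200 * 10           # "x10 in 5 yrs" applied to the monthly volume, in TB
```

Under these assumptions the sustained average is a bit under 0.08 GB/s, so the quoted burst is roughly 26 times the average: the infrastructure has to be sized for bursts, not for the monthly mean.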

Page 48: Accelerating Data-driven Discovery in Energy Science

Linking experiment and computation

Single-crystal diffuse scattering: Defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP).

Near-field high-energy X-ray diffraction microscopy: Microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on APS cluster or months if data taken home. Used to detect errors in one run that would have resulted in total waste of beamtime.

X-ray nano/microtomography: Bio, geo, and material science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). Innovative in-slice parallelization method gives reconstruction of 360x2048x1024 dataset in ~1 minute, using 32K BG/Q cores, vs. many days on cluster: enables quasi-instant response.
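The inverse-modeling idea named for single-crystal diffuse scattering (populate candidate structures, simulate each, select the best, repeat) can be illustrated with a toy evolutionary loop. Everything in this sketch is an assumption made for illustration: the one-parameter forward model `simulate`, the squared-error `misfit`, and the population and mutation settings are hypothetical stand-ins, not the Swift+OpenMP workflow run on BG/Q.

```python
import math
import random


def simulate(composition: float, n: int = 32) -> list[float]:
    """Toy forward model (hypothetical): one parameter -> a 1-D 'scattering' curve."""
    return [math.exp(-((i / n - composition) ** 2) / 0.02) for i in range(n)]


def misfit(candidate: float, observed: list[float]) -> float:
    """Sum of squared differences between simulated and observed curves."""
    return sum((s - o) ** 2 for s, o in zip(simulate(candidate), observed))


def evolve(observed, generations=40, pop_size=20, keep=5, sigma=0.05, seed=0):
    rng = random.Random(seed)
    # Populate: random candidate parameters in [0, 1].
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        # Simulate and select: rank by misfit, keep the best few unchanged.
        parents = sorted(population, key=lambda c: misfit(c, observed))[:keep]
        # Mutate: refill the population with perturbed copies of the survivors.
        children = [
            min(1.0, max(0.0, rng.choice(parents) + rng.gauss(0.0, sigma)))
            for _ in range(pop_size - keep)
        ]
        population = parents + children
    return min(population, key=lambda c: misfit(c, observed))


observed = simulate(0.6)   # stand-in for measured data (0.6 echoes the 60% La example)
best = evolve(observed)    # parameter whose simulated curve best matches the data
```

Each generation's simulations are independent of one another, which is why this style of optimization maps naturally onto many cores.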

2-BM

1-ID

6-ID

Populate

Sim Sim

Select

Sim

Microstructure of a copper wire, 0.2mm diameter

Advanced Photon Source

Experimental and simulated scattering from manganite

Page 49: Accelerating Data-driven Discovery in Energy Science


Tying it all together: An energy sciences infrastructure

0: Develop or reuse script

1: Run script (EL1.layer)

2: Lookup file name=EL1.layer user=Anton type=reconstruction

3: Transfer inputs

4: Run app

5: Transfer results

6: Update catalogs

Components: Researchers; External collaborators; Script libraries; Collaboration catalogs (Provenance, Files & Metadata); Storage locations; Compute facilities
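The numbered flow on this slide (look up files by metadata, transfer inputs, run the application, transfer results, update catalogs) can be sketched as a small in-memory pipeline. This is a hedged illustration only: `Catalog`, `run_pipeline`, and the `.out` naming are hypothetical, transfers are simulated by passing names around, and nothing here is the actual infrastructure's interface. The query values are the ones shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical in-memory stand-in for a collaboration catalog."""
    entries: list[dict] = field(default_factory=list)

    def lookup(self, **query) -> list[dict]:
        # Step 2: return entries whose metadata matches every query key.
        return [e for e in self.entries
                if all(e.get(k) == v for k, v in query.items())]

    def register(self, **metadata) -> None:
        # Step 6: record a file together with its metadata and provenance.
        self.entries.append(metadata)


def run_pipeline(catalog: Catalog, app, **query) -> list[str]:
    log = []
    inputs = catalog.lookup(**query)                  # step 2: lookup
    log.append(f"lookup: {len(inputs)} file(s)")
    staged = [e["name"] for e in inputs]              # step 3: transfer inputs (simulated)
    log.append(f"transfer inputs: {staged}")
    results = [app(name) for name in staged]          # step 4: run app
    log.append(f"transfer results: {results}")        # step 5: transfer results (simulated)
    for src, out in zip(staged, results):             # step 6: update catalogs
        catalog.register(name=out, type="result", derived_from=src)
    return log


catalog = Catalog()
catalog.register(name="EL1.layer", user="Anton", type="reconstruction")
log = run_pipeline(catalog, app=lambda name: name + ".out",
                   name="EL1.layer", user="Anton", type="reconstruction")
```

Recording `derived_from` at the catalog-update step is what gives later users provenance: every result can be traced back to the input it came from.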

Page 50: Accelerating Data-driven Discovery in Energy Science

informatics analysis

high-throughput experiments

problem specification

modeling and simulation

analysis & visualization

experimental design

analysis & visualization

Integrated databases

Summary: Big opportunities and challenges for energy data

Immediate opportunities
• Reduce data friction and accelerate discovery by deploying Globus services across all DOE facilities
• Develop new services to capture, link energy data

Important research agenda
• Discovery engines to answer major scientific questions
• New research modalities linking computation and data
• Organization and analysis of massive science data

Page 51: Accelerating Data-driven Discovery in Energy Science


Thank you to our sponsors!

U.S. DEPARTMENT OF ENERGY

Page 52: Accelerating Data-driven Discovery in Energy Science

For more information: [email protected]

Thanks to co-authors and Globus team

Globus services (globus.org)
• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
• Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55, 2014.
• Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency - Practice and Experience, 27(2):290-305, 2014.

Publication (globus.org/data-publication)
• Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015.

Discovery engines
• Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.