Management, Analysis, and Visualization of Experimental and Observational Data The Convergence of Data and Computing Advanced Scientific Computing Advisory Committee Meeting 9/20 – 9/21/2016 Washington, D.C. E. Wes Bethel, Ph.D. Lawrence Berkeley National Laboratory


Page 1

Management, Analysis, and Visualization of Experimental and Observational Data

The Convergence of Data and Computing

Advanced Scientific Computing Advisory Committee Meeting 9/20 – 9/21/2016 Washington, D.C.

E. Wes Bethel, Ph.D. Lawrence Berkeley National Laboratory

Page 2

EOS: Multiple Exabytes of Data/Year

• Detectors and other sensors are increasing in resolution and speed faster than processors and memory.
• Science User Facilities (SUFs) are estimating O(10s) PB/yr
• Across SC SUFs in the coming years, the aggregate forecast is multiple EB/yr.
• Are we ready?

Figure: Projected Rates. Increase over 2010 (vertical axis, 0 to 60) versus year (2010 to 2015) for sequencers, detectors, processors, and memory.

Image: Kathy Yelick (LBNL)
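
To put the forecast in perspective, the annual volumes above can be converted to the average sustained rate needed just to keep pace. The Python sketch below is illustrative only. It assumes decimal units (1 PB = 10^15 bytes, 1 EB = 10^18 bytes), a 365-day year, and a perfectly even data flow, none of which come from the slides. The function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years (assumption)

PB = 1e15  # decimal petabyte in bytes (assumption)
EB = 1e18  # decimal exabyte in bytes (assumption)


def sustained_rate_gb_per_s(bytes_per_year: float) -> float:
    """Average rate, in decimal GB/s, needed to absorb a yearly data volume."""
    return bytes_per_year / SECONDS_PER_YEAR / 1e9


# One facility at the low end of "O(10s) PB/yr", and one exabyte per year.
rate_10pb = sustained_rate_gb_per_s(10 * PB)
rate_1eb = sustained_rate_gb_per_s(1 * EB)

print(f"10 PB/yr -> {rate_10pb:.2f} GB/s")
print(f"1 EB/yr  -> {rate_1eb:.1f} GB/s")
```

Under those assumptions, 10 PB/yr averages roughly 0.3 GB/s and 1 EB/yr roughly 32 GB/s. Real instruments are bursty, so peak rates would be higher than these averages.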

Page 3

EOD Workshop Objectives

• Better understand data-centric issues facing Office of Science Science User Facilities
• Identify for meeting those needs and science objectives
• Foster dialogue between EOS projects and ASCR

• September 29 – October 1, 2015

Page 4

Who Was At the Workshop?

Science User Facility | Lab | Office
Environmental Molecular Sciences Lab | PNNL | BER
Climate Modeling (Computing) | PNNL | BER
Atmospheric Radiation Measurement Climate Research Facility | PNNL | BER
Advanced Light Source | LBNL | BES
Linac Coherent Light Source | SLAC | BES
Spallation Neutron Source, High Flux Isotope Reactor | ORNL | BES
Scanning Tunneling Electron Microscopy | ORNL | BES
Advanced Photon Source | ANL | BES
Deep Underground Neutrino Experiment | FNAL | HEP
Cosmology/astrophysics (Computing) | JHU | HEP
Cosmic Frontier | ANL | HEP

Math/CS/Facilities - ASCR

• Computing/Network Facilities: NERSC, OLCF, ALCF, ESnet
• CS/math: data management, data analysis (ML, graph analytics, statistical analysis, etc.)/visualization, data mining, operating systems/runtime, workflow, optimization, UI/human factors, applied mathematics

Guests

• NSF Computational Facility
• UK Science Grid

Page 5

Executive Summary

• Gaining scientific knowledge from experimental data is increasingly difficult
• Convergence of data and computing: data- and computing-centric needs increasingly intertwined, symbiotic
• Acute, urgent data-centric needs in SUFs and science programs

Page 6

Finding: All EOS Projects Struggle with a Flood of Data

• How much data? Multiple exabytes/year
• Science driver: better instruments
• Challenges:
  • More volume, complexity, variety (the V’s)
  • Impediments to understanding data
  • Inadequate infrastructure, software
• Opportunity cost: potential loss of science

A key limitation today is our [in]ability to analyze and visualize the acquired data due to its volume, velocity, and variety.

B. Toby (ANL) Advanced Photon Source X-ray imaging/microscopy

Page 7

Finding: EOS Projects Have Difficulty Using Large-Scale, High Performance Computing Facilities

• Insufficient resources: compute, storage, software, staffing, …
• EOS workloads/needs are different than traditional HPC workloads:
  • Computational requirements
  • Data requirements
• Challenge: many HPC facilities’ operational policies optimized for high-concurrency, batch workloads

Procedures for moving data from place to place, including tools for automating resilient workflow for orchestrating distributed data-related operations are a bottleneck.

P. Rasch (PNNL), Climate modeling and analysis

Page 8

Finding: EOS Projects Have Unmet Time-Critical Data Needs

• EOS projects require low-latency, high-throughput response
• Drivers:
  • (Convergence topic) Experiment optimization/tuning: near real-time analysis of experiment results (using HPC platforms) to adjust experiment parameters to obtain better data.
  • Increasing throughput to keep pace with data acquisition rates.
• Challenges:
  • Local vs. remote computing resources
  • Solutions need to accommodate both local and remote resources
  • Algorithms and software infrastructure designed for workloads of the last decade (or last century) likely inadequate to accommodate these data loads and response times.

Predicting optimal [experiment] parameters could optimize data collection schemes and ultimately provide better quality data.

B. Toby (ANL) Advanced Photon Source X-ray imaging/microscopy
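
The experiment-tuning driver above is a feedback loop: acquire, analyze, adjust, repeat. The Python sketch below shows that loop in miniature. It is not drawn from any facility's software. `simulated_quality` is a hypothetical stand-in for "acquire a frame and score it," the optimum at 2.0 is arbitrary, and the greedy search is just one simple tuning strategy among many.

```python
def simulated_quality(exposure: float) -> float:
    """Hypothetical stand-in for acquiring a frame and scoring its quality.

    The score peaks at exposure = 2.0, an arbitrary value chosen for this sketch.
    """
    return 1.0 - (exposure - 2.0) ** 2


def tune_exposure(start: float, step: float = 0.5, rounds: int = 20) -> float:
    """Greedy tuning: keep any change that improves quality, else halve the step."""
    exposure = start
    best = simulated_quality(exposure)
    for _ in range(rounds):
        improved = False
        for candidate in (exposure + step, exposure - step):
            quality = simulated_quality(candidate)  # analysis of the newest result
            if quality > best:
                exposure, best = candidate, quality  # adjust the experiment
                improved = True
                break
        if not improved:
            step /= 2  # no gain in either direction: search more finely
    return exposure


tuned = tune_exposure(start=0.5)
print(f"tuned exposure: {tuned}")
```

In a real deployment each call to the scoring function would involve moving data to an analysis resource and back, which is why the slide stresses low latency: the loop can only adjust as fast as the analysis returns.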

Page 9

Finding: The Risk of Unusable Data

• Without adequate metadata, scientific data has limited usefulness
• Data-centric tasks left to the individual user
• Drivers
  • Compliance with Executive Order for data dissemination
  • Opportunity for new science unforeseen by original experimentalists when data is shared, reused
• Challenge: metadata is an afterthought

One very real problem is that data is almost never usable by anyone other than the person who produced it. This problem must be solved if making data publicly available is to have any useful purpose.

H. Steven Wiley (PNNL) Environmental Molecular Sciences Laboratory

Page 10

Finding: Collaboration and Sharing are Central to EOS Projects

• Sharing and collaboration are central to modern EOS
• Driver: Share data, methods and tools with the broader scientific community
• Challenge: Ad hoc approaches impede sharing
• Opportunities
  • Reduce costs, better science

Current technologies are inadequate for sharing data between group members. The community needs a more fluid means for sharing data and working together.

H. Steven Wiley (PNNL) Environmental Molecular Sciences Laboratory

Page 11

Finding: EOS Data Lifecycle Needs Not Being Met

• EOS data lifecycle needs are complex
  • DAQ through data publication and data products
• Drivers:
  • EOS mission is to collect data and share it
  • Reference datasets
• Challenges:
  • No program-wide approach for meeting needs
• Opportunities
  • Potential to increase science discoveries per experiment through data sharing, reuse (reference datasets)

..our only archival process right now is that provided by the published journal.

Providing more access to data, in a manner that can be used by more scientists, will improve efficiency, increase scientific impact, and result in more discoveries per experiment.

G. Granroth and T. Proffen (ORNL) Spallation Neutron Source

Page 12

Finding: Software Has a Central Role in EOS Projects

• Software central in all aspects of working with EOD
• Drivers:
  • EOS vulnerable to inefficiencies, increased costs that can result from software-related issues
• Challenges:
  • How to create the scientific software needed to run the science facility?
  • Insufficient program-wide visibility, coordination
• Opportunities
  • Potential for broad, positive impact on EOS

The most serious impediment the APS encounters is a lack of a DOE-wide view of software needs across the BES mission. Since each lab has its own portfolio of responsibilities, it devotes resources to those goals

B. Toby (ANL) Advanced Photon Source X-ray imaging/microscopy

Page 13

Finding: Workforce Development and Retention Concerns

• Our personnel: our single most precious resource
• Challenges:
  • Specialized and multi-disciplinary knowledge is rare
  • Data scientists in high demand
  • Insufficient or inadequate career paths
• Opportunities:
  • EOS is a problem-rich environment

Many data-centric problems require close interaction between the domain scientist and data scientist. Generally, such dual background is the exception and some amount of professional training is required to fill this gap.

S. Kalinin (ORNL) Scanning probe and scanning transmission electron microscopy

Page 14

Recommendation: Address Challenges Posed by Growing Data Size, Rate, Complexity

• Data infrastructure: modernize data infrastructure to accommodate present and future EOS data size, rates
• Math and CS research: multidisciplinary, and targeting key EOS data challenges
• Software: cultivate multidisciplinary teams focused on EOS software development and deployment

A key limitation today is our [in]ability to analyze and visualize the acquired data due to its volume, velocity, and variety.

B. Toby (ANL) Advanced Photon Source X-ray imaging/microscopy

Page 15

Recommendation: Evolve HPC Facilities to Include Focus on EOS Needs

• Identify and prioritize EOS needs
• Evaluate strategies for accommodating EOS project needs (convergence, data lifecycle)
• Refine operational policies for EOS workloads
• Reconsider facility metrics, hardware & software implications
• Consider approaches for providing resources, and use policies, attuned for EOS data needs

Need for community- (or facility-)centric data repository for data archival, sharing; with substantial bandwidth to the stored data, and easy interface for interacting with the data analytics… needs to be massively parallel, a combination of visualization and various analysis tools.

Alex Szalay, Johns Hopkins University

Page 16

Recommendation: Develop Solutions for EOS Data-Centric Workloads

• Assess and understand EOS needs, use patterns, performance limits
• Assess applicability of related factors
• Develop requirements, strategies, policies aimed at EOS workloads
• Cultivate and adopt new methods for specialized EOS use cases like time-critical processing

...it is getting to point where users cannot just download their data – their hard drive isn’t big enough, and if it was they wouldn’t have the computing power needed to do anything with it.

D. Parkinson (LBNL) Advanced Light Source

Page 17

Recommendation: Improve EOS Productivity with Resilient, Automated Data Pipelines

• Develop understanding of common patterns in EOS
• Develop solutions for optimizing across both quantitative and qualitative metrics
  • E.g., performance, ease-of-use
• Promote reusability of workflows and methods across EOS projects and HPC facilities

Many instruments produce very large and complex data streams that remain unmined … and are quickly becoming the largest fraction of our data by volume … because they must be manually interpreted by experts for data quality and meaning.

L. Riihimaki and C. Sivaraman (PNNL) Atmospheric Radiation Measurement Climate Research Facility
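
One common ingredient of a resilient, automated pipeline is retrying a failed step instead of stopping for manual intervention. The Python sketch below illustrates that idea with exponential backoff. It is an assumption-laden example, not a method from the workshop. `run_with_retries` and `flaky_transfer` are hypothetical names, and the simulated failure stands in for a real transfer tool.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], T], attempts: int = 3, base_delay: float = 0.01) -> T:
    """Run one pipeline step, retrying with exponential backoff on OSError."""
    for attempt in range(attempts):
        try:
            return step()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to an operator
            time.sleep(base_delay * 2 ** attempt)  # wait longer after each failure
    raise ValueError("attempts must be at least 1")


# Hypothetical transfer step that fails twice and then succeeds.
calls = {"count": 0}


def flaky_transfer() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("simulated network failure")
    return "transferred"


result = run_with_retries(flaky_transfer)
print(result, "after", calls["count"], "attempts")
```

Because the retry wrapper takes any callable, the same helper could be reused across steps and projects, in the spirit of the reusability bullet above. Production workflow systems typically add logging, checkpointing, and limits on total retry time.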

Page 18

Recommendation: Address Metadata Needs of the EOS Community

• How is data to be used by EOS?
• Focus R&D methods to enable sharing, understanding, search, curation and related EOS needs
• Strive towards automated metadata capture
  • Make it “part of the culture, part of the practice”
• R&D&D of tool sets for capturing, storing, managing metadata

Truly reusable data requires a significant amount of associated metadata…the overall cost and complexity of metadata recording and consolidation is currently prohibitive, which is the primary reason it is rarely collected.

H. Steven Wiley (PNNL) Environmental Molecular Sciences Laboratory
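
Automated metadata capture can be as simple as recording a description alongside every data file at the moment it is written, so nobody has to reconstruct it later. The Python sketch below is one minimal way to do that using only the standard library. The sidecar naming scheme, the field names, and the example instrument and parameters are all assumptions made for illustration, not a community standard.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_with_metadata(data: bytes, path: Path, instrument: str, parameters: dict) -> Path:
    """Write a data file plus a JSON sidecar describing how it was produced."""
    path.write_bytes(data)
    metadata = {
        "file": path.name,
        "sha256": hashlib.sha256(data).hexdigest(),  # lets later users verify integrity
        "size_bytes": len(data),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "instrument": instrument,    # hypothetical field
        "parameters": parameters,    # hypothetical field: acquisition settings
    }
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar


with tempfile.TemporaryDirectory() as tmp:
    sidecar = write_with_metadata(
        b"\x00\x01\x02",
        Path(tmp) / "scan_0001.raw",
        instrument="example-detector",
        parameters={"exposure_s": 0.5},
    )
    record = json.loads(sidecar.read_text())

print(record["file"], record["size_bytes"])
```

Capturing the metadata in the same function that writes the data is the design point: the cost of recording drops to zero for the user, which addresses the "prohibitive" cost named in the quote above.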

Page 19

Recommendation: Expand Capabilities for Collaboration and Sharing of EOS Tools & Data

• How do EOS projects need/want to share data?
• Develop approaches for sharing data, finding data, tools, and related resources
• Deploy tools that are data-/metadata-driven, and easy to use

Providing more access to the data, in a manner that can be used by more scientists, will improve efficiency, increase the impact of the science, and result in more papers per experiment.

G. Granroth & T. Proffen (ORNL) Spallation Neutron Source and High Flux Isotope Reactor

Page 20

Recommendation: Identify and Fill Gaps in Data Lifecycle, Reproducibility and Curation

• Develop understanding of needs/uses, assess what’s available
• Develop a roadmap for R&D that is broadly applicable across SC EOS projects
• Develop strategies for provisioning infrastructure for data needs
• Assess possibility of provisioning DOE-SC-wide facilities for data storage, archival, sharing

A welcome addition in the data universe would be a centralized DOE facility that provides … a mechanism for data archival and retrieval.

B. Toby (ANL) Advanced Photon Source X-ray imaging/microscopy

Page 21

Recommendation: Expand Efforts Focusing on EOS Software Ecosystem

• Develop an understanding of broad and diverse EOS data-centric software needs
• Cultivate software R&D projects for EOS that follow best practices and are broadly applicable, usable
• Identify opportunities for broader coordination of EOS data software design, development, deployment
• Identify and establish practices for software archival, curation, dissemination, and long-term support.

The most serious impediment the APS encounters is a lack of a DOE-wide view of software needs across the BES mission. Since each lab has its own portfolio of responsibilities, it devotes resources to those goals

B. Toby (ANL) Advanced Photon Source X-ray imaging/microscopy

Page 22

Recommendation: Develop and Nurture a Data Science Workforce

• Prioritize the role of data science activities in multidisciplinary teams
• Recognize, reward the roles and skills of this group
• Provide career paths

… Lack of trained manpower and career paths for computationally-oriented scientists

S. Habib (ANL) Cosmic Frontier Use Cases

Page 23

Convergence: EOS and Simulation Science Increasingly Intertwined and Inseparable

• Better computing results in better data, better data results in better computing
• Diversity of EOS use cases, “computing” refers to simulation science, and a lot more
• Can’t do data-intensive science without computing

Page 24

Additional Topics - Backup Slides

• Methodology for collecting SUF data needs
• What happened at the workshop
• EOS Science Use Cases

Page 25

Summary and Final Thoughts

• All EOS projects face significant, complex data challenges
  • They are tackling these issues “on their own”
  • There is no program-wide, coordinated effort to address them
• Urgent and acute needs: multiple exabytes/year are coming soon and EOS projects are not ready
• The workshop report has a lot more depth
