
28 Managed by UT-Battelle for the U.S. Department of Energy Data Driven Science

Key cyberinfrastructure elements implemented as RESTful webservices

Member Nodes
•  Service Interfaces
•  Bridge to non-DataONE Member Node services
•  Data Repository

Coordinating Nodes
•  Service Interfaces: Resolution, Discovery, Replication, Registration
•  Coordination Layer: Identifiers, Preservation, Catalog, Monitor
•  Object Store, Index

Investigator Toolkit
•  Web Interface; Analysis, Visualization; Data Management
•  Client Libraries: Java, Python, Command Line

SW repo at http://mule1.dataone.org/


Identify objects

Goal: Uniquely identify data or metadata objects
•  Support the several widely used identifier types
•  Identifiers assigned by Member Nodes
•  Uniqueness ensured by Coordinating Nodes
•  Resolution through Coordinating Nodes

Identifier types shown: LSID, PURL, GUID {3F2504E0-4…
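The division of labor described on this slide (Member Nodes assign identifiers, Coordinating Nodes ensure uniqueness and resolve them) can be sketched with a small in-memory registry. This is only an illustration under stated assumptions: the class `IdentifierRegistry`, its method names, and the node names `mn-a` and `mn-b` are hypothetical and are not part of the DataONE software.

```python
class IdentifierRegistry:
    """Toy stand-in for a Coordinating Node's identifier catalog (hypothetical)."""

    def __init__(self):
        # identifier -> set of Member Node names that hold a copy of the object
        self._locations = {}

    def register(self, identifier, member_node):
        """Accept an identifier assigned by a Member Node, rejecting duplicates."""
        if identifier in self._locations:
            raise ValueError(f"identifier already registered: {identifier}")
        self._locations[identifier] = {member_node}

    def add_replica(self, identifier, member_node):
        """Record that another Member Node now holds a copy of the object."""
        self._locations[identifier].add(member_node)

    def resolve(self, identifier):
        """Return the sorted names of Member Nodes that can serve the object."""
        if identifier not in self._locations:
            raise KeyError(f"unknown identifier: {identifier}")
        return sorted(self._locations[identifier])


if __name__ == "__main__":
    registry = IdentifierRegistry()
    registry.register("knb-lter-gce.274.12", "mn-a")
    registry.add_replica("knb-lter-gce.274.12", "mn-b")
    print(registry.resolve("knb-lter-gce.274.12"))
```

The identifier is treated as an opaque string, which is one way to accommodate several identifier schemes side by side without the registry needing to understand any of them.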


Identify people: federated identity

•  Identity provider selected by the user

•  Member nodes define access rules

•  Rules propagated by Coordinating Nodes

•  Identity and access control consistent across entire infrastructure

•  (note similarity with Globus Online approach)
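As a rough sketch of the idea that access rules are defined once and then applied consistently everywhere, the following models a per-object policy that is copied to every node. Everything here is an assumption made for illustration: the `AccessPolicy` class, the `propagate` helper, the permission names, and the subject strings are hypothetical and do not reflect DataONE's actual access-control model.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Access rules for one object, as defined by its Member Node (hypothetical)."""
    owner: str
    # permission name -> set of subjects granted that permission
    grants: dict = field(default_factory=dict)

    def allow(self, subject, permission):
        self.grants.setdefault(permission, set()).add(subject)

    def is_allowed(self, subject, permission):
        # The owner may do anything; others need an explicit grant.
        if subject == self.owner:
            return True
        return subject in self.grants.get(permission, set())


def propagate(identifier, policy, nodes):
    """Store an independent copy of the policy on every node."""
    for policies in nodes.values():
        policies[identifier] = copy.deepcopy(policy)


if __name__ == "__main__":
    policy = AccessPolicy(owner="alice@idp.example")
    policy.allow("bob@idp.example", "read")
    nodes = {"cn-1": {}, "mn-b": {}}
    propagate("obj-1", policy, nodes)
    for name, policies in nodes.items():
        print(name, policies["obj-1"].is_allowed("bob@idp.example", "read"))
```

Because each node evaluates the same rules against the same subject string, a user who signs in through any identity provider gets the same answer regardless of which node is asked.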


Discover Content: ONEMercury



Tools of interest: DataONE APIs

•  RESTful interfaces give flexibility
•  Investigator Toolkit consumes the REST APIs
•  Reference implementations: Investigator Toolkit (ITK)
•  Extensible
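One practical detail for any client of a RESTful interface is building request URLs safely, since identifiers often contain characters such as ":" and "/". The helper below is a minimal sketch using only the Python standard library. The base URL `https://cn.example.org/v1` and the path segments `resolve` and `object` are placeholders chosen for illustration; they are assumptions, not documented DataONE endpoints.

```python
from urllib.parse import quote, urlencode


def rest_url(base_url, *segments, **query):
    """Join path segments onto a base URL, percent-encoding each segment."""
    # safe="" makes quote() encode "/" and ":" too, so an identifier that
    # contains them stays a single path segment.
    path = "/".join(quote(str(segment), safe="") for segment in segments)
    url = f"{base_url.rstrip('/')}/{path}"
    if query:
        url += "?" + urlencode(query)
    return url


if __name__ == "__main__":
    # Hypothetical base URL and endpoint names.
    print(rest_url("https://cn.example.org/v1", "resolve", "doi:10.1234/abc"))
    print(rest_url("https://cn.example.org/v1", "object", start=0, count=10))
```

Client libraries in any language can wrap the same pattern, which is part of why a REST layer makes it straightforward to offer Java, Python, and command-line tools over one set of services.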


The Investigator Toolkit

•  Developer, end-user tools
•  Creation, search, retrieval, management
•  Plugins, extensions for analysis tools

Diagram: Investigator Toolkit (Web Interface; Analysis, Visualization; Data Management; Client Libraries: Java, Python, Command Line), Member Nodes, Coordinating Nodes. Analysis tool shown: Kepler.


ONEDrive

•  Use DataONE discovery REST service
•  Overlay with a FUSE interface (implemented in Python)
•  Result: a POSIX file system interface to the entire DataONE set of collections
•  (Note similarity to XSEDE GFFS)
•  Caveat: beware the metadata issues (query latency)
•  Nice coupling with ONEMercury faceted search to pare down the metadata universe with selective mount commands
•  Nice contrast to “schlepping files” for selected problems
•  Useful for COTS SW that assumes a “file-open” dialogue
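The query-latency caveat is the kind of problem a file system overlay commonly softens by caching directory listings for a short time. The sketch below shows that general technique; it is not taken from the ONEDrive code. The class `ListingCache`, the `fetch` callback (standing in for a remote discovery query), and the 60-second default are all assumptions made for illustration.

```python
import time


class ListingCache:
    """Cache directory listings so repeated reads avoid a remote query."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch        # callable: path -> list of entry names
        self._ttl = ttl_seconds    # how long a cached listing stays fresh
        self._clock = clock        # injectable for testing
        self._entries = {}         # path -> (expiry time, listing)

    def listdir(self, path):
        now = self._clock()
        cached = self._entries.get(path)
        if cached is not None and cached[0] > now:
            return cached[1]
        listing = self._fetch(path)
        self._entries[path] = (now + self._ttl, listing)
        return listing


if __name__ == "__main__":
    def slow_query(path):
        # Placeholder for a remote metadata query (hypothetical).
        print("querying", path)
        return ["abscission", "abundance", "accretion"]

    cache = ListingCache(slow_query, ttl_seconds=30)
    cache.listdir("/keywords")   # triggers the query
    cache.listdir("/keywords")   # served from the cache
```

The trade-off is staleness: a listing may lag behind newly registered objects for up to the TTL, which is usually acceptable for browsing but should be kept short.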


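DataONE File System Structure

Path levels: <facet> / <value> / <identifier>
•  <facet>: data_provider, project, title, keywords, decade
•  <value>: abscission, abundance, accretion, …
•  <identifier>: knb-lter-gce.274.12, knb-lter-vcr.58.7, … (labeled "metadata" and "data" on the slide)
•  Under each <identifier>: system metadata.xml, abstract.txt, and a "describes" entry containing further <identifier> entries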
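To make the layout concrete, the following sketch splits a path in such a mounted tree into its facet, value, and identifier levels. It assumes the three-level nesting suggested by the slide; the function `parse_path`, the `OneDrivePath` type, and the rule that only the five listed facets are valid are illustrative assumptions, not ONEDrive's actual implementation.

```python
from typing import NamedTuple, Optional

# Facet names listed on the slide.
FACETS = ("data_provider", "project", "title", "keywords", "decade")


class OneDrivePath(NamedTuple):
    facet: Optional[str] = None
    value: Optional[str] = None
    identifier: Optional[str] = None
    entry: Optional[str] = None   # anything below the identifier level


def parse_path(path):
    """Split a path into facet, value, identifier, and remaining entry."""
    parts = [part for part in path.split("/") if part]
    if parts and parts[0] not in FACETS:
        raise ValueError(f"unknown facet: {parts[0]}")
    # Pad with None so shallow paths (e.g. "/keywords") still unpack.
    facet, value, identifier = (parts + [None] * 3)[:3]
    entry = "/".join(parts[3:]) or None
    return OneDrivePath(facet, value, identifier, entry)


if __name__ == "__main__":
    print(parse_path("/keywords/abundance/knb-lter-gce.274.12/abstract.txt"))
    print(parse_path("/decade"))
```

A file system overlay could use the parsed levels to decide what to do: a path that stops at the facet level lists values, one that stops at the value level runs a search, and one that reaches an identifier fetches the object's metadata or data.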


ONEDrive Demo


Other project tools (beyond SW stack)


DCXL: Excel extensions for data management

•  Collaboration between CDL, the Gordon and Betty Moore Foundation, and Microsoft
•  http://dcxl.cdlib.org/
•  Managed by Carly Strasser at CDL
•  Examining ways to better capture metadata associated with Excel data sets
   –  As an Excel add-in, and/or
   –  As a web service
•  Recent slide deck: http://www.slideshare.net/carlystrasser/dcxl-lighttalk-at-pda2012
•  Why Excel? Because it is widely used. Willie Sutton: “Because that’s where the data is” (refer to Cliff Lynch this morning)


Data Management Planning Tool

https://dmp.cdlib.org/

•  Create ready-to-use data management plans for specific funding agencies
•  Meet funder requirements for data management plans
•  Get step-by-step instructions and guidance for your data management plan as you build it
•  Learn about resources and services available at your institution to help fulfill the data management requirements of your grant
•  Released: Oct. 2011
•  Support for NIH requirements added 2/22/2012
•  Other similar efforts now also underway at institutional levels or with other entities
•  Note: Invitation to “Data Management Boot Camp” with Dorothea Salo in your symposium materials packet


Support for Entire Data Lifecycle

Lifecycle stages: Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze

Tool shown: Kepler


User Matrix

Columns: Data Service, Investigator ToolKit, Data Management Planning, Best Practices, Tools Database, Training Curricula

Rows: Scientist, Data Librarians, Ecological Modeler, Resource Manager


How can this CI change the way we do science?

•  A new cloud layer: DaaS (Data-as-a-Service)
•  Use REST interfaces to deploy DataONE services
•  Document and open REST services so they can be used collaboratively by new partners and 3rd parties
•  Enable workflow-mediated, wide-area, large-scale analysis (Caveat: you may crash them!)
•  Enable a bottom-up standardization of services from experience-based discovery of design patterns
•  Example: curation micro-services (tapas instead of hoagies)


Provide Credit for Data Publication

•  Data citation standards and courtesy customs
•  Needs metrics: how often cited
•  Socio-cultural change: include data citations in promotion and tenure
•  DataONE needs to nurture Member Node needs, not work against them


www.CitizenScience.org

Engaging citizens in science


DataONE education and socio-cultural efforts

•  ½ of DataONE project team is CEO oriented
•  Best practices web area: http://www.dataone.org/best-practices

•  Tenopir paper (Tenopir C, Allard S, Douglass K, Aydinoglu AU, Wu L, et al. 2011 Data Sharing by Scientists: Practices and Perceptions. PLoS ONE 6(6): e21101. doi:10.1371/journal.pone.0021101)

• Many workshops (SC’11 tutorial, ESA2011, ESA 2010, …)


Interoperability

• Some of the best science results from re-purposing data in unanticipated ways.

• Similar projects under different management need to interoperate

• Dissimilar projects need to interoperate

• Multiple institutions need to interact for the best science (SeWHIP and CTSI are good examples)


Interoperability Shibboleths

•  The future is NOT an accretion event
•  Data locality canard
   –  Bandwidth limits are real
   –  But data will not all be generated in one location
   –  Methods must be able to cope with non-local data retrieval
•  The trouble with centers: Big-Data hostage crisis
•  Cloud services, in general, are not a means for interoperability. Clouds are not necessarily VO friendly


Data rights issue: hard problems

•  Not just a technical issue
•  Currently in flux (cf. RWA kerfuffle)
•  Open is nice but not always possible
   –  Researcher and discipline cultural behavior
   –  External requirements (e.g. HIPAA)
   –  Commercial interests, IP-related data

•  Science is international but rights vary across political borders

•  Technology may uncover new anti-social cultural norms

•  Approach: evolve along with community of practice norms

•  Change may only be on a generational scale

•  Openness can be dictated by funder (some NIH successful examples)

•  Rights on metadata (cf. Clifford Lynch on PII data future reuse this morning)


Building global communities of practice: … creating long-lived CI enterprises,

•  Broad, active community engagement
   –  Involvement of library and science educators engaging new generations of students in best practices
   –  Existing outreach and education programs
•  Transparent, participatory governance
•  Adoption/creation of innovative and sustainable business and organizational models

cf. Clifford Lynch this morning and forward link to Serge Goldstein this afternoon


DataONE ultimate goals

•  A one-stop shop for all aspects of data lifecycle

•  A known, reliable data management brand

•  A resource enabling connections and interoperability among many (often disparate) data repositories

•  An advocate and educational resource for improving data management practices

•  A recognized enabler for improved research data practices

•  A productive partner for data repositories (Member Nodes)


Closing plug: DataONE summer internships available

•  URL: http://www.dataone.org/internships
•  Available for Summer of 2012
•  Time Period: May 23 – July 29, 2012
•  Applications Due: March 12, 2012
•  Up to 8 slots available
•  Solicited topics:
   1.  Publish (data) or Perish: Best Practices for Creating, Reviewing, and Publishing Data Products
   2.  Enriching the Content of the DMPTool for the DataONE Community
   3.  A Portable Web Application for Data and Metadata Submission
   4.  Querying Scientific Workflow Provenance
   5.  Data Usage and Citation Visualization
   6.  Evaluating the Feasibility of Using Bottom-Up Text Mining Approaches to Complement Thesaurus and Ontology-based Approaches for Supporting Data Discovery
   7.  Enhancing Semantic Search in ONEMercury
   8.  An Information Model for Observational Data within DataONE
   9.  Components of Successful Metadata Registry Frameworks
   10.  Developing a DaaS (Data as a Service) view of DataONE


DataONE Community


Question & Discussion

John W. Cobb
865.576.5439

[email protected]