28 Managed by UT-Battelle for the U.S. Department of Energy Data Driven Science
Key cyberinfrastructure elements implemented as RESTful web services
• Member Nodes
– Service Interfaces
– Bridge to non-DataONE Member Node services
– Data Repository
• Coordinating Nodes
– Object Store, Index
– Coordination Layer: Identifiers, Preservation, Catalog, Monitor
– Service Interfaces: Resolution, Discovery, Replication, Registration
• Investigator Toolkit
– Client Libraries: Java, Python, Command Line
– Web Interface; Data Management; Analysis, Visualization
• SW repo at http://mule1.dataone.org/
Identify objects
Goal: Uniquely identify data or metadata objects
• Support the several identifier types widely used
• Identifiers assigned by Member Nodes
• Uniqueness ensured by Coordinating Nodes
• Resolution through Coordinating Nodes
Identifier types shown: LSID, PURL, GUID {3F2504E0-4…
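The division of labor on this slide (Member Nodes assign identifiers, Coordinating Nodes ensure uniqueness and resolve them) can be illustrated with a toy in-memory registry. This is a minimal sketch, not DataONE code: the class `CoordinatingRegistry`, its methods, and the node names `mn-a`, `mn-b` are hypothetical. Identifiers are treated as opaque strings, so LSIDs, PURLs and GUIDs are handled alike.

```python
class DuplicateIdentifier(Exception):
    """Raised when an identifier has already been registered."""


class CoordinatingRegistry:
    """Toy stand-in for a Coordinating Node's identifier catalog (hypothetical)."""

    def __init__(self):
        # Maps identifier string -> set of Member Node names holding a copy.
        self._locations = {}

    def register(self, identifier, member_node):
        """Record an identifier assigned by a Member Node, enforcing uniqueness."""
        if identifier in self._locations:
            raise DuplicateIdentifier(identifier)
        self._locations[identifier] = {member_node}

    def add_replica(self, identifier, member_node):
        """Note that another Member Node now holds a copy of the object."""
        self._locations[identifier].add(member_node)

    def resolve(self, identifier):
        """Return the Member Nodes that can serve the object, sorted by name."""
        return sorted(self._locations[identifier])
```

A second `register` call with the same identifier fails, which is the uniqueness guarantee; `resolve` answers the "where can I get this object" question.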
Identify people: federated identity
• Identity provider selected by the user
• Member Nodes define access rules
• Rules propagated by Coordinating Nodes
• Identity and access control consistent across entire infrastructure
• (note similarity with Globus Online approach)
Discover Content: ONEMercury
Tools of interest: DataONE APIs
• RESTful interfaces give flexibility
• Investigator Toolkit consumes the REST APIs
• Reference implementations: Investigator Toolkit (ITK)
• Extensible
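Because the interfaces are RESTful, a client needs little more than URL construction and an HTTP GET. The sketch below shows only the URL-building step. The base URL and the `resolve`, `meta` and `object` path segments are assumptions made for illustration; the slides do not spell out endpoint paths, so check them against the DataONE API documentation before relying on them.

```python
from urllib.parse import quote

# Hypothetical base URL; a real deployment publishes its own.
BASE_URL = "https://cn.example.org/cn/v1"

# Assumed path segments for three read operations (not taken from the slides).
OPERATIONS = {"resolve": "resolve", "system_metadata": "meta", "get": "object"}


def rest_url(operation, identifier, base_url=BASE_URL):
    """Build the GET URL for one read operation on one identifier."""
    segment = OPERATIONS[operation]
    # Identifiers may contain ':' or '/', so percent-encode every reserved character.
    return f"{base_url.rstrip('/')}/{segment}/{quote(identifier, safe='')}"
```

For example, `rest_url("resolve", "knb-lter-gce.274.12")` yields a URL that could be fetched with `urllib.request.urlopen` or any other HTTP client, which is what makes the same service usable from Java, Python or the command line.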
The Investigator Toolkit
• Developer, end-user tools
• Creation, search, retrieval, management
• Plugins, extensions for analysis tools
Diagram labels:
• Investigator Toolkit
– Web Interface; Analysis, Visualization; Data Management
– Client Libraries: Java, Python, Command Line
• Member Nodes
• Coordinating Nodes
• Kepler
ONEDrive
• Use DataONE discovery REST service
• Overlay with a FUSE interface (implemented in Python)
• Result: a POSIX file system interface to the entire DataONE set of collections
• (Note similarity to XSEDE GFFS)
• Caveat: beware the metadata issues: query latency
• Nice coupling with ONEMercury faceted search to pare down metadata universe with selective mount commands
• Nice contrast to “schlepping files” for selected problems
• Useful for COTS SW that assumes a “file-open” dialogue
DataONE File System Structure
<facet>  (data_provider, project, title, keywords, decade)
    <value>  (abscission, abundance, accretion, …)
        <identifier>  (knb-lter-gce.274.12, knb-lter-vcr.58.7, …)
            system metadata.xml
            abstract.txt
            <identifier>
            describes
                <identifier>
                …
    <value>
        …
Figure labels: metadata, data
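The facet / value / identifier layout can be modelled as a nested dictionary built from search results; a FUSE layer would then answer directory listings from that tree. The sketch below covers only the tree-building and listing logic and makes no FUSE calls. The function names, the record fields and the identifiers `example-a.1` and `example-b.2` are hypothetical; only the facet names, the sample keyword values and the per-identifier entry names come from the slide.

```python
# Entries shown under each identifier on the slide.
IDENTIFIER_ENTRIES = ["system metadata.xml", "abstract.txt", "describes"]


def build_tree(records, facets):
    """Return {facet: {value: {identifier: [entry names]}}} for the given records."""
    tree = {facet: {} for facet in facets}
    for record in records:
        for facet in facets:
            # A record may carry several values per facet, or none at all.
            for value in record.get(facet, []):
                by_identifier = tree[facet].setdefault(value, {})
                by_identifier[record["identifier"]] = list(IDENTIFIER_ENTRIES)
    return tree


def list_dir(tree, path):
    """List the entries under a '/'-separated path, as a directory listing would."""
    node = tree
    for part in [p for p in path.split("/") if p]:
        node = node[part]
    return sorted(node)
```

With a tree like this held in memory, repeated listings do not need a new discovery query each time, which is one way to soften the query-latency caveat noted for ONEDrive.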
ONEDrive Demo
Other project tools (beyond SW stack)
DCXL: Excel extensions for data management
• Collaboration between CDL, Gordon and Betty Moore Foundation, and Microsoft
• http://dcxl.cdlib.org/
• Managed by Carly Strasser at CDL
• Examining ways to better capture metadata associated with Excel data sets
– As an Excel add-in, and/or
– As a web service
• Recent slide deck: http://www.slideshare.net/carlystrasser/dcxl-lighttalk-at-pda2012
• Why Excel? Because it is widely used. Willie Sutton: “Because that’s where the data is” (refer to Cliff Lynch this morning)
Data Management Planning Tool
https://dmp.cdlib.org/
• Create ready-to-use data management plans for specific funding agencies
• Meet funder requirements for data management plans
• Get step-by-step instructions and guidance for your data management plan as you build it
• Learn about resources and services available at your institution to help fulfill the data management requirements of your grant
• Released: Oct. 2011
• Support for NIH requirements added 2/22/2012
• Other similar efforts now also underway at institutional levels or with other entities.
• Note: Invitation to “Data Management Boot Camp” with Dorothea Salo in your symposium materials packet
Support for Entire Data Lifecycle
Lifecycle stages: Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
(Figure label: Kepler)
User Matrix
Columns: Data Service, Investigator ToolKit, Data Management Planning, Best Practices, Tools Database, Training Curricula
Rows: Scientist, Data Librarians, Ecological Modeler, Resource Manager
How can this CI change the way we do science?
• A new cloud layer: DaaS (Data-as-a-Service)
• Use REST interfaces to deploy DataONE services
• Document and open REST services so they can be used collaboratively by new partners and 3rd parties
• Enable workflow-mediated, wide-area, large-scale analysis (Caveat: you may crash them!)
• Enable a bottom-up standardization of services from experience-based discovery of design patterns
• Example: curation micro-services (tapas instead of hoagies)
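The "tapas instead of hoagies" point is that small, single-purpose services compose more easily than one large one. A minimal sketch of that idea follows, with each micro-service reduced to a plain function; the step names and the record layout are hypothetical, and in a DaaS setting each step would presumably sit behind its own REST endpoint.

```python
import hashlib


def checksum_step(record):
    """Add a SHA-256 checksum of the payload (a fixity-style curation step)."""
    return {**record, "sha256": hashlib.sha256(record["payload"]).hexdigest()}


def size_step(record):
    """Add the payload size in bytes."""
    return {**record, "size_bytes": len(record["payload"])}


def run_pipeline(record, steps):
    """Apply each small step in order; steps can be added, dropped or reordered."""
    for step in steps:
        record = step(record)
    return record
```

Calling `run_pipeline({"payload": b"abc"}, [checksum_step, size_step])` returns the original record plus the two derived fields, and a new curation step can be added without touching the existing ones.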
Provide Credit for Data Publication
• Data citation standards and courtesy customs
• Needs metrics: how often cited
• Socio-cultural change: include data citations in promotion and tenure
• DataONE needs to nurture Member Node needs, not work against them
www.CitizenScience.org
Engaging citizens in science
DataONE education and socio-cultural efforts
• ½ of DataONE project team is CEO oriented
• Best practices web area: http://www.dataone.org/best-practices
• Tenopir paper (Tenopir C, Allard S, Douglass K, Aydinoglu AU, Wu L, et al. 2011 Data Sharing by Scientists: Practices and Perceptions. PLoS ONE 6(6): e21101. doi:10.1371/journal.pone.0021101)
• Many workshops (SC’11 tutorial, ESA2011, ESA 2010, …)
Interoperability
• Some of the best science results from re-purposing data in unanticipated ways.
• Similar projects under different management need to interoperate
• Dissimilar projects need to interoperate
• Multiple institutions need to interact for the best science (SeWHIP and CTSI are good examples)
Interoperability Shibboleths
• The future is NOT an accretion event
• Data locality canard
– Bandwidth limits are real
– But data will not all be generated in one location
– Methods must be able to cope with non-local data retrieval
• The trouble with centers: Big-Data hostage crisis
• Cloud services, in general, are not a means for interoperability. Clouds are not necessarily VO friendly
Data rights issue: hard problems
• Not just a technical issue
• Currently in flux (cf. RWA kerfuffle)
• Open is nice but not always possible
– Researcher and discipline cultural behavior
– External requirements (e.g. HIPAA)
– Commercial interests, IP-related data
• Science is international but rights vary across political borders
• Technology may uncover new anti-social cultural norms
• Approach: evolve along with community of practice norms
• Change may only be on a generational scale
• Openness can be dictated by funder (some NIH successful examples)
• Rights on metadata (cf. Clifford Lynch on future reuse of PII data this morning)
Building global communities of practice: … creating long-lived CI enterprises,
• Broad, active community engagement
– Involvement of library and science educators engaging new generations of students in best practices
– Existing outreach and education programs
• Transparent, participatory governance
• Adoption/creation of innovative and sustainable business and organizational models
cf. Clifford Lynch this morning and forward link to Serge Goldstein this afternoon
DataONE ultimate goals
• A one-stop shop for all aspects of data lifecycle
• A known, reliable data management brand
• A resource enabling connections and interoperability among many (often disparate) data repositories
• An advocate and educational resource for improving data management practices
• A recognized enabler for improved research data practices
• A productive partner for data repositories (Member Nodes)
Closing plug: DataONE summer internships available
• URL: http://www.dataone.org/internships
• Available for Summer of 2012
• Time Period: May 23 – July 29, 2012
• Applications Due: March 12, 2012
• Up to 8 slots available
• Solicited topics:
1. Publish (data) or Perish: Best Practices for Creating, Reviewing, and Publishing Data Products
2. Enriching the Content of the DMPTool for the DataONE Community
3. A Portable Web Application for Data and Metadata Submission
4. Querying Scientific Workflow Provenance
5. Data Usage and Citation Visualization
6. Evaluating the Feasibility of Using Bottom-Up Text Mining Approaches to Complement Thesaurus and Ontology-based Approaches for Supporting Data Discovery
7. Enhancing Semantic Search in ONEMercury
8. An Information Model for Observational Data within DataONE
9. Components of Successful Metadata Registry Frameworks
10. Developing a DaaS (Data as a Service) view of DataONE
DataONE Community
Question & Discussion
John W. Cobb, 865.576.5439