CLOUD DATAVERSE Mercè Crosas, Institute for Quantitative Social Science, Harvard University @mercecrosas MOC WORKSHOP, OCTOBER 3, 2017, BOSTON UNIVERSITY

CLOUD DATAVERSE - Harvard University · FAIR DATA IN DATAVERSE Data Files Metadata Data Licenses, User Agreements, Restrictions Data Citation with Persistent Identifier Versions


Page 1

CLOUD DATAVERSE

Mercè Crosas, Institute for Quantitative Social Science, Harvard University

@mercecrosas

MOC WORKSHOP, OCTOBER 3, 2017, BOSTON UNIVERSITY

Page 2

OUR INSTITUTE PROVIDES A TECHNOLOGY SOLUTION TO DATA SHARING

Institute for Quantitative Social Science, Harvard University

@IQSS

Page 3

Open-source software to share, cite, and find data.

Developed at Harvard's Institute for Quantitative Social Science with the contribution of an active and growing community.

Page 4

2006 (we started) 2017

dataverse.org

26 Dataverse installations serving hundreds of institutions

Page 5

HOW RESEARCHERS SHARE & USE DATA WITH DATAVERSE

Harvard Dataverse Repository

A public repository for research data

> 70,000 datasets total

> 49,000 datasets uploaded to Harvard Dataverse repository (200 datasets/month)

> 340,000 files (4,000 files/month)

> 2.5 M downloads (60,000 downloads/month)

Datasets Added

Downloads

dataverse.harvard.edu

Page 6

OUR CONTRIBUTIONS TO ENHANCE DATA SHARING

King, 1995, Replication, Replication

Altman et al, 2001, A Digital Library for the Dissemination and Replication of Quantitative Social Science

Altman and King, 2007, A Proposed Standard for the Scholarly Citation of Quantitative Data

King, 2007, An Introduction to the Dataverse Network as an Infrastructure for Data Sharing

Crosas, 2012, The Dataverse Network: an open source application for sharing, discovering, and preserving research data

Altman and Crosas, 2013, The Evolution to Data Citation: from principles to implementation

Crosas, 2013, A Data Sharing Story

2014, Joint Declaration of Data Citation Principles

Pepe et al, 2014, How Do Astronomers Share Data?

Goodman et al, 2014, Ten Simple Rules for the Care and Feeding of Scientific Data

Crosas, Honaker, King, Sweeney, 2015, Automating Open Science for Big Data

Castro et al, 2015, Achieving Human and Machine Accessibility of Cited Data

Sweeney, Crosas, Bar-Sinai, 2015, Sharing Sensitive Data with Confidence: The DataTags System

Meyer et al, 2016, Data Publication with the Structural Biology Data Grid Supports Live Analysis

Wilkinson et al, 2016, The FAIR Guiding Principles for Scientific Data Management and Stewardship

Bierer, Crosas, Pierce, 2017, Data Authorship as an Incentive to Data Sharing

Page 7

Data should be ...

FINDABLE, ACCESSIBLE, INTEROPERABLE, REUSABLE

Wilkinson et al., 2016, "The FAIR Guiding Principles for Scientific Data Management and Stewardship", Nature Scientific Data

Page 8

FAIR DATA IN DATAVERSE

Data Files

Metadata

Data Licenses, User Agreements, Restrictions

Data Citation with Persistent Identifier

Versions

APIs
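
As a rough illustration of how the persistent identifier and the APIs listed above fit together, the sketch below builds a request URL that asks a Dataverse installation for a dataset by its persistent identifier. It assumes the Dataverse native API path `/api/datasets/:persistentId/`; the helper name `dataset_api_url` and the DOI are hypothetical placeholders, not taken from the slides.

```python
from urllib.parse import urlencode


def dataset_api_url(server: str, persistent_id: str) -> str:
    """Build a URL that looks up a dataset by persistent identifier.

    Assumes the Dataverse native API endpoint /api/datasets/:persistentId/
    (an assumption of this sketch, not something stated in the slides).
    """
    # urlencode percent-escapes the ':' and '/' characters in the DOI.
    query = urlencode({"persistentId": persistent_id})
    return f"https://{server}/api/datasets/:persistentId/?{query}"


# Hypothetical DOI used only for illustration; it is not a real dataset.
url = dataset_api_url("dataverse.harvard.edu", "doi:10.7910/DVN/EXAMPLE")
print(url)
```

The sketch only constructs the URL, so it runs without network access; fetching it would be a separate step with any HTTP client.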

Page 9

Cloud Dataverse combines the power of cloud computing and storage with access to thousands of datasets from a feature-rich data repository platform

Page 10

WHY CLOUD DATAVERSE?

Big Data should also be FAIR Data

Datasets are replicated to the Cloud for efficient access and reuse

Computing on a dataset is enabled directly from any repository

Page 11

Page 12

WHAT WE HAVE BUILT

Dataverse integration with Swift storage

Compute access to MOC from a dataset page in Dataverse

Temporary URL to access restricted files in MOC

IN PROGRESS

Replicate data from any Dataverse to Cloud Dataverse

Upload data directly in Swift; publish dataset from Swift to Dataverse

NEXT

Implement Swift Access Control List (ACL) for file restriction

Support InCommon for MOC to use same credentials as in Dataverse
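
The temporary URL item under WHAT WE HAVE BUILT can be sketched as follows. This is a minimal sketch assuming the standard OpenStack Swift TempURL scheme (an HMAC-SHA1 signature over the method, expiry time, and object path); the slides do not say that Cloud Dataverse signs URLs exactly this way. The host, account, container, object, and key below are hypothetical.

```python
import hmac
from hashlib import sha1
from time import time


def swift_temp_url(host, path, key, ttl_seconds, method="GET", now=None):
    """Return a time-limited URL for one Swift object.

    path is the object path, e.g. /v1/<account>/<container>/<object>.
    key is the secret TempURL key set on the account or container.
    now lets callers pin the clock (useful for testing).
    """
    expires = int((time() if now is None else now) + ttl_seconds)
    # Swift TempURL signs "<METHOD>\n<expires>\n<path>" with the secret key.
    message = f"{method}\n{expires}\n{path}".encode()
    signature = hmac.new(key.encode(), message, sha1).hexdigest()
    return (
        f"https://{host}{path}"
        f"?temp_url_sig={signature}&temp_url_expires={expires}"
    )


# Hypothetical values: one-hour read access to a restricted file.
print(
    swift_temp_url(
        host="swift.example.org",
        path="/v1/AUTH_demo/datasets/restricted-file.csv",
        key="not-a-real-key",
        ttl_seconds=3600,
    )
)
```

Because the signature covers the method, path, and expiry, a URL produced this way grants access to a single object for a limited time without sharing the user's credentials.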

Page 13

INTEGRATION WITH OTHER PROJECTS

Page 14

BILLION OBJECT PLATFORM

BIG GEODATA EXPLORATION AND ANALYTICS

Page 15

Page 16

DATA PROVENANCE

TRACK THE ORIGINAL SOURCE OF A DATASET

Page 17

Pasquier, Lau, Trisovic, Boose, Couturier, Crosas, Ellison, Gibson, Jones, Seltzer, 2017, If These Data Could Talk, Nature Scientific Data

(Data Provenance examples from CERN and Harvard Forest)

Page 18

DATA PRIVACY

CLASSIFY AND HANDLE DATASETS BASED ON THEIR PRIVACY LEVEL

Page 19

Harvard Data Privacy Tools Project: privacytools.seas.harvard.edu

DataTags Project: datatags.org

Page 20

THANKS

@mercecrosas

@iqss

scholar.harvard.edu/mercecrosas

dataverse.org