OPEN DIALOGUE ON DIGITAL DATA MANAGEMENT. Pat Burns, Dean; Dawn Paschal, Assistant Dean; CSU Libraries. October 13, 2010


Page 1

OPEN DIALOGUE ON DIGITAL DATA MANAGEMENT

Pat Burns, Dean

Dawn Paschal, Assistant Dean

CSU Libraries

October 13, 2010

Page 2

Background
• NSF requires proposals submitted on or after Jan. 18, 2011 to include plans for data management: http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j
• NIH & USDA also have similar requirements
• Other agencies looming: ‘Federal Research Public Access Act’
• Maximizing the value of data by sharing
  • Discoverability
  • Access
  • Preservation
  • Management

Page 3

Science ‘Then’ (5-10 years ago)

[Figure labels: Theory, Computation, Experiment]

Page 4

Science ‘Now’

[Figure labels: Theory, Computation, Experiment, and many repeated ‘Data’ labels]

Page 5

Science ‘Emerging’

[Figure labels: Theory, Experiment, Data, Computation]

Page 6

Science 2.0 ‘Now’?

[Figure labels: Theory, Experiment, Data, Computation, and many repeated ‘Data’ labels]

Page 7

Large Digital Data Sets
• Satellite imagery can generate > 1 petabyte (10^15 bytes) of data per day!
• Supercomputers also generate massive data sets
• Can we transport them?
  • E.g., at 10 Gbits per second (note bits, not bytes: 1 byte = 8 bits)
  • Time = 8 x 10^15 bits / (10^10 bits/sec) = 8 x 10^5 secs = 222 hours = 1 week, 2 days, 6 hours, 13 mins
• Can we store them?
  • Requires 500 ea. 2 TByte disks @ $250 ea. = $125,000; @ 5 year lifetime = $25,000/yr.
  • Requires 1 full rack in a data center: space, power, cooling, …
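The transport and storage arithmetic on this slide can be re-run for other data set sizes or link speeds. The following is a minimal sketch using the slide's inputs (1 petabyte, 10 Gbits per second, 2 TByte disks at $250 each, 5-year lifetime). The function names are illustrative, decimal units are assumed, and the transfer time is an ideal figure that ignores protocol overhead and contention.

```python
import math

BITS_PER_BYTE = 8


def transfer_seconds(num_bytes: float, link_bits_per_sec: float) -> float:
    """Ideal transfer time: ignores protocol overhead and link contention."""
    return num_bytes * BITS_PER_BYTE / link_bits_per_sec


def split_duration(seconds: float) -> tuple[int, int, int, int]:
    """Break a duration into (weeks, days, hours, minutes), dropping leftover seconds."""
    minutes = int(seconds // 60)
    weeks, minutes = divmod(minutes, 7 * 24 * 60)
    days, minutes = divmod(minutes, 24 * 60)
    hours, minutes = divmod(minutes, 60)
    return weeks, days, hours, minutes


def disk_cost(num_bytes: float, disk_bytes: float, price_per_disk: float,
              lifetime_years: float) -> tuple[int, float, float]:
    """Return (disk count, purchase cost, cost per year) for raw capacity only."""
    disks = math.ceil(num_bytes / disk_bytes)
    purchase = disks * price_per_disk
    return disks, purchase, purchase / lifetime_years


PETABYTE = 10**15
seconds = transfer_seconds(PETABYTE, 10 * 10**9)   # 10 Gbits per second
weeks, days, hours, minutes = split_duration(seconds)
disks, purchase, per_year = disk_cost(PETABYTE, 2 * 10**12, 250, 5)

print(f"Transfer: {seconds:.0f} s = {weeks} wk {days} d {hours} h {minutes} min")
print(f"Storage: {disks} disks, ${purchase:,.0f} total, ${per_year:,.0f}/yr")
```

The disk figure covers raw capacity only; redundancy, racks, power, and cooling would add to it.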

Page 8

Incoming!
• An individual researcher can generate many data sets
• We have many researchers who generate large data sets
  • Number: Many x many = Very many!
  • Size: Very many x Very big = Enormous!

Page 9

Projected Needs (2009 CSU Survey)

Research Data Storage Needed (TBytes)
Now        219
2 Years    1,384
5 Years    4,914

CSU-DR = 3 Tbytes!!!
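To put the survey projections in perspective, the implied yearly growth can be worked out from the three chart values. This is a rough sketch only: the constant-growth (compound) model is an assumption rather than something the survey states, and the names are illustrative.

```python
# Survey projections from the chart: years from now -> TBytes needed.
projections = {0: 219, 2: 1384, 5: 4914}


def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Constant yearly multiplier that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years)


now = projections[0]
for years in (2, 5):
    factor = annual_growth_factor(now, projections[years], years)
    print(f"Now -> {years} years: x{factor:.2f} per year")

# Compare each projection with the 3 TBytes the slide gives for the CSU-DR.
CSU_DR_TBYTES = 3
for years, tbytes in projections.items():
    print(f"{years} years out: {tbytes / CSU_DR_TBYTES:.0f}x the CSU-DR figure")
```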

Page 10

How Can We Help?

IT (System Stewards)
• Storage capacity
• Transport capacity
• Back-up
• Sysadmin
• IT security/privacy
• Transcoding
• The ‘back end’

Libraries (Data/Info Stewards)
• Data organization & structure
• IP issues
• Metadata
• Discoverability
• Preservation
• The ‘front end’
• Interactions w/ researchers

Joint Operations

Page 11

How Can We Help (cont’d)?
• Agreement upon a framework
• Draft of a framework, present to faculty
• Language for our faculty to include in their proposals
• Strategy, policy, procedures
• Definition of work flow(s)
• Architectures for operations & preservation
  • Back-up vs. preservation, LOCKSS?

Page 12

Policies: The ‘Front End’
• DRM: IP/ownership issues: data sets not ‘copyrightable’ (not creative works)
  • But there may be local, institutional IP policies that override this
  • Note that IP ≠ copyright
  • Creative Commons or Science Commons licensing may apply
  • An embargo period is required
• What are the preservation periods?

Page 13

NSB Data Type Definitions*
• Research collections (small, useful to individuals/teams for life of a project, limited curation, standards typically lacking)
• Resource collections (medium, useful to a community, follow group’s standards, mid- to long-term utility)
• Reference collections (large, serve many segments of science/engineering, conform to robust standards, indefinite support)

*National Science Board

Page 14

Workflow
• Faculty provide data/information
  • Enter metadata, user’s manuals, select embargo period, select licensing options, enter pubs, point to or supply data sets, …
• Librarians manage data/information
  • Review metadata, ingest and make accessible, review periodically, deaccession periodically (annually?), manage data, interact w/ faculty
• IT staff implement and operate systems
  • Operate system, backups, security, upgrading storage, transport, move to LOCKSS, etc.
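One way to picture the hand-off from faculty to librarians is as a single submission record. The sketch below is purely illustrative: the class, field names, and sample values are hypothetical and do not describe any existing CSU system. It shows only how an embargo period and a librarian review could jointly gate access.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Submission:
    # Supplied by faculty (field names are illustrative, not from the slide).
    title: str
    metadata: dict
    license: str
    embargo_days: int
    data_location: str            # a pointer (URL or path) to the data set
    submitted_on: date = field(default_factory=date.today)
    # Set by librarians after metadata review.
    reviewed: bool = False

    def public_on(self) -> date:
        """Earliest date the data set may be exposed, given the embargo."""
        return self.submitted_on + timedelta(days=self.embargo_days)

    def is_accessible(self, today: date) -> bool:
        """Accessible only after librarian review and once the embargo has ended."""
        return self.reviewed and today >= self.public_on()


# Made-up example record.
s = Submission("Example data set", {"creator": "A. Researcher"}, "CC BY", 180,
               "https://example.org/dataset", submitted_on=date(2010, 10, 13))
s.reviewed = True
print(s.public_on(), s.is_accessible(date(2011, 1, 1)))
```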

Page 15

Digital Assets - the 4 Pieces
• The Metadata, ideally on the CSU-DR
  1. Typical, what we collect today, e.g. lightweight metadata (probably not copyrightable)
  2. Contextual, e.g., user’s manuals (yes, copyrightable)
3. Scholarly publications associated with the data – ideally on the CSU-DR
4. The data itself – should be in the most appropriate place (pointers?)

Page 16

Digital Assets Management

[Diagram labels: 1. Metadata; 2. User’s Manuals; 3. Pubs; Libraries-DR; 4. Data Sets (Small, Medium, Large); Local Storage; “The Cloud”; “Pointers”; Disciplinary Repositories, SC Centers, etc.]

Page 17

Architecture

[Diagram labels: Primary System (The Digital Repository); Preservation System (LOCKSS); High-speed Networks]

Page 18

CSU Storage Project

45 TBytes (raw) for ~$8k

Page 19

Strategy for Storage of Data Sets
• Small, < 100 GB, we would agree to store on the DR, but not forever
• Medium, we would agree to store on the DR for a limited time at a cost, or on a local server somewhere and we point to it
• Large, stored on a disciplinary DR somewhere, at a supercomputer center, or at a large instrument center
  • We point to it (persistent URL?)
• How do we deal with exceptions?
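The size-based strategy can be sketched as a simple classification step. The names below are illustrative. Only the small cut-off (< 100 GB) comes from this slide; the 10 TB boundary between medium and large is an assumption borrowed from the deck's cost table.

```python
from enum import Enum


class SizeClass(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


# < 100 GB (0.1 TB) is from the slide; the 10 TB upper bound for "medium"
# is an assumption taken from the cost table, not stated on this slide.
SMALL_LIMIT_TB = 0.1
MEDIUM_LIMIT_TB = 10.0

PLACEMENT = {
    SizeClass.SMALL: "store on the DR (not forever)",
    SizeClass.MEDIUM: "store on the DR for a limited time at a cost, "
                      "or point to a local server",
    SizeClass.LARGE: "point to a disciplinary DR, supercomputer center, "
                     "or large instrument center",
}


def classify(size_tb: float) -> SizeClass:
    """Map a data set size in TBytes to one of the three size classes."""
    if size_tb < SMALL_LIMIT_TB:
        return SizeClass.SMALL
    if size_tb <= MEDIUM_LIMIT_TB:
        return SizeClass.MEDIUM
    return SizeClass.LARGE


for size_tb in (0.05, 2.0, 40.0):
    size_class = classify(size_tb)
    print(f"{size_tb} TB -> {size_class.value}: {PLACEMENT[size_class]}")
```

Exceptions, which the slide leaves open, would need a manual override on top of this rule.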

Page 20

What CSUL will Store & at What Cost

PRESERVATION PERIOD    SMALL        MEDIUM        LARGE
                       (< 0.1 TB)   (0.1-10 TB)   (> 10 TB)
Short (1 yr. +)        Free         Free          Maybe +
Medium (2 yrs. +)      Free         $500/TB       Maybe -
Long (> 5 yrs. +)      $1,000/TB    $1,000/TB     No

(+ means beyond end of grant period)

Forever is a long time…
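The cost table lends itself to a small lookup. The sketch below transcribes the table's cells; the function name and the string keys are illustrative. It assumes each per-TByte price is a single charge, since the slide does not say whether the rate is one-time or recurring.

```python
from typing import Optional

# Price per TByte by (preservation period, size class), transcribed from the table.
# None marks cells the table leaves open ("Maybe") or rules out ("No").
PRICE_PER_TB = {
    ("short", "small"): 0, ("short", "medium"): 0, ("short", "large"): None,
    ("medium", "small"): 0, ("medium", "medium"): 500, ("medium", "large"): None,
    ("long", "small"): 1000, ("long", "medium"): 1000, ("long", "large"): None,
}


def quote(period: str, size_class: str, size_tb: float) -> Optional[float]:
    """Return the charge in dollars, or None when the table gives no price.

    Assumption: the per-TByte rate is applied once to the whole data set.
    """
    rate = PRICE_PER_TB[(period, size_class)]
    return None if rate is None else rate * size_tb


print(quote("medium", "medium", 4))    # 4 TB kept 2 yrs. + past the grant
print(quote("long", "small", 0.05))    # 50 GB kept > 5 yrs. +
print(quote("short", "large", 20))     # no fixed price in the table
```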

Page 21

Needs
• Libraries-IT partnership
• Define policies for usage
• Define practice for usage
  • Definition of workflows
  • Operations
• Develop needed tools
  • Build an on-line, self-service submission tool + requirements for review of user-created metadata
• Establish systems
  • Develop preservation infrastructure

Page 22

Issues
• Will the DR become a ‘Trusted Digital Repository?’
  • Will this enhance our proposals?
• What will be stored where?
  • Will disciplinary digital repositories emerge, e.g. at NCAR and elsewhere?
  • Flexibility is key
• How best to engage
  • The VPR (probably already accomplished)
  • The faculty
  • Library staff: faculty and operational (DM Librarians at UNM?)

Page 23

Discussion

Is most welcome.