Open Dialogue, Data Mgmt. 1
OPEN DIALOGUE ON DIGITAL DATA
MANAGEMENT
Pat Burns, Dean
Dawn Paschal, Assistant Dean
CSU Libraries
October 13, 2010
Background
• NSF requires proposals submitted as of Jan. 18, 2011 to include plans for data management: http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j
• NIH & USDA also have similar requirements
• Other agencies looming: ‘Federal Research Public Access Act’
• Maximizing the value of data by sharing
  • Discoverability
  • Access
  • Preservation
  • Management
Science ‘Then’ (5-10 years ago)
[Diagram: Theory, Computation, Experiment]
Science ‘Now’
[Diagram: Theory, Computation, Experiment, plus many Data elements]
Science ‘Emerging’
[Diagram: Theory, Experiment, Data, Computation]
Science 2.0 ‘Now’?
[Diagram: Theory, Experiment, Data, Computation, plus many Data elements]
Large Digital Data Sets
• Satellite imagery can generate > 1 petabyte (10^15 bytes) of data per day!
• Supercomputers also generate massive data sets
• Can we transport them? E.g., at 10 Gbits per second (note bits, not bytes: 1 byte = 8 bits)
  • Time = 8x10^15 bits/(10^10 bits/sec) = 8x10^5 secs = 222 hours = 1 week, 2 days, 6 hours, 13 mins
• Can we store them?
  • Requires 500 ea. 2 TByte disks @ $250 ea. = $125,000; @ 5 year lifetime = $25,000/yr.
  • Requires 1 full rack in a data center: space, power, cooling, …
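
The transport and storage estimates on this slide can be checked with a short calculation. The sketch below is only a back-of-the-envelope version: it assumes an ideal link with no protocol overhead and disks with no redundancy, and the function names are illustrative.

```python
# Inputs taken from the slide; everything else is derived.
PETABYTE_BYTES = 10**15
LINK_BITS_PER_SEC = 10 * 10**9   # 10 Gbits per second
DISK_TBYTES = 2
DISK_COST_USD = 250
DISK_LIFETIME_YEARS = 5


def transfer_seconds(n_bytes: int, bits_per_sec: int) -> float:
    """Ideal transfer time; ignores protocol overhead (an assumption)."""
    return n_bytes * 8 / bits_per_sec


def split_duration(seconds: float) -> tuple[int, int, int, int]:
    """Break a duration into (weeks, days, hours, minutes), dropping leftover seconds."""
    minutes = int(seconds // 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    weeks, days = divmod(days, 7)
    return weeks, days, hours, minutes


def disks_needed(n_bytes: int, disk_tbytes: int) -> int:
    """Disk count with no redundancy (an assumption), rounded up."""
    disk_bytes = disk_tbytes * 10**12
    return -(-n_bytes // disk_bytes)


secs = transfer_seconds(PETABYTE_BYTES, LINK_BITS_PER_SEC)
n_disks = disks_needed(PETABYTE_BYTES, DISK_TBYTES)
total_cost = n_disks * DISK_COST_USD
cost_per_year = total_cost / DISK_LIFETIME_YEARS

print("Transfer time (weeks, days, hours, mins):", split_duration(secs))
print("Disks:", n_disks, "Cost: $", total_cost, "Per year: $", cost_per_year)
```

Changing `LINK_BITS_PER_SEC` or `DISK_TBYTES` shows how sensitive the estimates are to link speed and disk size.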
Incoming!
• An individual researcher can generate many data sets
• We have many researchers who generate large data sets
  • Number: Many x many = Very many!
  • Size: Very many x Very big = Enormous!
Projected Needs (2009 CSU Survey)
Research Data Storage Needed (TBytes)
Now        219
2 Years    1,384
5 Years    4,914
CSU-DR = 3 TBytes!!!
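
To put the survey numbers in perspective, the implied growth can be worked out directly. This is a rough sketch that assumes constant compound growth between the survey points, which the survey itself does not claim; the function name is illustrative.

```python
# Survey projections (TBytes), keyed by years from "now".
PROJECTED_TBYTES = {0: 219, 2: 1384, 5: 4914}
CSU_DR_TBYTES = 3  # size of the CSU-DR, per the slide


def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Constant yearly multiplier that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years)


now = PROJECTED_TBYTES[0]
for years, tbytes in PROJECTED_TBYTES.items():
    if years == 0:
        continue
    factor = annual_growth_factor(now, tbytes, years)
    print(f"{years} years: {tbytes / now:.1f}x total, about {factor:.2f}x per year")

print(f"Need today vs. CSU-DR: {now / CSU_DR_TBYTES:.0f}x")
```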
How Can We Help?
Libraries (Data/Info Stewards)
• The ‘front end’
• Interactions w/ researchers
• Data organization & structure
• IP issues
• Metadata
• Discoverability
• Preservation
IT (System Stewards)
• The ‘back end’
• Storage capacity
• Transport capacity
• Back-up
• Sysadmin
• IT security/privacy
• Transcoding
Joint Operations
How Can We Help (cont’d)?
• Agreement upon a framework
• Draft of a framework, present to faculty
• Language for our faculty to include in their proposals
• Strategy, policy, procedures
• Definition of work flow(s)
• Architectures for operations & preservation
  • Back-up vs. preservation, LOCKSS?
Policies: The ‘Front End’
• DRM: IP/ownership issues: data sets not ‘copyrightable’ (not creative works)
  • But there may be local, institutional IP policies that override this
  • Note that IP ≠ copyright
  • Creative Commons or Science Commons licensing may apply
• An embargo period is required
• What are the preservation periods?
NSB Data Type Definitions*
• Research collections (small, useful to individuals/teams for life of a project, limited curation, standards typically lacking)
• Resource collections (medium, useful to a community, follow group’s standards, mid- to long-term utility)
• Reference collections (large, serve many segments of science/engineering, conform to robust standards, indefinite support)
*National Science Board
Workflow
• Faculty provide data/information
  • Enter metadata, user’s manuals, select embargo period, select licensing options, enter pubs, point to or supply data sets, …
• Librarians manage data/information
  • Review metadata, ingest and make accessible, review periodically, deaccession periodically (annually?), manage data, interact w/ faculty
• IT staff implement and operate systems
  • Operate system, backups, security, upgrading storage, transport, move to LOCKSS, etc.
Digital Assets - the 4 Pieces
• The Metadata, ideally on the CSU-DR
1. Typical, what we collect today, e.g. lightweight metadata (probably not copyrightable)
2. Contextual, e.g., user’s manuals (yes, copyrightable)
3. Scholarly publications associated with the data – ideally on the CSU-DR
4. The data itself – should be in the most appropriate place (pointers?)
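
One way to picture the four pieces is as a single record per deposited data set. The sketch below is purely illustrative: the class, field names, and example URL are hypothetical and do not describe an actual CSU-DR schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DigitalAsset:
    """Hypothetical record grouping the four pieces of one digital asset."""
    # Piece 1: typical, lightweight metadata
    typical_metadata: dict[str, str]
    # Piece 2: contextual material, e.g. user's manuals
    contextual_docs: list[str] = field(default_factory=list)
    # Piece 3: scholarly publications associated with the data
    publications: list[str] = field(default_factory=list)
    # Piece 4: pointer to wherever the data itself lives
    data_pointer: Optional[str] = None

    def missing_pieces(self) -> list[str]:
        """Names of the pieces that have not been supplied yet."""
        checks = {
            "typical metadata": bool(self.typical_metadata),
            "contextual documents": bool(self.contextual_docs),
            "publications": bool(self.publications),
            "data pointer": self.data_pointer is not None,
        }
        return [name for name, present in checks.items() if not present]


# Placeholder values for illustration only.
asset = DigitalAsset(
    typical_metadata={"title": "Example data set"},
    data_pointer="https://example.org/datasets/123",
)
print(asset.missing_pieces())
```

Keeping the data as a pointer rather than embedded content matches the idea that the data should live in the most appropriate place.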
Digital Assets Management
[Diagram: 1. Metadata, 2. User’s Manuals, 3. Pubs, and 4. Data Sets (Small, Medium, Large), placed across the Libraries DR, Local Storage, “The Cloud”, and Disciplinary Repositories, SC Centers, etc., linked by “Pointers”]
Architecture
[Diagram: Primary System (The Digital Repository), Preservation System (LOCKSS), High-speed Networks]
CSU Storage Project
45 TBytes (raw) for ~$8k
Strategy for Storage of Data Sets
• Small, < 100 GB: we would agree to store on the DR, but not forever
• Medium: we would agree to store on the DR for a limited time at a cost, or on a local server somewhere and we point to it
• Large: stored on a disciplinary DR somewhere, at a supercomputer center, or at a large instrument center
  • We point to it (persistent URL?)
• How do we deal with exceptions?
What CSUL will Store & at What Cost
(+ means beyond end of grant period)

                       SIZE
PRESERVATION PERIOD    SMALL (< 0.1 TB)    MEDIUM (0.1-10 TB)    LARGE (> 10 TB)
Short (1 yr. +)        Free                Free                  Maybe +
Medium (2 yrs. +)      Free                $500/TB               Maybe -
Long (> 5 yrs. +)      $1,000/TB           $1,000/TB             No

Forever is a long time…..
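
The cost table can be read as a simple lookup by size class and preservation period. The sketch below is one possible reading, not a stated policy: it assumes SMALL means under 0.1 TB, that the per-TB rates are a single charge of rate times size, and it keeps the non-price entries ("Maybe +", "Maybe -", "No") verbatim. The function names are illustrative.

```python
from typing import Union

# (preservation period, size class) -> US dollars per TB, or a non-price entry.
COST_TABLE = {
    ("short", "small"): 0, ("short", "medium"): 0, ("short", "large"): "Maybe +",
    ("medium", "small"): 0, ("medium", "medium"): 500, ("medium", "large"): "Maybe -",
    ("long", "small"): 1000, ("long", "medium"): 1000, ("long", "large"): "No",
}


def size_class(tbytes: float) -> str:
    """Map a data set size in TB to the table's size classes."""
    if tbytes < 0.1:
        return "small"
    if tbytes <= 10:
        return "medium"
    return "large"


def storage_cost(tbytes: float, period: str) -> Union[float, str]:
    """Dollar cost if the table gives a rate, otherwise the table's text entry."""
    entry = COST_TABLE[(period, size_class(tbytes))]
    if isinstance(entry, str):
        return entry
    return entry * tbytes


print(storage_cost(2, "medium"))   # 2 TB kept for the medium period
print(storage_cost(5, "long"))     # 5 TB kept for the long period
print(storage_cost(50, "long"))    # large data set, long period
```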
Needs
• Libraries-IT partnership
• Define policies for usage
• Define practice for usage
  • Definition of workflows
  • Operations
• Develop needed tools
  • Build an on-line, self-service submission tool + requirements for review of user-created metadata
• Establish systems
• Develop preservation infrastructure
Issues
• Will the DR become a ‘Trusted Digital Repository?’
  • Will this enhance our proposals?
• What will be stored where?
  • Will disciplinary digital repositories emerge, e.g. at NCAR and elsewhere?
  • Flexibility is key
• How best to engage
  • The VPR (probably already accomplished)
  • The faculty
  • Library staff: faculty and operational (DM Librarians at UNM?)
Discussion
Is most welcome.