Storing Data “Forever” Funding Long-Term Preservation of Research Data

Page 1: Storing Data “Forever”

Storing Data “Forever”

Funding Long-Term Preservation of Research Data

Page 2: Storing Data “Forever”

Special Thanks To

• MacKenzie Smith, MIT Libraries
• “Managing Research Data 101”
• https://libshare.library.gatech.edu/clearspace/docs/DOC-3634.pdf;jsessionid=DF96E09B9D6BE9E5EC62A27717DC5868

Page 3: Storing Data “Forever”

What is Data?

• Numbers?
  – Recorded? Collected? Generated?
• Images? Video? Audio?
  – Shoah
  – In what format?
• Code?
• Publications/Text?
  – In what format?
• Transcription service
• Is pure “raw” data useful?
  – May require extensive metadata to be useful

Page 4: Storing Data “Forever”

What is “Forever”?

• Longer than a typical project?
• Longer than a typical career?
• Longer than a typical institution?
• 5 years, 10 years, 25 years, 100 years?
• Suggestion: treat data the same way a library treats books
• Intent is to preserve indefinitely
• As long as practical, feasible
• Cannot be precisely defined

Page 5: Storing Data “Forever”

Why Save Data “Forever”

• Because we have to:
  – Funding agencies want data “sharing” plans
  – NIH Data Sharing Policy (2003): http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html
    “all investigator-initiated applications with direct costs greater than $500,000 in any single year will be expected to address data sharing in their application.”

Page 6: Storing Data “Forever”

NIH Data Sharing Policy

• “Applicants may request funds for data sharing and archiving. The financial issues should be addressed in the budget section of the application.”

• Specifics depend on grant, published in RFP, RFA or PA

Page 7: Storing Data “Forever”

NSF Data Archiving Policy

• Division of Social and Economic Sciences

• http://www.nsf.gov/sbe/ses/common/archive.jsp

• “Grantees from all fields will develop and submit specific plans to share materials collected with NSF support, except where this is inappropriate or impossible.”

Page 8: Storing Data “Forever”

NSF Data Archiving

• From Grant Proposal Guide
• NSF “expects PIs to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work.”

• Specifics depend on grant and program officer

Page 9: Storing Data “Forever”

NSF Data Sharing Policy

• Hot off the presses: Science Insider, May 5, reports: “Edward Seidel, acting head of NSF's mathematics and physical sciences directorate, described NSF's intention to require all applicants to submit a data management plan along with their grant application in a presentation this morning to the National Science Board, NSF's oversight body. … NSF's current policy requires grantees to share their data within a reasonable length of time so long as the cost is modest. "That's nice, but it doesn't have much teeth," said Seidel. Under the new policy, which is expected to be unveiled this fall, a researcher would submit a data management plan as a two-page supplement to any regular grant proposal. That would make it an element of the merit review process.”

Page 10: Storing Data “Forever”

Other agency Policies

• See Gary King’s Page on “Data Sharing and Replication”

• http://gking.harvard.edu/replication.shtml

• See National Academy of Sciences “Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age”, July, 2009

• http://www.nap.edu/catalog/12615.html

Page 11: Storing Data “Forever”

Why Save Data “Forever”

• Because we want to:
  – Available to ourselves and our students and colleagues
    • Where are the data sitting today? On a departmental server? On a computer under your desk? On a CD or DVD somewhere?
    • Where is your dissertation data?
  – Available to future scholars, including ourselves

Page 12: Storing Data “Forever”

Why Save Data “Forever”

• Because we need to:
  – Encourage honesty?
    • Gregor Mendel probably cheated
  – Like open-source: help uncover mistakes, bugs?
  – Open Data Movement
    • Mostly library/catalog data, map data, WordNet
  – Open Access Movement
    • Mostly publications
• Because it’s not “our” data

Page 13: Storing Data “Forever”

Current Storage Models

• Let someone else do it
  – Government agency/lab/bureau
    • NOAA National Geophysical Data Center
    • GenBank (DNA data)
    • fMRIDC (fMRI publications and data)
    • NCSA Astronomy Digital Image Library

Page 14: Storing Data “Forever”

Current Storage Models

  – Professional society/Journals
    • Global Ocean Observing System: coordinates distributed data
    • Dryad: ecology/evolutionary biology
  – Nice folks at another University
    • ICPSR, University of Michigan (political/social)
    • Dryad: ecology/evolutionary biology
    • Protein Data Bank (PDB): 3-D protein data
    • NCSA Astronomical Image Library
    • Sloan Digital Sky Survey
  – The “Cloud”

Page 15: Storing Data “Forever”

Digital preservation/curation timeline

• 2000: Library of Congress: $100M for National Digital Information Infrastructure and Preservation Program (NDIIPP)
• 2004: UK Digital Curation Centre (DCC)
• 2004: NDIIPP gives $14M to 8 partners
• 2007: Blue Ribbon Task Force on Sustainable Digital Preservation and Access

Page 16: Storing Data “Forever”

Digital preservation/curation timeline (2)

• 2007: NSF Office of Cyberinfrastructure (OCI) Sustainable Digital Data Preservation and Access Network Partners (DataNet) solicitation
• 2009: First 2 DataNet awards

Page 17: Storing Data “Forever”

Conferences and groups

• Preservation and Archiving Special Interest Group (PASIG)

• International Conference on Preservation of Digital Objects (iPRES)

• Open Repositories (OR)

Page 18: Storing Data “Forever”

Current Funding Models

• Institution/department pays
• Grants pay monthly/yearly
• Haphazard
  – Some grant money
  – Some departmental money
  – Use whatever is available
  – Don’t worry, someone will pay

Page 19: Storing Data “Forever”

What are we Doing? Survey says …

13. Long-term (preservation) storage of research data:

#   Answer                          Response   %
1   No                              3          16%
2   Yes, centrally run              11         58%
3   Yes, departmentally run         9          47%
4   Yes, run otherwise (specify)    3          16%

Page 20: Storing Data “Forever”

14. Are your centrally run long-term data storage/preservation systems:

#   Answer                        Response   %
1   Funded by charge back         3          27%
2   Funded centrally              10         91%
3   Funded otherwise (specify)    4          36%

Page 21: Storing Data “Forever”

14. Are your centrally run long-term data storage/preservation systems:

Funded otherwise (specify)

• grant-funded
• central and faculty. There is uncertainty on this front.
• also through the condo-style central cluster system
• grants

Page 22: Storing Data “Forever”

15. Are your departmentally run long-term data storage/preservation systems:

#   Answer                        Response   %
1   Funded by charge back         3          33%
2   Funded departmentally         8          89%
3   Funded otherwise (specify)    3          33%

Page 23: Storing Data “Forever”

Current Funding Models

• Most require some form of on-going payment

• Advantages
  – Capitalist approach to data storage
  – If someone wants to pay, data gets saved
  – “Natural” expiration process
• Disadvantages
  – Capitalist approach to data storage
  – Who pays to save rarely used data?

Page 24: Storing Data “Forever”

Different Approach

PAY ONCE, STORE ENDLESSLY (POSE)

Why Pay Once?
• Grants expire often and quickly
• Researchers expire pretty often

How Store Forever?
• Administrators expire slowly
• Institutions expire rarely

Page 25: Storing Data “Forever”

The Business Model (1)

• I = initial cost of storage
• d = rate at which storage costs decrease yearly, expressed as a fraction (e.g., 20% would be 0.2)
• r = how often, in years, storage is replaced
• T = cost to store the data “forever”

\[ T = I + (1-d)^{r} I + (1-d)^{2r} I + \cdots \]

If d = 20%, r = 4:

\[ T = I + (0.8^{4}) I + (0.8^{8}) I + \cdots \]
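The series can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `partial_cost` and its `replacements` parameter are illustrative assumptions. It adds up the initial purchase plus a given number of later replacements, each one cheaper by a factor of (1-d)^r.

```python
def partial_cost(initial_cost, d, r, replacements):
    """Cost of the initial purchase plus `replacements` later purchases.

    d: yearly fractional decrease in storage cost (0.2 means 20%)
    r: years between storage replacements
    (Hypothetical helper for illustration only.)
    """
    ratio = (1 - d) ** r  # cost of each purchase relative to the previous one
    return sum(initial_cost * ratio ** k for k in range(replacements + 1))


# With d = 20% and r = 4 the running total levels off quickly.
for n in (0, 1, 2, 5, 10, 25):
    print(n, round(partial_cost(1.0, 0.2, 4, n), 4))
```

Each later term shrinks by the same factor, so the running total flattens out after only a handful of replacement cycles.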

Page 26: Storing Data “Forever”

The Business Model (2)

If d > 0,

\[ T = I + (1-d)^{r} I + (1-d)^{2r} I + \cdots = \frac{I}{1-(1-d)^{r}} \]

For d = 20%, r = 4: \( T = I / (1 - 0.8^{4}) \approx 1.69\,I \), rounded up to T = I * 2

Charge 2x initial storage cost, save half, store forever!

Because this will result in a “surge” in demand for long-term data storage.

The “Serge” Equation

Patent Pending

$0.01/gigabyte
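The closed form follows from the standard geometric series. In the sketch below, the shorthand q is an assumption introduced for illustration and does not appear in the slides.

```latex
% Let q = (1-d)^r. For 0 < d <= 1 we have 0 <= q < 1, so the series converges.
\begin{align*}
T       &= I + qI + q^{2}I + \cdots \\
qT      &= \phantom{I + {}} qI + q^{2}I + \cdots \\
T - qT  &= I \\
T       &= \frac{I}{1-q} = \frac{I}{1-(1-d)^{r}}
\end{align*}
% Example: d = 0.2, r = 4 gives q = 0.8^4 = 0.4096 and T = I / 0.5904, about 1.69 I.
```

Charging 2x the initial cost therefore leaves a margin above the roughly 1.69x that the formula requires under these assumptions.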

Page 27: Storing Data “Forever”

An Example: DataSpace at Princeton

•FC costs decrease by about 16% per year

•SATA costs decrease by about 17% per year

•Additional savings every few years from new storage

Page 28: Storing Data “Forever”

The “Serge” for DataSpace

• SATA cost = $1.81/GB
• Replace every four years
• Costs decrease by 20% per year

“Serge” = 1.81 / (1 - 0.8**4) ≈ $3/GB

Adding tape backup raises this to $5/GB

$5K one-time to store a terabyte forever
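The DataSpace figures can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name `pay_once_price` is hypothetical, the $5/GB figure with tape backup is taken directly from the slide and not derived, and a terabyte is assumed to be 1000 GB to match the slide's $5K.

```python
def pay_once_price(initial_cost, d, r):
    """One-time price covering storage 'forever': I / (1 - (1-d)^r).

    Hypothetical helper; d is the yearly fractional cost decrease,
    r is the number of years between storage replacements.
    """
    ratio = (1 - d) ** r
    if not 0 <= ratio < 1:
        raise ValueError("need 0 < d <= 1 so the series converges")
    return initial_cost / (1 - ratio)


# Slide inputs: SATA at $1.81/GB, replaced every 4 years, costs falling 20% per year.
disk_only_per_gb = pay_once_price(1.81, 0.20, 4)   # about 3.07, quoted as $3/GB

# Figure quoted on the slide once tape backup is added (not derived here).
with_tape_per_gb = 5.00
per_terabyte = with_tape_per_gb * 1000             # assuming 1 TB = 1000 GB

print(f"disk only: ${disk_only_per_gb:.2f}/GB")
print(f"with tape: ${per_terabyte:,.0f} per TB, one time")
```

The result is sensitive to d. Rerunning with the 16% or 17% yearly decreases quoted for FC and SATA gives a somewhat higher one-time price than the 20% case.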