
31 March 2012
Literature Review #5
Jewel H. Ward

Managing Data: the Data Deluge and the Implications for Data Stewardship, v. final

CITATION: Ward, J.H. (2012). Managing Data: the Data Deluge and the Implications for Data Stewardship. Unpublished Manuscript, University of North Carolina at Chapel Hill. Creative Commons License: Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0).


ABSTRACT

Preservation standards for repositories do not exist in a void. They were created to address a particular issue, which is the long-term preservation of digital objects. Preservation repository and policy standards are designed to address long-term digital storage (i.e., digital curation and preservation) by defining “the what” (preservation repository design) and “the how” (preservation policies). This essay focuses primarily on the research data deluge and the implications for the long-term stewardship of data. The conclusion is that researchers want to focus on creating and analyzing data. Some researchers care about the long-term stewardship of their data, while others do not. Effective data stewardship requires not just technical and standards-based solutions, but also human, financial, and managerial solutions. It remains to be seen whether or not funders' requirements for data sharing will impact how much data is actually made available for re-purposing, re-use, and preservation.


TABLE OF CONTENTS

ABSTRACT

INTRODUCTION

DEFINITIONS
    Data, Metadata, and Ontologies
    Types of Data Collections
    Types of Research Data
    The Research Data Deluge: What and How Big Is It?

PRIVACY VERSUS BIG DATA

WHY SHARE AND PRESERVE DATA?

INFRASTRUCTURE AND DATA CENTERS

ROLES AND RESPONSIBILITIES

SUSTAINABILITY

RESEARCH DATA CURATION
    Data Curation vs. Digital Curation
    The Research Data Lifecycle
    Data Repositories
    Funders' Requirements and Guidance
        The National Institutes of Health
        The National Science Foundation

THE APPLICATION OF POLICIES TO REPOSITORIES AND DATA
    The Automation of Preservation Management Policies
    The Application of Policies to Data and Data Curation

SUMMARY: THE IMPLICATIONS FOR THE LONG-TERM STEWARDSHIP OF RESEARCH DATA

REFERENCES

APPENDIX A


TABLE OF FIGURES

Figure 1 - The National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Data Processing Levels (National Aeronautics and Space Administration, 2010; Ball, 2010).
Figure 2 - Space Science Board Committee on Data Management and Computation (CODMAC) Space Science Data Levels and Types (Ball, 2010).
Figure 3 - Big Data, MGI's estimate of size (Manyika, et al., 2011).
Figure 4 - LIFE (Life Cycle Information for E-Literature) Project (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).
Figure 5 - I2S2 Idealized Scientific Research Activity Lifecycle Model (Ball, 2012).
Figure 6 - Entities by Role (Interagency Working Group on Digital Data, 2009).
Figure 7 - Entities by Role (Interagency Working Group on Digital Data, 2009).
Figure 8 - Entities by Role (Interagency Working Group on Digital Data, 2009).
Figure 9 - Entities by Individuals (Interagency Working Group on Digital Data, 2009).
Figure 10 - Entities by Sector with footnotes (Interagency Working Group on Digital Data, 2009).
Figure 11 - Individuals by Role (Interagency Working Group on Digital Data, 2009).
Figure 12 - Individuals by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).
Figure 13 - Entities by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).


INTRODUCTION

Preservation standards for repositories do not exist in a void. They were created

to address a particular issue, which is the long-term preservation of digital objects, i.e.,

"data". Waters & Garrett (1996) wrote that these standards must be created in order for

archives to demonstrate that "they are what they say they are" and that they can "meet or

exceed the standards and criteria of an independently-administered program".

Preservation repository and policy standards are designed to address long-term digital

storage (i.e., digital curation and preservation) by defining “the what” (preservation

repository design) and “the how” (preservation policies). The next step is to examine

what types of data are being curated and preserved, put into an “OAIS Reference Model

inside” repository and managed with Audit and Certification of Trustworthy Digital

Repositories Recommended Practices, as well as to examine any related issues and

factors.

Hey and Trefethen (2003) defined the data deluge with an examination of

eScience. The authors called for "new" types of digital libraries for science data that

would provide data-specific services and management. While the data deluge cuts

across all sectors (Manyika, et al., 2011; Science and Technology Council, 2007), this

essay focuses primarily on the research data deluge. It defines research data, the types

of research data and collections; attempts to determine how much data exists; and,

examines “big data” versus privacy. It also describes the reasons researchers do and do

not share their data, the role of data curators, and, provides an overview of

infrastructure. Finally, this literature review describes research data curation; examines


example applications of general data management policies to repositories and to the

data itself; and, discusses the implications for the long-term stewardship of research

data based on the literature reviewed.

DEFINITIONS

What does it mean to "steward" data? The editors of Merriam-Webster (2012)

defined stewardship as, "the conducting, supervising, or managing of something;

especially : the careful and responsible management of something entrusted to one's

care <stewardship of natural resources>". The authors of ForestInfo.org (2012) wrote

that stewardship is, "the concept of responsible caretaking; the concept is based on the

premise that we do not own resources, but are managers of resources and are

responsible to future generations for their condition". Therefore, one may extrapolate

that "data stewardship" is the "careful and responsible management of something

entrusted to one's care" so that future generations may access the data with full

confidence that the data is what the provider says it is.

How does data stewardship differ from digital curation and digital preservation?

Lazorchak (2011) wrote that he has used the terms interchangeably, but they are really

three different processes. The detailed definitions for digital curation and digital

preservation are available in the previous section, "Managing Data: the Emergence &

Development of Digital Curation & Digital Preservation Standards". However, in short,

digital curation addresses the whole life cycle of digital preservation. Lazorchak (2011)

stated that the concept of "digital stewardship…brings preservation and curation


together…pulling in the lifecycle approach of curation along with research in digital

libraries and electronic records archiving, broadening the emphasis from the e-science

community on scientific data to address all digital materials, while continuing to

emphasize digital preservation as a core component of action".

Thus, one might say that digital preservation is the "what"; digital curation is the

"how" for preserving the data; and digital or data stewardship is the "why" (to manage

entrusted resources for future generations). Lynch (2008) wrote that the best data

stewardship "will come from disciplinary engagement with preservation institutions". That

is, if scientists wish to manage their data so that it will be accessible for the indefinite

long-term, then they will need to work with librarians, archivists, computer scientists,

domain specialists, and other information professionals whose expertise lies in the

curation and preservation of data.

Data, Metadata, and Ontologies

What are data, metadata, and ontologies in the context of science research

data? The National Science Foundation Cyberinfrastructure Council (2007) defined

these terms. They wrote that "data are any and all complex data entities from

observations, experiments, simulations, models, and higher order assemblies, along with

the associated documentation needed to describe and interpret the data". Next, the

authors described metadata as a subset of, and about, data. They wrote that "metadata

summarize data content, context, structure, interrelationships, and provenance…. They

add relevance and purpose to data, and enable the identification of similar data in

different data collections” (National Science Foundation Cyberinfrastructure Council,


2007). Finally, the council members defined ontology as "the systematic description of a

given phenomenon. It often includes a controlled vocabulary and relationships, captures

nuances in meaning and enables knowledge sharing and reuse" (National Science

Foundation Cyberinfrastructure Council, 2007).
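
To make these categories concrete, a hypothetical metadata record for a single observational data file might look like the sketch below. The field names and values are invented here for illustration and are not drawn from the NSF documents; they simply map onto the content, context, structure, interrelationship, and provenance elements quoted above.

    # Hypothetical metadata record for one observational data file; all names
    # and values are invented to illustrate the NSF categories quoted above.
    metadata_record = {
        "content": "hourly sea-surface temperature readings, calendar year 2011",
        "context": {"instrument": "moored buoy (example)", "project": "example field campaign"},
        "structure": {"format": "CSV", "columns": ["timestamp_utc", "temperature_c"]},
        "interrelationships": ["calibration_log_v2.txt", "companion cruise report"],
        "provenance": {"creator": "example research group", "derived_from": "raw sensor output"},
    }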

Employees of the U.S. Office of Management and Budget defined research data

as, "the recorded factual material commonly accepted in the scientific community as

necessary to validate research findings, but not any of the following: preliminary

analyses, drafts of scientific papers, plans for future research, peer reviews, or

communications with colleagues" (National Science Board, 2011). As part of this

definition, the authors also included metadata and the analyzed data. The former may

include computational codes, apparatuses, input conditions, and so forth, while the latter

may include published tables, digital images, and tables of numbers from which graphs

and charts may be generated, among others. Furthermore, they differentiated "digital

research data" from research data by including a separate definition. They wrote that

digital research data is "any digital data, as well as the methods and techniques used in

the creation and analysis of that data, that a researcher needs to verify results or extend

scientific conclusions, including digital data associated with non-digital information, such

as the metadata associated with physical samples" (National Science Board, 2011).

Last, the members of the National Science and Technology Council Interagency

Working Group on Digital Data (2009) wrote that:

“digital scientific data” refers to born digital and digitized data produced by, in the custody of, or controlled by federal agencies, or as a result of research funded by those agencies, that are appropriate for use or repurposing for scientific or


technical research and educational applications when used under conditions of proper protection and authorization and in accordance with all applicable legal and regulatory requirements. It refers to the full range of data types and formats relevant to all aspects of science and engineering research and education in local, regional, national, and global contexts with the corresponding breadth of potential scientific applications and uses (National Science and Technology Council Interagency Working Group on Digital Data, 2009).

Thus, while there is some variation between the definitions of research data, the

general consensus is that it consists of the items or objects that scientists analyze,

create, and use in the process of conducting research.

Types of Data Collections

When data are organized, they become collections. The National Science

Foundation (2005) and the National Science Foundation Cyberinfrastructure Council

(2007) defined three types of data collections: research, resource, and reference

collections. The authors of the 2005 National Science Foundation report chose to refer not to databases but to collections, because they wanted to include the

individuals, infrastructure, and organizations indispensable to the management of the

collection. Thus, the board members wrote that data collections fall under one of the

three functional categories mentioned previously.

• Research Data Collections: these collections are created for a limited group, supported by a small budget, as part of one or more focused research projects, and may vary in size. The researchers who collect the data do not intend to preserve, curate or process it, although this is often due to lack of funding. They may apply rudimentary standards for metadata structure, file formats, or content access policies. Often, there are no standards because the community-of-interest is very small. Some recent examples of these types of collections include Fluxes Over Snow Surfaces (FLOSS) and the Ares Lab Yeast Intron database.


• Resource/Community Data Collections: these types of data collections are created and maintained to serve an engineering or science community. The budgets to maintain the collection(s) are provided directly by agency funding and are generally intermediate in size. This funding model can make it challenging to gauge how long the collection will be available, due to changes in budget priorities. However, the community does tend to apply standards for the maintenance of the collection, either by developing community standards or re-purposing existing standards. Two examples of these types of collections include The Canopy Database Project and the PlasmoDB.

• Reference Data Collections: Characteristic features of these types of

collections are a diverse set of user communities that represent large segments of the education, research, and scientific community. Users of these data sets include students, educators, and scientists across a variety of institutional, geographical, and disciplinary domains. The managers of these data collections tend to follow or create comprehensive, well-established standards. The creators, users, and managers of these data collections intend to make them available for the indefinite long-term, and budgetary support tends to come from multiple sources over the long-term. The examples for these types of data collections include The Protein Data Bank, SIMBAD, and the National Space Science Data Center (NSSDC) (The National Science Foundation, 2005; National Science Foundation Cyberinfrastructure Council, 2007).

The type of data collection does not necessarily indicate its long-term value to

future researchers, but the collection type does increase the odds of the collection being

usable and accessible within one or more generations. A small, under-funded, poorly

documented research data collection may prove to be of great value to a future

researcher or researchers who can figure out what the data is and how to access it,

while a large, well-funded, and well-documented data collection may have no users after

the original research study closes.

Types of Research Data


The types of data researchers have created fall into three primary categories

used for one or more processes: structured, unstructured, or semistructured [sic].

Members of the National Research Council (2010) described structured data as rigidly

formatted, while unstructured data consists of text. They provided examples of semi-

structured data as consisting of personnel data, want ads, and so forth. The data

researchers have generated in one of these categories may be created by a variety of

processes that generally fall into one of three areas: scientific experiments, models or

simulations, and observations.

The data generated from a scientific experiment is intended to be reproducible, at

least in theory. Researchers often do not have the time and funding to reproduce many

experiments (Lynch, 2008). With regards to model or simulation data, researchers have

preferred to retain the model and related metadata rather than the outputted data.

Scientists have considered observational data to be irreplaceable, as it is usually the

result of data gathering at a specific location and time that may not be reproducible.

They have gathered raw data in the course of observations and or experiments, while

derived data results from combining or processing raw data (Research Information

Network, 2008).

The National Aeronautics and Space Administration (NASA), Earth Observing

System (EOS), developed a set of terminology to describe the degree to which data has

been processed (Ball, 2010; National Aeronautics and Space Administration, 2010).

The authors designed it with data processing levels ranging from Level 0, the least processed, to Level 4, the most processed, each with subsets (see Figure 1, below). Ball wrote that


under this scheme, “data do not have significant scientific utility until they reach…Level

1”.

Figure 1 - The National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Data Processing Levels (National Aeronautics and Space Administration, 2010; Ball, 2010).

The author noted that Level 2 has the greatest long-term usefulness, and that

most scientific applications require data processed to at least that level. He described

Level 3 data as being the most “shareable”; those data contain smaller sets than Level 2

data, and are thus easier to combine with other data.
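
To illustrate how such a scheme can be encoded for automated checks, the sketch below represents the Level 0 through Level 4 hierarchy and Ball's observation that scientific utility begins at Level 1. The one-line level descriptions are paraphrased from the cited sources, sublevels such as 1A and 1B are omitted, and the class and function names are this sketch's own rather than anything defined by NASA.

    from enum import IntEnum

    class EOSProcessingLevel(IntEnum):
        """NASA EOS data processing levels, from least to most processed."""
        LEVEL_0 = 0  # raw, unprocessed instrument data
        LEVEL_1 = 1  # reconstructed and calibrated data (sublevels 1A/1B omitted)
        LEVEL_2 = 2  # derived geophysical variables
        LEVEL_3 = 3  # variables mapped onto uniform space-time grids
        LEVEL_4 = 4  # model output or analyses of lower-level data

    def has_scientific_utility(level: EOSProcessingLevel) -> bool:
        # Per Ball (2010), data "do not have significant scientific utility"
        # until they reach Level 1.
        return level >= EOSProcessingLevel.LEVEL_1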


Alternately, Members of the Space Science Board have developed specific

definitions for space data levels and types that range from raw data to a user

description. See Figure 2, below.


Figure 2 - Space Science Board Committee on Data Management and Computation (CODMAC) Space Science Data Levels and Types (Ball, 2010).


These board members considered that space data is not just the data itself, but

also any related documentation needed to access, run, correlate, calibrate, or extract

information from the data.

The authors of the Research Information Network (2008) paper on research data

sharing wrote that researchers and curators further process this data, either by

reduction, annotation, or curation. They noted that researchers often share derived or

reduced data; they do not often share raw data. The authors described how -- once data

has gone through this last process -- it might be made available to other users and

researchers, depending on the implicit and explicit policies of a particular domain.

However, they stated that the trade-off to using derived data is that reproducibility may

be compromised because something may have been lost in the processing.

In addition, the authors noted that if a researcher adds metadata information to

describe the processing techniques used, then the original provenance might be

compromised. They iterated that most researchers prefer to work with raw data, but

practical reasons often prohibit its use by anyone other than the originating researchers.

They described how, when researchers cannot or will not share raw data, sometimes it

is because the data may be in a proprietary format that must be transferable to a more

common format, and that "something" is lost in the transfer. They stated that the reason

for this is that often the raw data set may be too unwieldy to share, or, the researcher(s)

simply may not be willing to share the raw data set (Research Information Network,

2008).

The Research Data Deluge: What and How Big Is It?


Researchers and authors have found it challenging to determine how much data

currently exists, much less how much data exists within science, much less how much

will exist at X point in the future in any field. In order to make an educated estimate, a

researcher must determine what does and does not constitute data. Is it the actual data

created by someone, or the information about them, such as metadata or someone's

digital exhaust? How do you de-duplicate data? Do you count a compressed file or

folder, or an uncompressed file or folder? Another question to consider is, how much is,

"a lot of data"?

Tony Hey and Anne Trefethen's seminal paper (2003) brought attention to the

imminent e-Science data deluge and attempted to quantify the amount of data by

examining Astronomy, Bioinformatics, Environmental Science, Particle Physics,

Medicine and Health, and Social Sciences. Lord and MacDonald (2003) also attempted

to quantify the amount of research data by domain. However, at the time of this literature

review, the numbers in the two papers are around ten years out of date, so the numbers

will not be quoted here as they are no longer relevant -- although the authors' arguments

that a deluge of data is here have remained relevant. The point is that any researcher or

author attempting to quantify and describe "the data deluge" must take into account the

standards of the time, because what is considered "a lot of data" at one time may seem

like "not much data" a generation later.

For example, technologists have often quoted Bill Gates as saying that “640K

ought to be enough for anybody" in 1981 (Tickletux, 2007). (Various authors have written

that he later denied making this statement, but whether or not Bill Gates made that


statement, the point is that users tend to fill up whatever amount of digital storage is

made available to them, and then they will complain that they need more.) Thus, in

1981, researchers used to measuring storage in kilobytes may have considered 10

gigabytes of data to be a "data deluge". Researchers currently speak of data in terms of

exa, zetta, and yottabytes; many, if not most, researchers will concede that "a lot of

data" or a "data deluge" is a relative phrase. One imagines that ancient archivists

managing clay tablets and papyri considered themselves in the midst of a "data deluge"!

Or, a generation from now, future technologists will wonder why curators in the early

2000s considered exabytes "a lot of data". However, whether the amount of data

currently in existence is "a lot" or "not very much" data, analysts have attempted to

quantify the current data deluge using a variety of methodologies.
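
The shift in scale is easier to see with the orders of magnitude written out. The short sketch below lists the decimal byte prefixes and compares a hypothetical 1981-era "deluge" of 10 gigabytes with the 1.8 zettabyte estimate quoted later in this section; the only figures used are those already cited in this essay.

    # Decimal (SI) byte prefixes, kilobyte through yottabyte.
    PREFIXES = {
        "kilo": 1e3, "mega": 1e6, "giga": 1e9, "tera": 1e12,
        "peta": 1e15, "exa": 1e18, "zetta": 1e21, "yotta": 1e24,
    }

    deluge_1981 = 10 * PREFIXES["giga"]    # a hypothetical 1981 "data deluge"
    deluge_2011 = 1.8 * PREFIXES["zetta"]  # IDC estimate (Gantz & Reinsel, 2011)
    print(deluge_2011 / deluge_1981)       # 1.8e+11, i.e. ~180 billion times larger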

Thus, more recently in the data deluge estimation universe, Hilbert and Lopez

(2011) examined "all information that has some relevance for an individual" and did not

try to distinguish between duplicate or different information. They considered the

computation time of information, in addition to transmission through time (storage) and

space (communication). Their study spanned two decades (1986-2007) and 60

categories worldwide (39 digital and 21 analog). Their research indicated that as of

2007, "humankind was able to store 2.9x1020 optimally compressed bytes, communicate

almost 2x2021 bytes, and carry out 6.4x1018 instructions per second on general purpose

computers" (Hilbert & Lopez, 2011).

Beginning in 2007, the research firm IDC created an annual report dedicated to

estimating the amount of new digital information generated and replicated. The study


has been sponsored by EMC, an information-management company. IDC's 2007 study

concluded that the world's capacity to store data had been exceeded by our ability to

generate new data. Their projection was that the annual growth rate for data through

2020 would be 40% (Manyika, et al., 2011). In June 2011, the author of the IDC report

wrote, "the amount of information created and replicated will surpass 1.8

zettabytes…growing by a factor of 9 in just five years" (Gantz & Reinsel, 2011).
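
A quick calculation relates the two figures; the sketch below is only arithmetic on the numbers quoted above, not additional data from either report.

    # Compound annual growth implied by "a factor of 9 in just five years".
    implied_annual_growth = 9 ** (1 / 5)   # ~1.55, i.e. roughly 55% per year

    # For comparison: five years of growth at the 40% annual rate projected through 2020.
    five_year_factor_at_40_percent = 1.40 ** 5  # ~5.4x over five years

    print(implied_annual_growth, five_year_factor_at_40_percent)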

While researchers have examined the amount of data that individuals and

organizations are generating, there is little insight into how much variation there is among

and between the different sectors, such as education, industry, and government.

However, Manyika, et al.'s (2011) research indicated that while the Library of Congress

(LC) had collected 235 terabytes of data as of April 2011, fifteen out of seventeen

sectors in the USA have more data stored than the LC. For example, James Hamilton,

a Vice President at Amazon, has noted that the amount of capacity Amazon ran on in

all of 2001 is currently added to its data centers daily (Gallagher, 2012). Hamilton's

comment has reinforced the earlier point that "a lot of data" is a relative term; one

imagines that Amazon's employees considered that they processed and stored "a lot of

data" in 2001, especially relative to their storage and processing capacities in the 1990s.


Figure 3 - Big Data, MGI's estimate of size (Manyika, et al., 2011).

Regardless of whether or not the current data-intensive environment is a

"deluge", one must consider current technology and demands versus processing and

storage requirements. Manyika, et al., (2011) determined that critical mass has been

reached in every sector, but the level of the intensity of the data generated varies. They

determined these aggregate results by examining four factors: utilization rate, duplication

rate, average replacement cycle of storage, and annual storage capacities shipped by

sector (please see Figure 3, above). The consultants found that for the year 2010, the

amount of data stored in enterprise external disk storage for one year is 7.4 × 10^18 bytes,

including replicas. Their research indicated that for the same year, consumers generated

6.8 × 10^18 bytes. Furthermore, Gallagher (2012) wrote that "Google processes over 20


petabytes of data per day" on searches alone. One must concede that, given current

technologies versus user demands and expectations, that is a lot of data.

PRIVACY VERSUS BIG DATA

It is important to note that data is not just about the content that is created -- it is

also about the information around it. These sources include browsing histories,

geographic locations, and other metadata and "digital exhaust" (Gantz & Reinsel, 2011).

The two authors wrote that the amount of information being created about users of data

is greater than the amount of data and information users are creating themselves. Evans

and Foster (2011) stated that this "metaknowledge" -- knowledge about knowledge --

may include idioms particular to a domain or scientist, the status and history of

researchers when included in a paper, as well as the focus and audience of a particular

journal. The authors argued that studying metaknowledge could provide useful

information about the spread of ideas within a research domain, particularly from teacher

to student.

However, metaknowledge may also be considered digital exhaust. Evans and

Foster (2011) described the former term as the explicit information about someone that

is publicly available, such as a short biography submitted by an author as part of a

paper. Burgess (2011) defined digital exhaust as the information all users leave behind

when using digital technology. This exhaust can be as innocuous as a name in the

metadata of a Microsoft Word document that allows a researcher to determine who his


or her anonymous reviewer is, to browsing history, to one's physical location as

determined by one's proximity to a cell phone tower.
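
A minimal sketch of how easily such document metadata can be read is shown below, using the third-party python-docx library. The file name is hypothetical, and the snippet assumes the package is installed; it only reads the core properties that a word processor records alongside the document body.

    # pip install python-docx
    from docx import Document

    doc = Document("submitted_review.docx")  # hypothetical file name
    props = doc.core_properties

    # These fields travel with the file unless they are deliberately scrubbed.
    print(props.author)            # author recorded by the word processor
    print(props.last_modified_by)  # most recent editor
    print(props.created, props.modified)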

The data that is generated about individuals may be unimportant, but, en masse,

it gives governments and corporations an incredible amount of data and information about

individuals that has previously been private. This information may include Tweets,

photos, emails, and Facebook posts. For example, Hough (2009) discussed a study

in which 75% of Facebook users post information indicating that they are out of town,

thus putting themselves at risk of a break-in.

Sullivan (2012) wrote an article describing university and government agencies'

demands for athletes' and job applicants' Facebook account user names and passwords

in order to better monitor each person's personal habits and preferences. Some state

legislators are banning the practice, citing the First Amendment. Solove (2007) argues

that just because an individual may not have anything to hide, does not mean that he or

she must share their personal data, while Hough (2009) declares that individuals should

not be so willing to give up their privacy as the price of using technology. Hough cites a

study by Sweeny (2002) in which 87% of the population of the United States may be

uniquely identified using only 1990 census data -- gender, date of birth, and a five-digit

zip code. Sweeny proved it is fairly easy to determine an individual's Social Security

Number, particularly if that individual was born after 1980 -- simply by knowing their date

and place of birth.

As well, Sweeny (2002) provided one of the most famous examples of how easy

it is to find individual information. The researcher correlated the information contained in


a public data set provided by the primary state employee health care provider for

Massachusetts with publicly available voter registration data. The voter rolls contained

each individual’s name, birth date, address, gender, and zip code. The data set provided

by the Massachusetts state employee health care provider contained each anonymized

individual's birth date, zip code, gender and their individual medical information, such as

medications and procedures. Sweeny used this information to find then-Massachusetts

Governor Weld's medical records, and she promptly mailed his own records

to him! She found his medical records by matching shared attributes: Governor Weld

then lived in Cambridge, Massachusetts. Based on the voter rolls, six people in

Cambridge had the same birth date, three were men, and only one lived in Weld's 5-digit

zip code.
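
The linkage Sweeny performed amounts to a join on the shared quasi-identifiers. The sketch below reconstructs the idea with the pandas library and entirely made-up records; the column names and values are invented for illustration and are not taken from either original data set.

    import pandas as pd

    # "Anonymized" health records: no names, but quasi-identifiers remain.
    health = pd.DataFrame([
        {"birth_date": "1950-01-01", "gender": "M", "zip": "02138", "diagnosis": "example"},
    ])

    # Public voter roll: names alongside the same quasi-identifiers.
    voters = pd.DataFrame([
        {"name": "J. Doe", "birth_date": "1950-01-01", "gender": "M", "zip": "02138"},
    ])

    # Joining on birth date, gender, and ZIP re-attaches a name to the
    # "anonymous" medical record.
    reidentified = health.merge(voters, on=["birth_date", "gender", "zip"])
    print(reidentified[["name", "diagnosis"]])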

A few years later, in March 2010, Netflix cancelled an annual prize competition to

develop better recommendation algorithms, due to privacy concerns. Narayanan and

Shmatikov (2007) correlated the supposedly anonymized user data Netflix had provided

to the contest's participants and compared it to data from the Internet Movie Database.

The researchers claimed they successfully identified the Netflix records of known users,

thus revealing their implied political views and other potentially sensitive information.

Thus, researchers must be careful about what data they release, how much, and

to whom. Even supposedly anonymized data may provide enough detail to be

dangerous when correlated with other publicly available data.


WHY SHARE AND PRESERVE DATA?

The National Science Foundation and the National Institutes of Health in the

United States, as well as major research funders in the United Kingdom, now require the

researchers they fund to provide data management plans and be prepared to share the

data generated from their research (National Science Foundation, 2010; National

Institutes of Health, 2010; and, Jones, 2011). The policy arguments for sharing data are

primarily based around two reasons: to ensure the reproducibility and replicability of

science; and, so that the results of taxpayer funded research are made re-usable in

order to maximize the returns from the high costs involved in gathering the data initially

(Borgman, 2010; National Science Board, 2011).

As noted above, observational data is the most vulnerable with regards to

reproducibility because it is based on a specific time and place; experimental data and

model data are replicable in theory. However, if these data are curated in the appropriate

formats with the required software, hardware, and any related scripts, then the research

results should be replicable. Borgman (2010), Lynch (2008), Fry, et al. (2008), and, Lord

and MacDonald (2003) stated that the reasons librarians and libraries should curate the

outputs of scientific research are pretty simple: curation is not an end in itself, it is a way

of supporting science by providing methods for access, use, re-use, and a more

complete and transparent record of science. However, the members of the National

Science Board (2011) have made the point that a one-size-fits-all approach to data

sharing is neither desirable nor feasible. Instead, the National Science Foundation


(2010) has encouraged each domain to establish its own standards for data

management.

Other policy reasons cited by the National Research Council (2010) and

Borgman (2010) included the creation of new science based on new questions of

existing data, such as finding patterns, and advancing research in general by creating a

new set of data-intensive methods that move science beyond theory, simulation, and

empiricism, i.e., "the 4th paradigm". Wired Magazine's Chris Anderson (2008) took the

4th Paradigm idea too far, however, when he declared that "the data deluge makes the

scientific method obsolete". As Borgman (2010) observed, "access to data does not a

scientist make", as rigorous data analysis requires a certain amount of expertise to

accurately interpret often-complex information and associated metadata. Fry, et al.

(2008) cited a study in which researchers expressed concern that public access to

research data would only increase confusion, rather than transfer any useful knowledge

to the general public.

Given the potential dangers of providing data to others for their use and re-use,

as noted in a previous section, why should a researcher share their data with anyone?

The reasons vary, but generally involve coercion (i.e., a funder requires it); a

requirement for reciprocal data sharing; the value of collaboration; reduced costs from

preventing duplicate data collection; and, a desire to support the scientific method and

ensure that studies are replicable (Borgman, 2010; Van den Eynden, et al., 2011).

Researchers' willingness to share their data varies by domain; it is rare for climate


scientists to share their data or re-use another researcher's model-run data. Therefore,

climate scientists have little incentive to repurpose data for re-use.

However, for those researchers who work in a domain that shares data formally

or informally, such as Astrophysics (Harley, et al., 2010), the Research Information

Network (2008) study indicated that other incentives for sharing include paper co-

authorship opportunities, greater collaboration opportunities, and greater visibility for the

researcher's institution and research group. Regardless of whether or not a particular

domain encourages data sharing, Borgman (2010, 2008) wrote that publication is still the

route to success and rewards, not data sharing, although research productivity is shown

to increase with both informal and formal data sharing, especially with secondary

publications (Pienta, Alter, & Lyle, 2010).

Borgman (2010, 2008) and Fry, et al. (2008) also noted that other disincentives

to sharing data are the time and resources required to re-purpose the data; the

researcher's inability to control their intellectual property; and, concerns that their

research results will be "scooped" by another researcher, if no embargo period on data

sharing is required and enforced. In addition to these disincentives for data sharing,

Lynch (2008), Fry, et al. (2008) and Cragin, et al. (2010), listed legal and ethical

constraints, lack of expertise in data management, a lack of time to handle data

requests, and a lack of technical infrastructure in which to publicly archive the data.

Scholars prefer to perform research and write the publications rather than curate

data for re-use and storage (Lynch, 2008; Harley, et al., 2010). However, Pienta, Alter

and Lyle (2010) studied the use and re-use of Social Science primary research data, and


their research indicates that while informal data sharing is the norm in the Social

Sciences, the sharing of data via an archive "leads to many more times the publications

than not sharing data".

Publications such as Science and Nature have called upon the larger science

communities to create the infrastructure to share and curate data for the indefinite long

term (Hanson, Sugden & Alberts, 2011; Editor, 2009, 2005). The editors of Science, for

example, require authors to submit not just a copy of the data itself, but any computer

code required to read the data. The Toronto International Data Release Workshop

Authors (2009) examined prepublication data sharing within genomics, and they

recommended that it be extended to related domains. At the opposite end, Schofield, et

al., (2009) discussed ways to promote data sharing among mouse researchers in an

opinion piece. The authors concluded that a research commons must be created, but

that data sharing would require an entire culture change for their field.

Curry (2011) provided an example of particle physicists who rescued an old data

set from the 1990s; these physicists then wrote more than a dozen new high-impact

papers from this same set. In spite of these examples, and the support of major

publications, Nelson (2009) wrote that the power to "prod" researchers to share their

data must come from the organizations that have real clout with researchers: the funding

agencies, scientific societies, and journals. However, as Lynch (2008) noted, the best

use of scientists' time is to devote it to practicing science. He wrote that researchers are

not the best at data management, and this area should be left to professional data

stewards.


Thus, it appears that most managers of major funding agencies, librarians and

archivists, scientists, and journal editors and authors have been encouraging or requiring

data sharing among researchers. However, whether or not a researcher is willing to do

so may depend on a variety of factors, including personal preference. So long as data

analysis takes up the majority of researchers' time, they may not have the resources to

share data, even with the appropriate infrastructure and policies in place, given the

amount of time it takes a researcher to prepare data for use, re-use, and long-term

preservation (Research Information Network, Institute of Physics, Institute of Physics

Publishing, & Royal Astronomical Society, 2011). Thus, taking into account most

researchers' resource constraints, how well and often a researcher may share his or her

data, even if they are willing to do so, is still to be determined, in spite of funders'

requirements.

INFRASTRUCTURE AND DATA CENTERS

Researchers may find incentives to share their data, as more data-centric

infrastructure becomes the norm, even in domains where data sharing is not yet common practice.

However, as Lynch (2008) concluded, one of the issues that must be clarified concerns

what institution or domain is responsible for providing the underlying infrastructure and

data stewardship. Some librarians think that it is the library's responsibility to provide this

infrastructure; others believe it is better for each domain to come together and create

this infrastructure, given the proprietary nature of data formats, software, etc.; while still

others promote the concept of national data centers; and, finally, some data managers


prefer institution-based infrastructure (Walters & Skinner, 2011; Research Information

Network, 2011; UKRDS, 2008; Soehner, Steeves & Ward, 2010).

The members of the Association of Research Libraries (ARL) institutions have

described four models of data infrastructure to support e-science: multi-institutional

collaborations; a decentralized or unit-by-unit approach; a centralized or institution-wide

response; or, a hybrid centralized and decentralized approach (Soehner, Steeves, &

Ward, 2010). Lyon (2007) derived a "domain deposit model" and a "federation deposit

model" from her study results. She described the domain deposit model as a "strong

integrated community…with well-established common standards, policy and practice",

and defined the federation deposit model as a group of repositories which have come

together "based on some agreed level of commonality" in a documented partnership.

The author wrote that the "federation deposit model" might be built around an institution,

regional geographic boundaries, format type, or software platform.

The debate over who will provide infrastructure, and what model that support

service will follow, is similar to the problems that arose with the development of

Institutional Repositories (IRs) in the 2000s. arXiv, while not an Institutional Repository

per se, grew out of the Physics community's culture of sharing research results

immediately, and has grown to encompass Computer Science, Astrophysics, and

Mathematics, among others, but that does not mean the arXiv model fits all e-print

needs for all domains or institutions (Ginsparg, 2011). Researchers' needs have been

heterogeneous, as are each field's communication styles and technical expertise (Kling

& McKim, 2000; Borgman, 2008). Foster and Gibbons' (2005) study proved that


librarians eagerly built Institutional Repositories, only to find a lukewarm reception from

faculty and researchers, which led to a lack of IR content.

The Research Information Network (2009) studied life sciences researchers and

noted that one infrastructure and data sharing model will not fit all research domains,

and that the information practices of life scientists do not match those of information

practitioners and policy makers. Librarians may wish to grow data-sharing infrastructure

more carefully than they did IRs, growing it based on need rather than the latest

trend. So far, however, researchers have seemed to value data centers and they have

stated that their existence has improved their ability to "undertake high-quality research"

(Research Information Network, 2011). Currently, whether or not one or more of the

above-mentioned ARL models will prove to be the best choice is in flux, generally

because each institution and domain has different needs and requirements.

As regards other areas of big data infrastructure, such as preservation repository

design and policies, those topics were covered in the previous sections, "Managing

Data: Preservation Repository Design (the OAIS Reference Model)," and "Managing

Data: Preservation Standards and Audit and Certification Mechanisms (i.e., ‘policies’)".

Other, more technical discussions, such as over-the-network and local data processing,

data discoverability and indexing, physical networking infrastructure, interoperability,

security, data center design, syncing, data replication, data backups, etc., are beyond

the scope of this essay.


In conclusion, the results of the studies discussed in this essay have indicated

that for data to be stewarded for the long-term, research scientists will need some type

of support infrastructure: technical, financial, and managerial.

ROLES AND RESPONSIBILITIES

Lyon (2007) observed that there was a "dearth of skilled practitioners, and data

curators play an important role in maintaining and enhancing the data collections that

represent the foundations of our scientific heritage". The author wrote that in time,

"native data scientists" would emerge from within each domain's curriculum as data

management becomes integrated into graduate research training. Gray, Carozzi and

Woan (2011) noted, "normal data management practice…corresponds to notably good

practice in most other areas". Their recommendation was for administrators to formalize

data management planning in order to make it more auditable. One aspect of this

formalization is to define the roles and responsibilities by individual, role, and sector.

The members of the National Science Board (National Science Foundation,

2005) defined the primary roles and responsibilities of institutions and individuals. They

defined four primary individual roles: data authors, data managers, data scientists, and

data users.

• Data Author: this individual is involved in research that produces digital data. This person should receive credit for the production of the data, and ensure that it may be broadly disseminated, if appropriate. The data author must ensure that the metadata, and data recording, context, and quality all conform to community standards.

• Data Manager: this individual is responsible for the maintenance and operation of the database. This person must follow best practices for technical management such as replication, backups, fixity checks, security,


enforcement of legal provisions, and implement and enforce community standards and preferences for data management. The data manager must provide appropriate contextual information for the data, and design and maintain a system that encourages data deposit by making it as simple and easy as possible.

• Data Scientist: the individuals who are data scientists have a variety of roles. This person may be a librarian, archivist, computer or information scientist, software engineer, database manager or other disciplinary expert. His or her contributions involve advising on the implementation of technology and best practices to the data for long-term stewardship and ensuring that it is implemented properly, as well as enhancing the ability of domain scientists to conduct their research using digital data. This role involves creative inquiry, analysis, and outreach, as well as participating in research appropriate to the data scientist's own domain, for the purposes of publication and contributing to research progress.

• Data Users: this individual is a member of the larger research and scientific community, and this person will benefit from having access to data sets that are well-defined, searchable, robust, and well-documented. The data user must credit the data author, adhere to copyright and other restrictions, and must notify the appropriate data managers or data authors of any data errors (National Science Foundation, 2005).

The National Science Foundation (2005) authors also defined the responsibilities

of the funding agencies. They stated that these agencies must provide a science

commons to enable data sharing, help to create a culture in which data sharing is

rewarded, and enable access to data across research communities. The board members

were adamant that the representatives of the funding agencies, the agencies

themselves, the various individuals, and their respective institutions, all have a part to

play in ensuring the long-term stewardship of data.

Swan and Brown (2008) examined the roles and career structures of data

scientists and curators in order to provide recommendations for their career


development. They defined and examined both the roles and the career trajectories of

those who manage the data itself. First, the authors distinguished the following four roles

based on interviews of practicing data scientists and curators.

• Data creator: researchers with domain expertise who produce data. These people may have a high level of expertise in handling, manipulating and using data

• Data scientist: people who work where the research is carried out – or, in the case of data centre personnel, in close collaboration with the creators of the data – and may be involved in creative enquiry and analysis, enabling others to work with digital data, and developments in data base technology

• Data manager: computer scientists, information technologists or information scientists and who take responsibility for computing facilities, storage, continuing access and preservation of data

• Data librarian: people originating from the library community, trained and specialising in the curation, preservation and archiving of data (Swan & Brown, 2008).

Next, the authors interviewed practitioners regarding their roles and

responsibilities, and career satisfaction. They discovered that most data scientists

moved into their roles "by accident rather than design"; that "there is no defined career

structure"; and that they feel undervalued within their research group due to the lack of

professional training and/or a defined career path. Swan & Brown (2008) described three

primary roles for libraries with respect to data stewardship. First, librarians must provide

preservation and archiving services for data, particularly through Institutional

Repositories. Second, they must provide consulting and training for data creators. Third,

librarians must develop training curricula for data librarians.


The Interagency Working Group on Digital Data (2009) defined the various roles

involved with "harnessing the power of digital data for science and society". They

described the entities by role, individual, sector, and life cycle phase/function, and the

individuals by role and life cycle phase/function. They defined entities as research

projects, data centers, libraries, archives, etc., and defined the role for each one and

provided an existing example. For example, the authors provided eleven tasks under

"role" for the entity "archives", and provided the name of the National Archives and

Records Administration as an example. They defined eleven different types of individual

roles, including data scientist, librarian, and researcher, for example, along with a

corresponding definition for the role. Please go to Appendix A to view the complete set

of tables as Figures 6-13.

In conclusion, the authors above have demonstrated that while one person may

take on the multiple roles of data creator, data scientist, data manager, and data user, it

ultimately takes an entire team and community to ensure the long-term survivability of

research data.

SUSTAINABILITY

General funding and sustainability estimates are covered in the section,

"Managing Data: the Emergence & Development of Digital Curation & Digital

Preservation Standards". This section will focus on sustainability issues related to data

sets, per se.

It is as challenging for information practitioners to determine the true cost of data

stewardship as it is for them to measure the amount of digital data. Gray, Carozzi and


Woan (2011) cited several studies and existing science archives, including one that had

been built recently by an experienced archive staff. The authors wrote that staff costs, as

well as acquisition and ingest costs, account for a substantial portion of preservation

project funding, which reflected Lord and MacDonald's (2003) earlier findings. They did

not provide any hard numbers, though, and noted that those costs only scaled weakly as

an archive grew larger. In other words, they learned that an archive's initial size largely

governs its costs, and that when an archive starts small and grows larger, total costs do

not rise proportionally.

Gray, Carozzi and Woan (2011) called for a costing model to be developed, as

they found that there is a lack of consensus on the long-term costs related to the

preservation of large-scale data. Lord and MacDonald (2003), Lyon (2007), Fry, Lockyer,

and Oppenheim (2008), and Ball (2010) have all previously made calls for the

development of a solid cost model as well, as they had found it challenging to determine

the "full costs of curating data". One of the primary reasons driving the confusion

regarding how much data stewardship will cost is determining who is responsible (re:

who will pay) for data stewardship and the differing degrees of data curation (Blue

Ribbon Task Force on Sustainable Digital Preservation and Access, 2008; Fry, Lockyer,

and Oppenheim, 2008).

The problems the Blue Ribbon Task Force on Sustainable Digital Preservation

and Access (2008) defined as barriers to developing an accurate cost model are

systemic, rather than simply about finding and setting a price for the product. The

problems they identified include: the idea that "current practices are good enough"; the


fear of addressing adequate data stewardship because it is "too big"; inadequate

incentives to support the group effort needed to create sustainable economic models; a

lack of long-term thinking regarding funding models; and lack of clarity and alignment

with regards to the various responsibilities and roles between data stakeholders.

The Task Force reviewed several models including the LIFE (Life Cycle

Information for E-Literature) project and the model by Beagrie, Chruszcz, and Lavoie

(2008). The members of the LIFE project aimed the model towards libraries, and one of

their discoveries has been that "upfront (i.e., one-time) costs of a project are often

distinct in structure from the recurring maintenance aspects of the same project" (Blue

Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).

Figure 4 - LIFE (Life Cycle Information for E-Literature) Project (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).


So, for example, when SHERPA-DP IR used the LIFE model to determine their

full lifecycle costs, they determined that, excluding interest rates and depreciation, their

costs measured at the unit for which metadata is created (e.g., per object cost for

analogue, per page cost for digital) are:

• Year 1: 18.40 English pounds per year total cost

• Year 5: 9.70 English pounds per year total cost

• Year 10: 8.10 English pounds per year total cost (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).

Beagrie, Chruszcz, and Lavoie (2008) developed a model to inform institutions of

higher learning of their preservation costs. They built upon the work of the LIFE project

team, and mapped it to the Trustworthy Repositories Audit & Certification: Criteria and

Checklist and the OAIS Reference Model. The authors discovered upon the application

and testing of the model "that the costs of preservation increase, but at a decreasing

rate, as the retention period increases" (Blue Ribbon Task Force on Sustainable Digital

Preservation and Access, 2008). The administrators of the Archaeology Data Service

studied and re-adjusted their charging policy after applying Beagrie, Chruszcz, and

Lavoie's (2008) method to examine staff salaries, time, and days, and, therefore,

reached a more realistic assessment of costs.
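
As a purely illustrative sketch (the rates, floor, and decay figures below are hypothetical assumptions, not values from the LIFE or Keeping Research Data Safe models), the following short calculation shows how a total cost can keep rising while rising at a decreasing rate, which is the pattern Beagrie, Chruszcz, and Lavoie (2008) reported as the retention period increases.

    # Illustrative only: hypothetical numbers, not the KRDS or LIFE figures.
    # Annual preservation cost is assumed to decay toward a fixed floor, so the
    # cumulative cost keeps growing, but more slowly each year.

    def annual_cost(year, initial=18.40, floor=8.10, decay=0.25):
        """Hypothetical per-unit cost (pounds) in a given year of retention."""
        return floor + (initial - floor) * (1 - decay) ** (year - 1)

    def cumulative_cost(retention_years):
        """Total hypothetical cost of retaining one unit for N years."""
        return sum(annual_cost(y) for y in range(1, retention_years + 1))

    if __name__ == "__main__":
        for years in (1, 5, 10, 20):
            print(f"{years:>2} years retained: {cumulative_cost(years):7.2f} pounds in total")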

Beagrie and JISC (2010) summarized the model in a fact sheet that outlined

recommendations to funders and institutions regarding what costs most (acquisition and

ingest), the impact of fixed costs (they do not vary by the size of the collection and staff

costs remain high), and the declining costs of preservation over time (they decline to


minimal levels after 20 years). The authors outlined the benefits (direct, indirect, near-

and long-term, private and public) to preserving research data; those benefits have been

outlined throughout this paper. The authors discussed the various types of repositories

and recommended a federated model with local storage at the departmental level, with

additional back up at the institutional level. They also encouraged institutions to work

with existing archives over creating new ones. Finally, they pointed out that research

data are heterogeneous and are less likely to be stored in an Institutional Repository.

In conclusion, while Beagrie, Chruszcz, and Lavoie (2008) and the LIFE project,

among others, have developed substantive cost models that provide very useful financial

information for repository managers, these will need to be revised and updated over the

long-term in order to determine the accuracy of the respective models.

RESEARCH DATA CURATION

General data curation is covered in another section, "Managing Data: the

Emergence & Development of Digital Curation & Digital Preservation Standards".

Therefore, the remainder of this section will address only those areas related to research

data curation that were not covered in the previous literature review on digital curation.

According to Ball (2010), the curation of research data is best understood in

terms of the research data life cycle, data repositories, and funders' requirements and

guidance.

Data Curation vs. Digital Curation

What is data curation, and how does it differ from digital curation, if at all? First, it

is important to note that the curation of scientific data goes back centuries. Data curation


is an older term than "digital curation". It applied to journals, reports, or databases that

were selected, annotated, normalized and integrated to be used and re-used by other

researchers or historians. These data were not and are not always in digital form. Data

curation is a narrower concept than digital curation, and although the two phrases are

often used synonymously, they are not interchangeable (Ball, 2010).

Second, to further clarify, Lord and MacDonald (2003) included the following

tasks as part of data curation:

• Selection of datasets to curate.

• Bit-level preservation of the data.

• Creation, collection and bit-level (or hard-copy) preservation of metadata to support contemporaneous and continuing use of the data: explanatory, technical, contextual, provenance, fixity, and rights information. Surveillance of the state of practice within the research community, and updating of metadata accordingly.

• Storage of the data and metadata, with levels of security and accessibility appropriate to the content.

• Provision of discovery services for the data; e.g. surfacing descriptive information about the data in local or third-party catalogues, enabling such information to be harvested by arbitrary third-party services.

• Maintenance of linkages with published works, annotation services, and so on; e.g., ensuring data URLs continue to refer correctly, ensuring identifiers remain unique.

• Identification and addition of potential new linkages to emerging data sources.

• Updating of open datasets.

• Provision of transformations/refinements of the data (by hand or automatically) to allow compatibility with previously unsupported workflows, processes and data models.

• Repackaging of data and metadata to allow compatibility with new workflows, processes and (meta)data models (Ball, 2010).


The authors included curation tasks that are part of the broader concept of digital

curation, such as bit-level curation, metadata creation, and selection. They also provided

for data curation specifically by including data transformation and refinement, and

repackaging -- e.g., data clean-up -- all of which are tasks not normally associated with

the curation of, say, digital objects consisting of e-prints or photographic images.
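
Several of the curation tasks above, notably bit-level preservation and the recording of fixity information, are routinely implemented with checksums. The sketch below is illustrative only: the choice of SHA-256 and the file names are assumptions, not practices prescribed by Lord and MacDonald (2003) or Ball (2010).

    # Minimal fixity sketch: record a checksum at ingest, re-verify it later.
    # Algorithm choice and file paths are assumptions for illustration.
    import hashlib
    from pathlib import Path

    def fixity(path, algorithm="sha256"):
        """Return the hex digest of a file's contents, read in 1 MB chunks."""
        digest = hashlib.new(algorithm)
        with Path(path).open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify(path, recorded_digest):
        """True if the stored bits still match the digest recorded at ingest."""
        return fixity(path) == recorded_digest

    # Hypothetical usage:
    #   recorded = fixity("survey_2011.csv")        # store with the metadata
    #   assert verify("survey_2011.csv", recorded)  # periodic bit-level check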

The Research Data Lifecycle

Ball (2012) wrote that lifecycle models help practitioners plan in advance for the

various stages involved in the stewardship of digital data. There are several lifecycle

models available for guidance. The author described the "I2S2 Idealized Scientific

Research Activity Lifecycle Model" as a model produced from the researchers'

perspective, while the "DCC Curation Lifecycle Model" is produced from the perspective

of information professionals. These two lifecycle models are a representative sample of

the information available in the various lifecycle models and were chosen with that in

mind; time and space limitations prohibit a longer discussion of the pros and cons of all

lifecycle models.

Thus, this section will discuss the "I2S2 Idealized Scientific Research Activity

Lifecycle Model", and will attempt to describe the common themes across several

available data management lifecycle models. The "DCC Curation Lifecycle Model" is

covered in a previous essay, "Managing Data: the Emergence & Development of Digital

Curation & Digital Preservation Standards".

The members of the I2S2 project created the "I2S2 Idealized Scientific Research

Activity Lifecycle Model" with the researcher's perspective in mind, not the data


manager's perspective. Thus, Ball (2012) wrote that archiving is a very small part of the

lifecycle. The goal of the project team members was to integrate, accelerate, and

automate the research process, and so they created this lifecycle model in support of

those goals. They designed the model to support research activity, not data

management per se. Thus, they outlined the tasks involved throughout the lifecycle of a

research project.

Figure 5 - I2S2 Idealized Scientific Research Activity Lifecycle Model (Ball, 2012).

The project team designed the model with four key elements: curation activity,

research activity, publication activity, and administrative activity. They sketched out the

curation activity as a task performed by a data archive or repository. They outlined the

administrative activity as the process of applying for funding, providing for reports, and

writing final reports. The authors of the model defined publication activity as those tasks


involved with preparing the data for public use and the writing and publication of papers.

And, finally, they defined the research activity as that part of the project that involves

conducting the research itself.

The data management lifecycle models included in this section for purposes of

describing themes common across all life cycles are: the Interagency Working Group on

Digital Data (IWGDD) Digital Data Lifecycle Model (Interagency Working Group on

Digital Data, 2009); the Data Documentation Initiative (DDI) Combined Life Cycle Model; the

Australian National Data Service (ANDS) Data Sharing Verbs; the DataONE Data

Lifecycle; the UK Data Archive Data Lifecycle; Research360 Institutional Research

Lifecycle; and, the Capability Maturity Model for Scientific Data Management (Ball,

2012).

The themes common across all lifecycle models include planning the project;

gathering, processing, analyzing, describing and storing the data; and, archiving the data

for future use. It is interesting to note that only the "DCC Curation Lifecycle Model"

provides for the deletion of data; an unstated assumption by the authors in the remaining

models is that all data will be re-used and re-purposed.

Data Repositories

This section's content is discussed in the previous essay, "Managing Data:

Preservation Repository Design (the OAIS Reference Model)".

Funders' Requirements and Guidance

Administrators at both the National Institutes of Health and the National Science

Foundation now require researchers to provide data management plans in their grant


proposals. They have instituted policies that require researchers to make the resulting

research data from the grant-funded project available for re-use within a reasonable

length of time.

The National Institutes of Health

The authors of the National Institutes of Health (2010; 2003) (NIH) requirements

have mandated that researchers share the final data set once the publication of the

primary research findings has been accepted. They have made allowances for large

studies: the data sets from large studies may be released in a series, as the results from

each data set are published or as each data set becomes available.

The administrators at the NIH have required that all organizations and individuals

receiving grants make the results of their research available to the public and to the

larger research community. They have required a simple data management plan for any

grant proposals requesting more than $500,000. If a researcher cannot share the data,

then they must provide a compelling reason to the NIH in the data management plan.

The grantors at the NIH have asked grantees to provide the following information

in the data management plan: mode of data sharing; the need, if any, for a data sharing

agreement; what analytical tools will be provided; what documentation will be provided;

the format of the final data set; and, the schedule for sharing the data.

The following are three example data management plans that employees of the NIH

have provided to grant applicants.

• Example 1

The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome.


This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects. Therefore, we are not planning to share the data.

• Example 2

The proposed research will include data from approximately 500 subjects being screened for three bacterial sexually transmitted diseases (STDs) at an inner city STD clinic. The final dataset will include self-reported demographic and behavioral data from interviews with the subjects and laboratory data from urine specimens provided. Because the STDs being studied are reportable diseases, we will be collecting identifying information. Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and associated documentation available to users only under a data-sharing agreement that provides for: (1) a commitment to using the data only for research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or returning the data after analyses are completed.

• Example 3

This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years. Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/ User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource. Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties. (National Institutes of Health, 2003)

The implementers of the NIH data management plans wanted to make them as

simple as possible, as these plans are but one part of the NIH grant application.


However, it is evident to most information professionals that these plans are not

adequate for long-term data stewardship.

The National Science Foundation

The authors of the National Science Foundation (2011) policy on data

management wanted to provide a way to share data within a community while

recognizing intellectual property rights, allow for the preparation and submission of

publications, and protect proprietary or confidential information. They have made it clear

to grant recipients that they must facilitate and encourage data sharing.

The reviewers have required grant applicants to include a 2-page supplementary

document entitled, "Data Management Plan". The grant applicants must describe how

any data resulting from the NSF-funded research will be disseminated and shared via

NSF's policies. The authors of the NSF's data management plan (DMP) policy have

recognized that each of the seven directorates have different cultures and requirements

for data sharing. Therefore, the administrators at the NSF have given each directorate

leeway to determine the best data management practices for each domain, including

whether or not researchers must deposit data in a public data archive (Hswe and Holt,

2010).

These policy makers have defined the following areas as items that may be

included in a data management plan.

• The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project.

• The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies).


• Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements.

• Policies and provisions for re-use, re-distribution, and the production of derivatives.

• Plans for archiving data, samples, and other research products, and for preservation of access to them (National Science Foundation, 2011).

The authors of the data management plan policy have allowed for exceptions to

the policy. They stated that grant applicants may include a data management plan that

includes "the statement that no detailed plan is needed, as long as the statement is

accompanied by a clear justification" (National Science Foundation, 2011).

Because grant administrators at the NSF have only required data

management plans from grant applicants since January 2011, unlike with the

NIH, examples of data management plans from successful grant applications are not yet

available.

Librarians and archivists in the United States have drawn heavily upon the work

performed by the employees and researchers of the Joint Information Systems

Committee (JISC) and the Digital Curation Centre (DCC) in the United Kingdom. Most

academic and research librarians at major research universities and related institutions

have provided a plethora of online templates, tools, and resources for NSF grant

applicants to use. While there is some variation in minor details, most of the data

management plans created by information professionals contain the same elements.

The Interuniversity Consortium for Political and Social Research (ICPSR) (2012) and the


California Digital Library (2012) are among those institutions and individuals that have

developed extensive data management plan guidance for researchers.

Information professionals at ICPSR compiled their recommended elements for a

data management plan that researchers may draw from when compiling a plan for either

the NSF or NIH. They recommended that researchers include a description of the data;

a survey of existing data; the existing formats of the data; any and all relevant metadata;

data storage methods and backups; data security; the names of individuals responsible

for the data; intellectual property rights; access and sharing; the intended audience; the

selection and retention period; any procedures in place for archiving and preservation;

ethics and privacy concerns; data preparation and archiving budget; data organization;

quality assurance; and, legal requirements (ICPSR, 2011).
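
As one way to picture the shape of such a plan, the sketch below treats the ICPSR-style elements as a simple checklist and reports which sections are still empty. The field names paraphrase the elements listed above; the dictionary layout and sample values are assumptions for illustration, not an ICPSR or funder format.

    # Illustrative checklist of data management plan elements, loosely following
    # the ICPSR list above; the layout and sample values are assumed.
    dmp = {
        "description_of_data": "Coded survey responses and interview transcripts",
        "existing_data_survey": "",
        "data_formats": "CSV and PDF/A",
        "metadata": "Study-level DDI metadata",
        "storage_and_backups": "Nightly backups to institutional storage",
        "security": "",
        "responsible_individuals": "PI and departmental data manager",
        "intellectual_property_rights": "",
        "access_and_sharing": "Public-use file after de-identification",
        "intended_audience": "",
        "selection_and_retention": "Retain raw data for ten years",
        "archiving_and_preservation": "",
        "ethics_and_privacy": "Covered by IRB-approved consent language",
        "preparation_and_archiving_budget": "",
        "data_organization": "",
        "quality_assurance": "",
        "legal_requirements": "",
    }

    incomplete = [element for element, text in dmp.items() if not text.strip()]
    print("Plan sections still to be completed:", ", ".join(incomplete))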

Researchers and employees of the California Digital Library created the "Data

Management Plan Tool" (2012) based on prior work by the Digital Curation Centre

(2012) to allow researchers to quickly create a legible plan suitable to their particular

funder's requirements. For example, the authors of the tool took into account each NSF

directorate's requirements and created a separate template based on those

requirements. They included funding agencies such as the Institute of Museum and

Library Services (IMLS), the Gordon and Betty Moore Foundation, the National

Endowment for the Humanities (NEH), and, of course, the NSF. They did not include a

template for the NIH. The authors created the templates so that outputs in the final

document created by the researcher may include information about data types,

metadata and data standards, access and sharing policies, redistribution and re-use


policies, and archiving and preservation policies. They designed the templates to output

only the fields the researcher completes, so while there are standard templates based

on requirements, the output may vary based on the information provided by the

researcher (California Digital Library, 2012).

Carlson (2012) created a data curation profile toolkit for librarians and archivists

to use when interviewing researchers about their data. While Carlson did not create this

toolkit in support of the NSF requirements, reference librarians may find it a useful

resource for questions to draw upon when they collaborate with a scientist. The author

designed the toolkit as a semi-structured interview to assist librarians in conducting a

data curation assessment with a researcher. Carlson created a user guide, an

interviewer's manual, an interview worksheet, and a data curation profile template. He

designed the questions to elicit the information required to curate data; most of the

information required from the researcher maps to the recommended elements of the

ICPSR Data Management Plan, above.

In conclusion, information professionals have been working hard to assist

researchers in developing appropriate planning tools with which the researchers may

steward the data. However, many researchers are unaware of these services, or

consider them to be yet another bureaucratic hurdle (Research Information Network,

2008). It remains to be seen whether or not data creators will use the services

information professionals have made available. It also remains to be seen whether or not

the data management plans required and approved by the National Institutes of Health


and the National Science Foundation will be adequate for long-term data stewardship, at

least by the standards of information professionals.

THE APPLICATION OF POLICIES TO REPOSITORIES AND DATA

This section briefly discusses the automation of preservation policies and the

application of policies to data curation.

The Automation of Preservation Management Policies

How can information professionals tame the data deluge while stewarding data?

One way is for these professionals to take human-readable data stewardship policies

and implement them at the machine-level (Rajasekar, et al., 2006; Moore, 2005). This

"policy virtualization" is discussed in a previous section, "Managing Data: the Emergence

& Development of Digital Curation & Digital Preservation Standards", and an example is

presented.

Reagan W. Moore has stated that the challenge to virtualizing human-readable

policies to machine-readable code is that most groups cannot prove that they are doing

what they say they are doing (personal communication, January 6, 2012). This is a

known problem, as Waters and Garrett (1996) stated in the Executive Summary of

their seminal report that archives must be able to prove that "they are who they say they

are by meeting or exceeding the standards and criteria of an independently-administered

program".

Moore & Smith (2007) automated the validation of Trusted Digital Repository

(TDR) assessment criteria. They created four levels of assessment criteria mapped to

TDR Assessment Criteria: Enterprise Level Rules, such as descriptive metadata;


Archives Level Rules, such as consistency rules for persistent identifiers; Collection

Level Rules, such as flags for service level agreements; and, Item Level Rules, such as

periodic rule checks for format consistency. The authors implemented these rules using

iRODS with DSpace.

The researchers successfully demonstrated that preservation policies could be

implemented automatically at the machine level, and that an administrator could audit

the system and prove that the TDR assessment criteria have been successfully

implemented. In other words, Moore & Smith (2007) were able to prove that they are

preserving what they have said they are preserving by virtualizing human-readable

policies to machine-readable code. One application of Moore et al.'s work is the

SHAMAN (2011) project. These researchers have also successfully implemented an

automated preservation system by virtualizing policies using iRODS (Moore, et al.,

2007).
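
The iRODS rule language itself is not reproduced here; instead, the sketch below illustrates the general idea of policy virtualization under assumptions of my own: each human-readable stewardship policy becomes a small machine-executable check, and an audit simply runs every check over every record so that compliance can be demonstrated. The rule names, record fields, and sample records are hypothetical.

    # Hypothetical illustration of "policy virtualization": each policy is a
    # machine-executable check over a repository record, and an audit runs all
    # checks so an administrator can show what is (and is not) being enforced.

    POLICIES = [
        ("Descriptive metadata present (enterprise-level rule)",
         lambda record: bool(record.get("title")) and bool(record.get("creator"))),
        ("Persistent identifier assigned (archives-level rule)",
         lambda record: str(record.get("identifier", "")).startswith("hdl:")),
        ("Fixity value recorded (item-level rule)",
         lambda record: bool(record.get("checksum"))),
    ]

    def audit(records):
        """Run every policy against every record; return a list of failures."""
        failures = []
        for record in records:
            for name, check in POLICIES:
                if not check(record):
                    failures.append(f"{record.get('identifier', '<no id>')}: failed {name}")
        return failures

    # Hypothetical records for demonstration:
    records = [
        {"identifier": "hdl:1234/5", "title": "Dataset A", "creator": "Ward", "checksum": "ab12"},
        {"identifier": "local-99", "title": "", "creator": "X", "checksum": ""},
    ]
    print("\n".join(audit(records)) or "All policies satisfied.")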

Another method is to encode all metadata with the object itself. Gladney and

Lorie (2005) and Gladney (2004) have proposed the creation of durable objects in which

all relevant information is encoded with the object itself. This was briefly discussed in a

previous essay, "Managing Data: Preservation Standards and Audit and Certification

Mechanisms (e.g., "policies")".
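
A rough sketch of that idea, under assumptions of my own (a ZIP container carrying a metadata file alongside the bit stream; Gladney and Lorie's actual durable encoding is considerably more sophisticated), is shown below.

    # Illustrative only: bundle a data file with descriptive and fixity metadata
    # in one self-describing package. The container format and metadata fields
    # are assumptions, not Gladney and Lorie's durable-object encoding.
    import hashlib
    import json
    import zipfile
    from pathlib import Path

    def make_self_describing_package(data_path, package_path, description):
        payload = Path(data_path).read_bytes()
        metadata = {
            "filename": Path(data_path).name,
            "description": description,
            "sha256": hashlib.sha256(payload).hexdigest(),
            "rights": "example rights statement",          # assumed value
            "provenance": "example provenance statement",  # assumed value
        }
        with zipfile.ZipFile(package_path, "w") as package:
            package.writestr(Path(data_path).name, payload)
            package.writestr("metadata.json", json.dumps(metadata, indent=2))

    # Hypothetical usage:
    #   make_self_describing_package("run_042.dat", "run_042_package.zip",
    #                                 "Calibrated output from detector run 42")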

The Application of Policies to Data and Data Curation

Beagrie, Semple, Williams, and Wright (2008) outlined a model of digital

preservation policies and analyzed how those policies could underpin key strategies for

United Kingdom (UK) Higher Education Institutions (HEI). They mapped digital


preservation to other key strategies of higher education institutions, such as

records management policies. They also examined current digital preservation policies

and modeled a digital preservation policy. The authors proposed that funders use their

study to evaluate the implementation of best practices within UK HEIs.

Similarly, Jones (2009) examined the range of policies required in HEIs for digital

curation in order to support open access to research outputs. She argued that curation

only begins once effective policies and strategies are in place. She wanted to map then-

current curation policies to pinpoint the areas that needed further development and support

so that open access to research could be achieved. The author wrote that the

implementation of curation policies in UK HEIs is patchy, although there have been

some improvements. She concluded that for effective digital curation of open access

research to occur, a robust infrastructure must be in place; financing and actual costs

must be determined; and, the differing roles and responsibilities must be defined and put

in place.

As noted earlier in this paper, research data has slightly different policy

requirements than general digital library collections, such as ePrint archives. Green,

MacDonald, and Rice (2009) addressed those policy differences and created a planning

tool and decision-making guide for institutions with existing digital repositories

that may add research data (sets) to their collections.

The authors based the guide on the OAIS Reference Model (CCSDS, 2002), the

Trusted Digital Repository Assessment Criteria (CCSDS, 2011) and the OpenDOAR

Policy Tool (Green, MacDonald, and Rice, 2009). They addressed policies related to


datasets, primarily social science, but they included policies for content such as grey

literature, video and audio files, images, and other non-traditional scholarly publications.

The authors designed the guide with the idea of supporting sound data management

practice, data sharing, and long-term access in a simplified format.

Thus, sound, strategically applied policies must underpin the efforts to steward

data for the indefinite long-term, whether they are applied at the machine level or via

human effort.

SUMMARY: THE IMPLICATIONS FOR THE LONG-TERM STEWARDSHIP OF RESEARCH DATA

Research data management is in flux, much like early digital libraries. In spite of

all of this work to create standards, and various funder requirements, some data will be

lost. The questions are: how much data will be lost; by whom; whether or not the data is

replaceable; and, how valuable is having the actual data set itself, versus knowing the

reported results of any published analysis of the lost data set(s)? It is also likely that

some data sets will languish, unused but very carefully curated.

Having said that, much less data will be lost than if no repository and policy

standards, and funder requirements, had been created and required in the first place.

Standards and funder requirements can only do so much; the data creators themselves

have to want to ensure the data is shareable and accessible for the long term, and the

infrastructure must be in place for them to do so. This infrastructure includes not

only the physical hardware and software, but also defined policies, standards, metadata,

funding, and, roles and responsibilities, among others.


Foremost among these infrastructure needs are explicit incentives for researchers to take

the time to annotate and clean up the data and any related software and scripts for re-

use -- or to take the time to ensure someone else does it for them. Information

professionals must provide the data stewardship services, but it is up to the data

creators to provide the data.

The final conclusion is that researchers want to focus on creating and analyzing

data. Some researchers care about the long-term stewardship of their data, while others

do not. It remains to be seen whether or not funders' requirements for data sharing will

impact how much data is actually made available for re-purposing, re-use, and,

preservation.

Effective data stewardship requires not just technical and standards-based

solutions, but also people, financial, and managerial solutions. As the old proverb states,

"You can lead a horse to water, but you cannot make him drink" (Speake & Simpson,

2008).


REFERENCES

Anderson, C. (2008, October 23). The end of theory: the data deluge makes the scientific method obsolete. Wired, 16.07. Retrieved November 18, 2010, from http://www.wired.com/science/discoveries/magazine/16-07/pb_theory Ball, A. (2012). Review of Data Management Lifecycle Models. Project Report. Bath, UK: University of Bath. Retrieved March 10, 2012, from http://opus.bath.ac.uk/28587/1/redm1rep120110ab10.pdf Ball, A. (2010). Review of the State of the Art of the Digital Curation of Research Data. Project Report. Bath, UK: University of Bath, (ERIM Project Document erim1rep091103ab12). Retrieved January 25th, 2012, from http://opus.bath.ac.uk/19022/ Beagrie, C. & JISC. (2010). Keeping Research Data Safe Factsheet Cost issues in digital preservation of research data. Charles Beagrie Ltd and JISC. Retrieved September 29, 2010 from http://www.beagrie.com/KRDS_Factsheet_0910.pdf Beagrie, C., Chruszcz, J. & Lavoie, B. (2008). Keeping Research Data Safe. JISC. Retrieved September 9, 2009, from http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx Beagrie, N., Chruszcz, J. & Lavoie, B. (2008). Executive summary. In Keeping research data safe. JISC. Retrieved January 24, 2009, from http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx Beagrie, N., Semple, N., Williams, P. & Wright, R. (2008). Digital Preservation Policies Study Part 1: Final Report October 2008. Salisbury, UK: Charles Beagrie, Limited. Retrieved January 24, 2012 from http://www.jisc.ac.uk/media/documents/programmes/preservation/jiscpolicy_p1finalreport.pdf Blue Ribbon Task Force on Sustainable Digital Preservation and Access. (2008, December). Sustaining the digital investment: issues and challenges of economically sustainable digital preservation. San Diego, CA: San Diego Supercomputer Center. Retrieved January 24, 2009, from http://brtf.sdsc.edu/biblio/BRTF_Interim_Report.pdf Borgman, C.L. (2008). Data, disciplines, and scholarly publishing. Learned Publishing, 21, 29-38. Retrieved January 25, 2012, from http://www.ingentaconnect.com/content/alpsp/lp/2008/00000021/00000001/art00005 Borgman, C.L. (2010). Research Data: Who will share what, with whom, when, and why? Fifth China-North America Library Conference, September 8-12, 2010, Beijing, China. Retrieved December 15, 2010, from http://works.bepress.com/borgman/238

Page 54: CITATION Ward, J.H. (2012). Managing Data: the Data Deluge ......the creation and analysis of that data, that a researcher needs to verify results or extend scientific conclusions,

31 March 2012 Literature Review #5 Jewel H. Ward

Managing Data: the Data Deluge and the Implications for Data Stewardship, v. final 54

Burgess, C. (2011, January 31). Your Name, Your Privacy, Your Digital Exhaust. Infosec Island. Retrieved March 7, 2011, from http://infosecisland.com/blogview/11450-Your-Name-Your-Privacy-Your-Digital-Exhaust.html California Digital Library. (2012). DMPTool. Retrieved February 12, 2012, from https://dmp.cdlib.org/ Carlson, J. (2012). Demystifying the data interview: developing a foundation for reference librarians to talk with researchers about their data. Reference Services Review, 40(1), 7-23. Retrieved February 9, 2012, from http://dx.doi.org/10.1108/00907321211203603 CCSDS. (2011). Audit and certification of trustworthy digital repositories recommended practice (CCSDS 652.0-M-1). Magenta Book, September 2011. Washington, DC: National Aeronautics and Space Administration (NASA). CCSDS. (2002). Reference model for an Open Archival Information System (OAIS) (CCSDS 650.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved April 3, 2007, from http://nost.gsfc.nasa.gov/isoas/ Cragin, M.H., Palmer, C.L., Carlson, J.R. & Witt, M. (2010). Data sharing, small science, and institutional repositories. Philosophical Transactions of the Royal Society, 368, 4023-4038. Curry, A. (2011). Rescue of Old Data Offers Lesson for Particle Physicists. Science, 331, 694-695. Digital Curation Centre. (2012). DMPOnline. Retrieved February 12, 2012, from http://www.dcc.ac.uk/dmponline Editor. (2009). Data's shameful neglect. Nature, 461, 145. Editor. (2005). Let data speak to data. Nature, 438, 531. Evans, J.A. & Foster, J.G. (2011). Metaknowledge. Science, 331, 721-725. Foster, N.F. & Gibbons, S. (2005). Understanding Faculty to Improve Content Recruitment for Institutional Repositories. D-Lib Magazine, 11(1). Retrieved March 8, 2012, from http://www.dlib.org/dlib/january05/foster/01foster.html Fry, J., Lockyer, S., Oppenheim, C., Houghton, J., & Rasmussen, B. (2008). Identifying benefits arising from the curation and open sharing of research data produced by UK

Page 55: CITATION Ward, J.H. (2012). Managing Data: the Data Deluge ......the creation and analysis of that data, that a researcher needs to verify results or extend scientific conclusions,

31 March 2012 Literature Review #5 Jewel H. Ward

Managing Data: the Data Deluge and the Implications for Data Stewardship, v. final 55

Higher Education and research institutes (Final report). London: JISC. Retrieved January 25, 2012, from http://ie-repository.jisc.ac.uk/ 279/ Gallagher, S. (2012, January). The Great Disk Drive in the Sky: How Web giants store big—and we mean big—data. Ars Technica. Retrieved March 7, 2012, from http://arstechnica.com/business/news/2012/01/the-big-disk-drive-in-the-sky-how-the-giants-of-the-web-store-big-data.ars Gantz, J. & Reinsel, D. (2011). Extracting Value from Chaos. IDC #1142. Retrieved February 21, 2012, from http://idcdocserv.com/1142 Ginsparg, P. (2011). ArXiv at 20. Nature, 476, 145-147. Gladney, H.M. & Lorie, R.A. (2005). Trustworthy 100-Year digital objects: durable encoding for when it is too late to ask. ACM Transactions on Information Systems, 23(3), 229-324. Retrieved December 17, 2011, from http://eprints.erpanet.org/7/ Gladney, H.M. (2004). Trustworthy 100-Year digital objects: evidence after every witness is dead. ACM Transactions on Information Systems, 22(3), 406-436. Retrieved July 12, 2008, from http://doi.acm.org/10.1145/1010614.1010617 Gray, N., Carozzi, T., & Woan, G. (2011). Managing Research Data -- Gravitational Waves. Draft final report to the Joint Information Systems Committee (JISC). University of Glasgow: Research Data Management Planning (RDMP). Retrieved March 3, 2011, from https://dcc.ligo.org/public/0021/P1000188/006/report.pdf Green, A., Macdonald, S., & Rice, R. (2009). Policy-making for research data in repositories: a guide. Edinburgh, UK: University of Edinburgh. Hanson, B., Sugden, A., & Alberts, B. (2011). Making Data Maximally Available. Science, 331, 649. Harley, D., Acord, S.K., Earl-Novell, S., Lawrence, S., & King, C.J. (2010). Assessing the Future Landscape of Scholarly Communication: An Exploration of Faculty Values and Needs in Seven Disciplines - Executive Summary. UC Berkeley: Center for Studies in Higher Education. Retrieved January 23, 2012, from http://escholarship.org/uc/item/0kr8s78v Hey, T. and Trefethen, A. (2003). The Data Deluge: An e-Science Perspective. In F. Berman, G. Fox, and A. Hey (Eds.), Grid Computing – Making the Global Infrastructure a Reality (pp. 809-824). Chichester, England: John Wiley & Sons. Retrieved January 23, 2012, from http://eprints.ecs.soton.ac.uk/7648/

Page 56: CITATION Ward, J.H. (2012). Managing Data: the Data Deluge ......the creation and analysis of that data, that a researcher needs to verify results or extend scientific conclusions,

31 March 2012 Literature Review #5 Jewel H. Ward

Managing Data: the Data Deluge and the Implications for Data Stewardship, v. final 56

Hilbert, M. & López, P. (2011). The World’s Technological Capacity to Store, Communicate, and Compute. Science Express, 332(6025), 60-65. Hough, M.G. (2009). Keeping it to ourselves: technology, privacy, and the loss of reserve. Technology in Society, 31, 406-413. Retrieved February 1, 2010, from http://libproxy.lib.unc.edu/login?url=http://dx.doi.org/10.1016/j.techsoc.2009.10.005 Hswe, P. & Holt, A. (2010). Guide for Research Libraries: The NSF Data Sharing Policy. E-Science. Association of Research Libraries. Retrieved January 6, 2012, from http://www.arl.org/rtl/eresearch/escien/nsf/index.shtml Interagency Working Group on Digital Data. (2009). Harnessing the power of digital data for science and society. Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council. Washington, DC: Office of Science and Technology Policy. Retrieved April 9, 2009, from http://www.nitrd.gov/about/Harnessing_Power_Web.pdf Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle (5th ed.). Ann Arbor, MI. Retrieved January 5, 2012, from http://www.icpsr.umich.edu/icpsrweb/content/ICPSR/access/deposit/guide/ Inter-university Consortium for Political and Social Research (ICPSR). (2011). Elements of a Data Management Plan. Data Deposit and Findings. Ann Arbor, MI: University of Michigan, Institute for Social Research. Retrieved March 10, 2012, from http://www.icpsr.umich.edu/icpsrweb/content/ICPSR/dmp/elements.html Jones, S. (2011). Summary of UK research funders’ expectations for the content of data management and data sharing plans. University of Glasgow: Digital Curation Centre (DCC). Retrieved January 26, 2012, from http://www.dcc.ac.uk/webfm_send/499 Jones, S. (2009). A report on the range of policies required for and related to digital curation. DCC Policies Report, v. 1.2. University of Glasgow: Digital Curation Centre. Retrieved January 26, 2012, from http://www.dcc.ac.uk/webfm_send/129 Kling, R. & McKim, G.W. (2000). Not just a matter of time: field differences and the shaping of electronic media in supporting scientific communication. Journal of the American Society for Information Science and Technology, 51(14), 1306-1320. Lazorchak, B. (2011). Digital Preservation, Digital Curation, Digital Stewardship: What’s in (Some) Names? Retrieved March 11, 2012, from http://blogs.loc.gov/digitalpreservation/2011/08/digital-preservation-digital-curation-digital-stewardship-what’s-in-some-names/

Page 57: CITATION Ward, J.H. (2012). Managing Data: the Data Deluge ......the creation and analysis of that data, that a researcher needs to verify results or extend scientific conclusions,

31 March 2012 Literature Review #5 Jewel H. Ward

Managing Data: the Data Deluge and the Implications for Data Stewardship, v. final 57

Lord, P. & Macdonald, A. (2003). Data curation for e-Science in the UK: An audit to establish requirements for future curation and provision (E-Science Curation Report). London: JISC. Retrieved January 26th, 2012 from http://www.jisc.ac.uk/media/documents/programmes/preservation/e-science reportfinal.pdf Lynch, C. (2008). How do your data grow? Nature, 455, 28-29. Lynch, C. (2008). The institutional challenges of cyberinfrastructure and e-research. Educause Review, 43(6). Washington, DC: Educause. Retrieved January 22, 2009, from http://www.educause.edu/EDUCAUSE+Review/EDUCAUSEReviewMagazineVolume43/TheInstitutionalChallengesofCy/163264 Lyon, L. (2007). Dealing with Data: Roles, Rights, Responsibilities and Relationships. Consultancy Report. University of Bath: UKOLN. Retrieved January 10, 2012, from http://opus.bath.ac.uk/412/ Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. & Byers, A.H. (2011). Big data: the next frontier for innovation, competition, and productivity. Report. Seoul: McKinsey Global Institute. Retrieved June 1, 2011, from http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation Moore, R., Rajasekar, A., & Marciano, R. (2007). Implementing Trusted Digital Repositories. In Proceedings of the DigCCurr2007 International Symposium in Digital Curation, University of North Carolina - Chapel Hill, Chapel Hill, NC USA, 2007. Retrieved September 24, 2010, from http://www.ils.unc.edu/digccurr2007/papers/moore_paper_6-4.pdf Moore, R. (2005). Persistent collections. In S.H. Kostow & S. Subramaniam (Eds.), Databasing the brain: from data to knowledge (neuroinformatics) (pp. 69-82). Hoboken, NJ: John Wiley and Sons. Moore, R. & Smith, M. (2007). Automated Validation of Trusted Digital Repository Assessment Criteria. Journal of Digital Information, 8(2). Retrieved March 2, 2010, from http://journals.tdl.org/jodi/article/view/198/181 Narayanan, A. & Shmatikov, V. (2007). How To Break Anonymity of the Netflix Prize Dataset. Retrieved March 7, 2012, from http://arxiv.org/abs/cs/0610105 National Aeronautics and Space Administration. (2010). The National Aeronautics


National Aeronautics and Space Administration. (2010). The National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Data Processing Levels. NASA Science Earth. Retrieved March 14, 2012, from http://science.nasa.gov/earth-science/earth-science-data/data-processing-levels-for-eosdis-data-products/
National Institutes of Health. (2010). Data Sharing Policy. NIH Grants Policy Statement (10/10) - Part II: Terms and Conditions of NIH Grant Awards, Subpart A: General - File 6 of 6. Retrieved March 7, 2012, from http://grants.nih.gov/grants/policy/nihgps_2010/nihgps_ch8.htm#_Toc271264951
National Institutes of Health. (2003). NIH Data Sharing Policy and Implementation Guidance. Grants Policy. Retrieved March 7, 2011, from http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm#fin
National Research Council. (2010). Steps toward large-scale data integration in the sciences: summary of a workshop. Reported by S. Weidman and T. Arrison, National Research Council. Washington, D.C.: The National Academies Press.
National Science Board. (2011). Digital Research Data Sharing and Management. NSB-11-79, December 14, 2011. Arlington, VA: National Science Board. Retrieved January 18, 2012, from http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf
National Science Foundation. (2011). NSF 11-1 January 2011 Chapter II - Proposal Preparation Instructions. Grant Proposal Guide. Retrieved January 16, 2011, from http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp
National Science Foundation. (2011). Dissemination and Sharing of Research Results. NSF Data Sharing Policy. Retrieved January 15, 2011, from http://www.nsf.gov/bfa/dias/policy/dmp.jsp
National Science Foundation. (2010). Data Management for NSF Engineering Directorate Proposals and Awards. Engineering (ENG), the National Science Foundation. Retrieved September 2, 2010, from http://nsf.gov/eng/general/ENG_DMP_Policy.pdf
National Science Foundation. (2005). Long-lived digital data collections enabling research and education in the 21st century (NSB-05-40). Arlington, VA: National Science Foundation. Retrieved May 5, 2008, from http://www.nsf.gov/pubs/2005/nsb0540/
National Science Foundation Cyberinfrastructure Council. (2007). Cyberinfrastructure vision for 21st century discovery (NSF 07-28). Arlington, VA: National Science Foundation. Retrieved November 12, 2007, from http://www.nsf.gov/pubs/2007/nsf0728/index.jsp


Nelson, B. (2009). Data sharing: empty archives. Nature, 461, 160-163.
Pienta, A.M., Alter, G. & Lyle, J. (2010). The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data. Paper presented at the workshop on “the Organisation, Economics and Policy of Scientific Research”, held April 23-24, 2010, in Torino, Italy. Retrieved January 5, 2012, from http://deepblue.lib.umich.edu/handle/2027.42/78307
Rajasekar, A., Wan, M., Moore, R. & Schroeder, W. (2006). A prototype rule-based distributed data management system. Paper presented at the workshop on “next generation distributed data management” at the High Performance Distributed Computing Conference, June 19-23, 2006, Paris, France.
Research Information Network, Institute of Physics, Institute of Physics Publishing, & Royal Astronomical Society. (2011). Collaborative yet independent: information practices in the physical sciences. A Research Information Network Report. London, UK: Research Information Network, December 2011. Retrieved January 26, 2012, from http://www.iop.org/publications/iop/2012/page_53560.html
Research Information Network. (2011). Data centres: their use, value, and impact. A Research Information Network report. London, UK: Research Information Network, September 2011.
Research Information Network. (2009). Patterns of information use and exchange: case studies of researchers in the life sciences. A Research Information Network Report. London, UK: Research Information Network, November 2009. Retrieved January 25, 2012, from http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/patterns-information-use-and-exchange-case-studie
Research Information Network. (2008). Stewardship of digital research data: A framework of principles and guidelines. A Research Information Network report. London, UK: Research Information Network, January 2008.
Research Information Network. (2008). To Share or not to Share: Publication and Quality Assurance of Research Data Outputs. A Research Information Network report. London, UK: Research Information Network, June 2008.
Schofield, P.N., Bubela, T., Weaver, T., Portilla, L., Brown, S.D., Hancock, J.M., Einhorn, D., Tocchini-Valentini, G., Hrabe de Angelis, M., Rosenthal, N. & CASIMIR Rome Meeting participants. (2009). Post-publication sharing of data and tools. Nature, 461, 171-173.


Science and Technology Council. (2007). The digital dilemma: strategic issues in archiving and accessing digital motion picture materials. The Science and Technology Council of the Academy of Motion Picture Arts and Sciences. Hollywood, CA: Academy of Motion Picture Arts and Sciences.
SHAMAN. (2011). Automation of Preservation Management Policies. SHAMAN – WP3-D3.4 (Report). Seventh Framework Programme and European Union.
Soehner, C., Steeves, C., & Ward, J. (2010, August). E-science and data support services: a study of ARL member institutions. Washington, D.C.: Association of Research Libraries. Retrieved November 18, 2010, from http://www.arl.org/bm~doc/escience_report2010.pdf
Solove, D.J. (2007). "I've got nothing to hide" and other misunderstandings of privacy. San Diego Law Review, 44, 745-772.
Speake, J. & Simpson, J. (2008). Oxford Dictionary of Proverbs. Oxford, UK: Oxford University Press.
Stewardship. (2012). ForestInfo.org. Dovetail Partners, Inc. Retrieved March 9, 2012, from http://bit.ly/zmNzy1
Stewardship. (2012). Free Merriam-Webster Dictionary. An Encyclopaedia Britannica Company. Retrieved March 9, 2012, from http://www.merriam-webster.com/dictionary/stewardship
Sullivan, B. (2012, March 6). Govt. agencies, colleges demand applicants' Facebook passwords. MSNBC. Retrieved March 7, 2012, from http://redtape.msnbc.msn.com/_news/2012/03/06/10585353-govt-agencies-colleges-demand-applicants-facebook-passwords
Swan, A. & Brown, S. (2008). The skills, role and career structure of data scientists and curators: an assessment of current practice and future needs. Report to JISC. Truro, UK: Key Perspectives, Ltd. Retrieved January 18, 2012, from http://www.jisc.ac.uk/publications/reports/2008/dataskillscareersfinalreport.aspx
Sweeney, L. (2002). K-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557-570.
Tickletux. (2007). Did Bill Gates say the 640k line? Retrieved from http://imranontech.com/2007/02/20/did-bill-gates-say-the-640k-line/


Toronto International Data Release Workshop Authors. (2009). Prepublication data sharing. Nature, 461, 168-170.
UKRDS. (2008). UKRDS interim report: UKRDS, the UK research data service, feasibility study (v0.1a.030708). London: Serco Ltd. Retrieved April 9, 2009, from http://www.ukrds.ac.uk/UKRDS%20SC%2010%20July%2008%20Item%205%20(2).doc
Van den Eynden, V., Corti, L., Woollard, M., Bishop, L. & Horton, L. (2011). Managing and Sharing Data: Best Practices for Researchers (3rd ed.). University of Essex: UK Data Archive. Retrieved January 5, 2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
Walters, T. & Skinner, K. (2011). New roles for new times: digital curation for preservation. Report prepared for the Association of Research Libraries. Washington, D.C.: Association of Research Libraries. Retrieved April 2, 2011, from http://www.arl.org/bm~doc/nrnt_digital_curation17mar11.pdf
Waters, D. & Garrett, J. (1996). Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. Washington, DC: CLIR, May 1996.


APPENDIX A

The following tables (reproduced here as figures) of organizations, individuals, roles, sectors, and types involved with data management are from the Interagency Working Group on Digital Data (2009):

1. Entities by Role
2. Entities by Individual
3. Entities by Sector
4. Individuals by Role
5. Individuals by Life Cycle Phase/Function
6. Entities by Life Cycle Phase/Function


Figure 6 - Entities by Role (Interagency Working Group on Digital Data, 2009).


Figure 7 - Entities by Role (Interagency Working Group on Digital Data, 2009).


Figure 8 - Entities by Role (Interagency Working Group on Digital Data, 2009).


Figure 9 - Entities by Individuals (Interagency Working Group on Digital Data, 2009).


Figure 10 - Entities by Sector, with footnotes (Interagency Working Group on Digital Data, 2009).


Figure 11 - Individuals by Role (Interagency Working Group on Digital Data, 2009).

Figure 12 - Individuals by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).


Figure 13 - Entities by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).