Needs for Data Management & Citation Throughout the Information Lifecycle

Micah Altman
Director of Research, MIT Libraries

Prepared for the NISO Forum: Tracking it Back to the Source: Managing and Citing Research Data, September 2012




Collaborators and Co-Conspirators

• Jonathan Crabtree, Merce Crosas, Gary King, Tom Lipkis, Nancy McGovern, John Willinsky

• Research Support
– Library of Congress (PA#NDP03-1)
– National Science Foundation (DMS-0835500, SES 0112072)
– Institute for Museum and Library Services (LG-05-09-0041-09)
– Sloan Foundation
– Amazon Web Services
– Massachusetts Institute of Technology


Related Work

Reprints available from: http://maltman.hmdc.harvard.edu

• Altman, M. 2012. "Data Citation in The Dataverse Network®." In P. F. Uhlir (Ed.), Developing Data Attribution and Citation Practices and Standards: Report from an International Workshop. National Academies Press. Forthcoming.

• Altman, M., & Crabtree, J. 2011. "Using the SafeArchive System: TRAC-Based Auditing of LOCKSS." Archiving 2011 (pp. 165–170). Society for Imaging Science and Technology.

• Altman, M., Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. 2009. "Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences." The American Archivist 72(1): 169–182.

• Altman, M. 2008. "A Fingerprint Method for Verification of Scientific Data." In Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007). Springer-Verlag.

• Altman, M., & King, G. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data." D-Lib Magazine 13(3/4), March/April.


Preview

• Principled approach to data management
• Lifecycle data management planning
• Lifecycle data management tracking
• Lifecycle data management infrastructure
• [Exemplar Projects]


(Some) Timely Challenges


“Data science is suddenly sexy – does that mean data is the new black?”


Valuable Data is Lost

Examples

Intentionally Discarded: “Destroyed, in accord with [nonexistent] APA 5-year post-publication rule.”

Unintentional Hardware Problems: “Some data were collected, but the data file was lost in a technical malfunction.”

Acts of Nature: “The data from the studies were on punched cards that were destroyed in a flood in the department in the early 80s.”

Discarded or Lost in a Move: “As I retired …. Unfortunately, I simply didn’t have the room to store these data sets at my house.”

Obsolescence: “Speech recordings stored on a LISP Machine…, an experimental computer which is long obsolete.”

Simply Lost: “For all I know, they are on a [University] server, but it has been literally years and years since the research was done, and my files are long gone.”

Research by:

• Researchers lack archiving capability

• Incentives for data sharing are weak


Unpublished Data Ends up in the “Desk Drawer”

Daniel Shechtman’s Lab Notebook Providing Initial Evidence of Quasicrystals

• Null results are less likely to be published
• Outliers are routinely discarded


Data Behind Publications Unavailable for Review, Reuse, Replication


Model Science

“Citations to unpublished data and personal communications cannot be used to support claims in a published paper”

“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”


Compliance with Policies is Low

• Compliance is low even in best examples of journals
• Checking compliance manually is tedious, doesn’t scale


Special Challenges for Long-Term Access to New Forms of Data

• Some Examples
– GIS and geospatial trails
– Facebook & social networks
– Text: blogs, tweets
– Cell phone data

• Challenges
– Proprietary – intellectual property
– Size
– Dynamic content
– Fixity
– Format

Source: [Calberese 2008]


A Lifecycle Framework


“The published article is not scientific output – it’s a summary of scientific output.”

-- corollary of Buckheit & Donoho 1995


Information Lifecycle

Creation/Collection
Storage/Ingest
Processing
Internal Sharing
Analysis
External dissemination/publication
Re-use
• Scientific
• Educational
• Scientometric
• Institutional
Long-term access


Stakeholders

Lifecycle stages: Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use; Long-term access

• Scholarly Publishers
• Researchers
• Data Archives/Publisher
• Research Sponsors
• Data Sources/Subjects
• Consumers
• Service/Infrastructure Providers
• Research Organizations


Legal Requirements and Rights

Contract; Intellectual Property; Access Rights; Confidentiality

Copyright; Fair Use; DMCA; Database Rights; Moral Rights; Intellectual Attribution; Trade Secret; Patent; Trademark; Common Rule; 45 CFR 46; HIPAA; FERPA; EU Privacy Directive; Privacy Torts (Invasion, Defamation); Rights of Publicity; Sensitive but Unclassified; Potentially Harmful (Archeological Sites, Animal Testing, …); Classified; FOIA; CIPSEA; State Privacy Laws; EAR; State FOI Laws; Journal Replication Requirements; Funder Open Access; Contract; License; Click-Wrap; TOU; ITAR


Stakeholders, Rights and Requirements

Rights and requirements: Contract; Intellectual Property; Access Rights; Confidentiality; Copyright; Fair Use; DMCA; Database Rights; Moral Rights; Intellectual Attribution; Trade Secret; Patent; Trademark; Common Rule; 45 CFR 46; HIPAA; FERPA; EU Privacy Directive; Privacy Torts (Invasion, Defamation); Rights of Publicity; Sensitive but Unclassified; Potentially Harmful (Archeological Sites, Animal Testing, …); Classified; FOIA; CIPSEA; State Privacy Laws; State FOI Laws; Journal Replication Requirements; Funder Open Access; Contract; License; Click-Wrap; TOU

Stakeholders:
• Scholarly Publishers
• Primary Researchers
• Consumers
– Secondary research
– Participative Science
– Public policy uses
• Infrastructure/Service Providers
• Research Organizations
• Data Archives
• Sources/Subjects
• Research Sponsors


Stakeholder Drivers per Stage of Information Lifecycle

Stage: Research Proposal, Design and Data Collection

Subjects
– Legal constraints: Consent/contract
– Concerns: Public benefit; Privacy; Future access to own information

Sources
– Legal constraints: Intellectual Property; Contract
– Concerns: Business confidentiality; IP; Profit from licenses

Funder
– Legal constraints: Open Access; Confidentiality
– Concerns: Public benefit; Policy Relevance; Reproducible Research; Future access

Primary Researcher
– Legal constraints: Confidentiality; Contract; IP
– Concerns: Publication potential; Compliance with institutional/funder requirements

Research Institution
– Legal constraints: Confidentiality; Contract; IP
– Concerns: Compliance with funder requirements; License, IP, confidentiality compliance


Stakeholder Drivers per Stage of Information Lifecycle

Stage: Data Storage, Analysis (Pre-publication)

Primary Researcher
– Legal constraints: Confidentiality; Contract; IP
– Concerns: Publication potential; Compliance with institutional/funder requirements

Research Institution
– Legal constraints: Confidentiality; Contract; IP
– Concerns: License, IP, confidentiality compliance; Records management

Service Providers
– Legal constraints: Contract; (Selected Cases) Confidentiality Requirements
– Concerns: Contract; Service business model; Service deployment


Stakeholder Drivers per Stage of Information Lifecycle

Stage: Publication

Primary Researcher
– Legal constraints: Compliance for: Source/subjects; Sponsor; Host institution; Publisher
– Concerns: Scholarly attribution/credit; Promote use of research; Track use/impact of research

Sponsor
– Concerns: Track research products; Track compliance; Track use/impact

Research Institution
– Legal constraints: Sponsor compliance
– Concerns: Track OA products; Records management; Intellectual property

Scholarly/Journal Publisher
– Legal constraints: IP; Contract
– Concerns: Impact/use; Profit/business model; Replicability

Data Publisher
– Legal constraints: IP
– Concerns: Profit/business model; Replicability; Connection to publication


Stakeholder Drivers per Stage of Information Lifecycle

Stage: Re(use)

Research Reader
– Legal constraints: Access Rights
– Concerns: Provenance

Secondary Researcher
– Legal constraints: Access rights; Confidentiality; Contract
– Concerns: Replicability; Data reintegration/reanalysis; Linking publications and data; Provenance

“Citizen/Community Scientist”
– Legal constraints: Access Rights
– Concerns: Data redissemination/reanalysis; Linking publications and data

Public Policy
– Legal constraints: Access Rights
– Concerns: Provenance; Replicability; Linking publications and data

Education/teaching
– Legal constraints: Access Rights
– Concerns: “Classroom” use; MOOC use


Lifecycle Management: Data Management Planning


Some Formal “DMP” Requirements

• NIH: The Final NIH Statement on Sharing Research Data was published in the NIH Guide on February 26, 2003.
– “Starting with the October 1, 2003 receipt date, investigators submitting an NIH application seeking $500,000 or more in direct costs in any single year are expected to include a plan for data sharing or state why data sharing is not possible.”
– Data are to be shared no later than the acceptance for publication of the main findings from the final data set

• NSF: All proposals must (as of January 2011) include a data management plan.
– Specific requirements are vague, for the most part: they “will be determined by the community of interest through the process of peer review and program management.”

• Wellcome Trust:
– “will review data management and sharing plans, and any costs involved in delivering them, as an integral part of the funding decision”

DMP Goals

• Orchestrate data for current use
• Control disclosure
• Compliance with contracts, regulations, law, and policy
• Maximize value of information assets
• Ensure short term and long term dissemination

DMP Elements

• Orchestrate data for current use
– Quality Assurance
– Storage, backup, replication, and versioning
– Data Formats
– Data Organization
– Budget
– Metadata and documentation

• Control disclosure
– Access and Sharing
– Intellectual Property Rights
– Legal Requirements
– Security

• Compliance with contracts, regulations, law, and policy
– Access and Sharing
– Adherence
– Responsibility
– Ethics and privacy
– Security

• Value of information assets
– Data description
– Data value
– Relation to collection
– Relation to evidence base
– Budget

• Ensure short term and long term dissemination
– Data description
– Institutional Archiving Commitments
– Audience
– Access and Sharing
– Data Formats
– Data Organization
– Metadata and documentation
– Budget

DMP Details

• Sharing
– Plans for depositing in an existing public database
– Access procedures
– Embargo periods
– Access charges
– Timeframe for access
– Technical access methods
– Restrictions on access

• Long term access (Preservation)
– Requirements for data destruction, if applicable
– Procedures for long term preservation
– Institution responsible for long-term costs of data preservation
– Succession plans for data should archiving entity go out of existence

• Formats
– Generation and dissemination formats and procedural justification
– Storage format and archival justification
– Format documentation

• Metadata and documentation
– Internal and External Identifiers and Citations
– Metadata to be provided
– Metadata standards used
– Planned documentation and supporting materials
– Quality assurance procedures for metadata and documentation

• Data Organization
– File organization
– Naming conventions

• Storage, backup, replication, and versioning
– Facilities
– Methods
– Procedures
– Frequency
– Replication
– Version management
– Recovery guarantees

• Security
– Procedural controls
– Technical Controls
– Confidentiality concerns
– Access control rules
– Restrictions on use

• Budget
– Cost of preparing data and documentation
– Cost of storage and backup
– Cost of permanent archiving and access

• Intellectual Property Rights
– Entities who hold property rights
– Types of IP rights in data
– Protections provided
– Dispute resolution process

• Legal Requirements
– Provider requirements and plans to meet them
– Institutional requirements and plans to meet them

• Responsibility
– Individual or project team role responsible for data management
– Qualifications, certifications, and licenses of responsible parties

• Ethics and privacy
– Informed consent
– Protection of privacy
– Data use agreements
– Other ethical issues

• Adherence
– When will adherence to data management plan be checked or demonstrated
– Who is responsible for managing data in the project
– Who is responsible for checking adherence to data management plan
– Auditing procedures and framework

• Value of information assets
– Project use value
– Institutional audience and uses
– Public audience and uses
– Relation to institutional collection
– Relation to disciplinary evidence base
– Cost of re-creating data


Approaching Requirement Overlap

• Sanity-check DMP details with lifecycle questions:
– Who wants it?
– What do they need it for?
– When will it be used?

• Be conscious of elements that serve multiple goals or lifecycle stages
– Metadata/documentation
– Identifiers
– Budgets
– Formats
– IP Rights and confidentiality restrictions
– Responsibilities/Adherence

• Use tracking tools and methods throughout lifecycle


Lifecycle Management: Tracking


What do we track?

What tools and methods provide technical leverage or incentives to management across lifecycle stages and among actors?

• Identification – identifiers, references, citations
• Provenance – relationship of delivered data to history of inputs and modifications and actors responsible for these; revision control; versioning
• Authenticity: assertions about the provenance of the records
• Respect des fonds: assertions about the original organization of the records
• Chain of custody: assertions about the ownership of the records
• Integrity: assertions about the management of the records; fixity of bits; fixity of semantics
• Auditing: verification of properties & policy compliance

Sources: Bulleted list of attributes adapted from Moore 2008

Tracking Across Information Lifecycle

Lifecycle stages: Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use; Long-term access

Tracked across the stages: citation; identifiers; metadata for integrity, provenance, custody


Data Citation: a Point of Leverage

• Services
– Identifiers to specific fixed versions of data are needed to establish unambiguous chains of provenance
– Identifiers that can be globally resolved to machine-understandable metadata and to the identified object are needed to build generalized access and analysis services
– Persistence of identifiers is needed to maintain long-term access

• Incentives
– Scholarly credit (intellectual attribution) is a large motivator for many researchers – citation creates incentive for researchers to publish data
– Scholars also comply with enforceable journal policies – requiring data citation is a light-weight method to make data access policies auditable
– Impact/usage is a motivator for public research funders – data citation provides foundation for measures of usage and impact
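One way to picture the service requirements above is a citation record that carries a persistent identifier, the fixed version that was used, and a fixity value. The sketch below is only an illustration: the class name, field names, rendering order, and example values are all assumptions, not a normative citation format and not the format of any particular archive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical citation record; every field name is illustrative."""
    authors: tuple[str, ...]
    year: int
    title: str
    identifier: str   # persistent, globally resolvable identifier
    version: str      # the fixed version of the data that was used
    fingerprint: str  # fixity value for verifying the cited content

    def render(self) -> str:
        """Render the record as a one-line, human-readable citation."""
        names = "; ".join(self.authors)
        return (f'{names}. {self.year}. "{self.title}." '
                f"{self.identifier}, version {self.version}. "
                f"Fingerprint: {self.fingerprint}")

    def same_evidence(self, other: "DataCitation") -> bool:
        """True when both citations point at the same fixed version."""
        return (self.identifier, self.version, self.fingerprint) == (
            other.identifier, other.version, other.fingerprint)


# Placeholder values, not a real dataset or identifier.
cited = DataCitation(("Jane Doe",), 2012, "Example Survey",
                     "doi:10.0000/example", "2", "sha256:abc123")
print(cited.render())
```

Keeping version and fingerprint inside the citation is what lets a later reader check that the evidence they retrieve is the evidence the article relied on.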


Emerging Practices for Data Citation

• Publishers
– OECD iLibrary
– Thomson Reuters Data Citation Index

• Data archives
– Dataverse Network
– Data-PASS

• Harmonization efforts
– DataCite
– NAS BRDI
– ICSU/CODATA

• Discipline specific approaches

Identifier and Citation Use Cases

Attribution
• Provide scholarly attribution
• Provide legal attribution
• Identify contributors to data

Discovery
• Locate data via identifier
• Locate data integral to article
• Locate works related to data – articles, derivatives, sources

Persistence
• Does evidence persist as long as assertions based on it?
• Is durability of evidence transparent?

Access
• Access to surrogate
• On-line access to object
• Machine understandability
• Long-term understandability

Verification
• Associate work with version of evidence used
• Verify fixity of bits
• Verify fixity of information
• Verify “authenticity” of work


Emerging Principles for Data Citation

• Data citations should be first class objects for publication: they should appear with citations to other works and be as easy to cite as other works

• Citations should persist and enable access to fixed version of data at least as long as the citing work exists

• Citations should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem


Fixity

• Are files, bitstreams corrupted?
• Do semantics remain the same over time, across formats, software analysis systems?

Some semantic approaches…
• Universal Numeric Fingerprint – Canonicalization
• Perceptual Signatures – Characterization of Significant Properties
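The difference between fixity of bits and fixity of semantics can be sketched in a few lines. The code below is a simplified illustration of the canonicalization idea only; it is not the Universal Numeric Fingerprint algorithm. The function names, the 7-significant-digit rounding, and the toy delimited-text parser are all assumptions made for the example.

```python
import hashlib


def bit_fixity(data: bytes) -> str:
    """Digest of the exact byte stream; changes if any bit changes."""
    return hashlib.sha256(data).hexdigest()


def canonical_value(text: str, digits: int = 7) -> str:
    """Normalize one cell: numbers are rounded to `digits` significant
    digits in a fixed exponential form; other text is whitespace-stripped."""
    cell = text.strip()
    try:
        return format(float(cell), f".{digits - 1}e")
    except ValueError:
        return cell


def semantic_fixity(rows: list[list[str]], digits: int = 7) -> str:
    """Digest of a canonical rendering of the values, not of the file."""
    canonical = "\n".join(
        "\t".join(canonical_value(cell, digits) for cell in row)
        for row in rows
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def parse_delimited(data: bytes, delimiter: str) -> list[list[str]]:
    """Toy parser for the illustration; quoting rules are not handled."""
    lines = data.decode("utf-8").splitlines()
    return [line.split(delimiter) for line in lines if line]


# The same table stored in two formats with different number spellings.
csv_copy = b"id,weight\n1,2.50\n2,0.1\n"
tsv_copy = b"id\tweight\n1\t2.5\n2\t1e-1\n"

bits_match = bit_fixity(csv_copy) == bit_fixity(tsv_copy)
meaning_matches = (semantic_fixity(parse_delimited(csv_copy, ","))
                   == semantic_fixity(parse_delimited(tsv_copy, "\t")))
```

Here the two copies differ at the bit level but agree after canonicalization, which is the property a format migration needs to preserve.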


Audit [aw-dit]:

An independent evaluation of records and activities to assess a system of controls

Fixity mitigates risk only if used for auditing.


Example: Functions of Storage Auditing

• Detect corruption/deletion of content

• Verify compliance with storage/replication policies

• Prompt repair actions
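The three functions above can be sketched as a single audit pass. This is a minimal illustration under stated assumptions: replicas are modeled as in-memory dictionaries, every replica is expected to hold every object, the policy is just a minimum number of good copies, and all names (`AuditReport`, `audit`, `min_copies`) are hypothetical rather than taken from any real auditing tool.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class AuditReport:
    corrupted: list[tuple[str, str]] = field(default_factory=list)     # (replica, object)
    missing: list[tuple[str, str]] = field(default_factory=list)       # (replica, object)
    under_replicated: list[str] = field(default_factory=list)          # object ids
    repairs: list[tuple[str, str, str]] = field(default_factory=list)  # (object, source, target)


def audit(manifest: dict[str, str],
          replicas: dict[str, dict[str, bytes]],
          min_copies: int) -> AuditReport:
    """Compare every replica against the manifest of expected SHA-256 digests."""
    report = AuditReport()
    for obj, expected in sorted(manifest.items()):
        good, bad = [], []
        for name, store in sorted(replicas.items()):
            content = store.get(obj)
            if content is None:
                report.missing.append((name, obj))          # detect deletion
                bad.append(name)
            elif hashlib.sha256(content).hexdigest() != expected:
                report.corrupted.append((name, obj))        # detect corruption
                bad.append(name)
            else:
                good.append(name)
        if len(good) < min_copies:
            report.under_replicated.append(obj)             # policy check
        if good:
            # Prompt repair: recopy from the first verified replica.
            report.repairs.extend((obj, good[0], target) for target in bad)
    return report


payload = b"survey wave 1"
manifest = {"study-1": hashlib.sha256(payload).hexdigest()}
replicas = {
    "site-a": {"study-1": payload},
    "site-b": {"study-1": b"survey wave X"},  # silently altered copy
    "site-c": {},                             # copy was deleted
}
report = audit(manifest, replicas, min_copies=2)
```

The report only prompts repairs; whether and how to execute them is left to the surrounding system.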


Audit Design Choices

• Audit regularity and coverage: on-demand (manually); on event; randomized sample; scheduled/comprehensive

• Audit procedure, algorithms, certifying authority

• Auditing scope: integrity of object; integrity of collection; integrity of network; policy compliance; public/transparent auditing

• Trust model

• Threat model
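The coverage choice can be made concrete with a small selection routine. This is a hedged sketch: the mode names, the `sample_fraction` parameter, and the function itself are assumptions for illustration, and only two of the listed coverage options are shown.

```python
import random
from typing import Optional


def select_for_audit(object_ids: list[str], mode: str,
                     sample_fraction: float = 0.1,
                     seed: Optional[int] = None) -> list[str]:
    """Pick the objects to check in one audit pass.

    mode="comprehensive": every object is checked.
    mode="random-sample": a random subset is checked (at least one object).
    """
    if mode == "comprehensive":
        return sorted(object_ids)
    if mode == "random-sample":
        if not object_ids:
            return []
        wanted = round(len(object_ids) * sample_fraction)
        k = min(len(object_ids), max(1, wanted))
        # A seeded generator makes the sample reproducible for later review.
        return sorted(random.Random(seed).sample(sorted(object_ids), k))
    raise ValueError(f"unknown audit mode: {mode}")


ids = [f"obj-{i:02d}" for i in range(20)]
sample = select_for_audit(ids, "random-sample", sample_fraction=0.25, seed=7)
```

A randomized sample trades coverage for cost; a scheduled comprehensive pass checks everything but takes longer to complete.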


Lifecycle Management: Infrastructure


Many Tools, Few Solutions

• Many scientific tools are embedded in needs, perspectives, and practices of specific disciplines

• Identify common requirements

• Identify gaps across lifecycle stages and among actors

“Poor carpenters blame their tools” –Proverb

“If all you have is a hammer, everything looks like a nail” – Another Proverb

“Ultimately, some people need holes – but no one needs a drill. ” – Yet Another Proverb


Core Requirements for Data Sharing Infrastructure

• Stakeholder incentives – recognition; citation; payment; compliance; services

• Dissemination – access to metadata; documentation; data

• Access control – authentication; authorization; rights management

• Provenance – chain of control; verification of metadata, bits, semantic content

• Persistence – bits; semantic content; use

• Legal protection – rights management; consent; record keeping; auditing

• Usability – discovery; deposit; curation; administration; collaboration

• Business model

Sources: King 2007; ICSU 2004; NSB 2005

Mind the Gaps

Lifecycle stages compared: collection; storage; analysis; dissemination; reuse

Scientific Workflow Software (e.g. Taverna)
– Strengths: Close integration across supported lifecycle; Perceived as useful service by researchers; High performance
– Other gaps: Discipline-centric; Doesn’t address most storage requirements (replication, access control)

Storage Grid/VRE (e.g. iRODS)
– Strengths: Integration across supported lifecycle; Storage is perceived as useful service by researchers; High performance
– Other gaps: Loose integration of analysis, insufficient for reproducibility

Institutional Repository (e.g. DSpace)
– Strengths: Low cost; Institutional commitment to long-term access
– Other gaps: Access and discovery mechanisms usually tailored to publications, not data

Reproducible Publications Systems (e.g. StatWeave)
– Strengths: Close integration of analysis and scientific publication; Reduces risk of embarrassment when working with “co-authors”; Ensures one form of reproducibility (calibration, mechanical replicability)
– Other gaps: Addresses replication but not reuse for secondary analysis, integration

“Data Archive”
– Strengths: Richer support for reuse; Often supports cross-discipline discovery; long-term access
– Other gaps: Varied models – curated database; “virtual archive”, disciplinary repository; Often discipline-centric


Exemplar Efforts

(A.K.A., What have you done for me lately?)

Automatic Auditing of Data Replication & Integrity Policies

• Audit Data Replication & Integrity Policies

safearchive.org

The Distributed Content Replication Problem

• We hold digital assets we wish to preserve

• Many of these assets are not replicated

• Even when replicated, vulnerable to single points of failure because replicas are managed by single institution

A Partial Solution: LOCKSS
• Self-contained OSS
• Harvests resources via open interfaces
• Replicated through secure P2P protocol
• Self-repairing
• Zero trust
• Used by hundreds of institutions for collaborative preservation

What we needed…
• Auditing – how many replicas exist, where, and are they current?
• Policy – prove replicas are consistent with policy, like TRAC?
• Collaboration – coordinate with partners to replicate content?
• Integration – easily replicate content in institutional repositories?


Resilience of peer-to-peer with the Accountability of centralized system

Facilitating collaborative replication and preservation with cyberinfrastructure …
• Collaborators declare explicit non-uniform resource commitments
• Policy records and schematizes commitments, desired TRAC replication properties
• Storage layer provides replication, integrity, freshness, versioning
• SafeArchive software provides monitoring, auditing, transparency, and provisioning
• Content is harvested through HTTP (LOCKSS) or OAI-PMH
• Integration of LOCKSS, Institutional Repositories, TRAC
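To illustrate what checking a replication plan against declared, non-uniform commitments might involve, here is a minimal sketch. It does not reproduce SafeArchive's policy schema or software; `Collection`, `check_plan`, the host names, and the gigabyte figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Collection:
    name: str
    size_gb: float
    required_copies: int  # desired number of replicas on distinct hosts


def check_plan(commitments_gb: dict[str, float],
               collections: list[Collection],
               plan: dict[str, list[str]]) -> list[str]:
    """Return human-readable policy violations (empty list means compliant)."""
    problems = []
    used = {host: 0.0 for host in commitments_gb}
    for coll in collections:
        hosts = plan.get(coll.name, [])
        for host in hosts:
            if host not in commitments_gb:
                problems.append(f"{coll.name}: unknown host {host}")
        distinct = {h for h in hosts if h in commitments_gb}
        if len(distinct) < coll.required_copies:
            problems.append(
                f"{coll.name}: {len(distinct)} of "
                f"{coll.required_copies} required copies")
        for host in distinct:
            used[host] += coll.size_gb
    for host in sorted(used):
        if used[host] > commitments_gb[host]:
            problems.append(
                f"{host}: {used[host]:g} GB planned, "
                f"{commitments_gb[host]:g} GB committed")
    return problems


# Each partner commits a different amount of storage.
commitments = {"a": 100, "b": 50, "c": 10}
collections = [Collection("surveys", 40, 3), Collection("maps", 20, 2)]
plan = {"surveys": ["a", "b", "c"], "maps": ["a", "b"]}
violations = check_plan(commitments, collections, plan)
```

Recording commitments explicitly is what makes a plan auditable: a violation is a statement about the policy, independent of what the storage layer currently holds.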


ORCID is an international, interdisciplinary, open, and not-for-profit organization created for the benefit of all stakeholders, including research institutions, funding organizations, publishers, and researchers to enhance the scientific discovery process and improve collaboration and the efficiency of research funding.

ORCID aims to solve the name ambiguity problem in scholarly communications by creating a registry of persistent unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID, other ID schemes, and research objects such as publications, grants, and patents.

http://orcid.org


ORCID Launch to Public in October

The ORCID Launch Partners Program includes research institutions, publishers, research funders, data repositories, and third party providers, such as:

The American Physical Society, Aries Systems, Avedas, Boston University, the California Institute of Technology, CrossRef, Elsevier, Faculty of 1000, figshare, Hindawi Publishing Corporation, KNODE, Nature Publishing Group, SafetyLit, Symplectic, Thomson Reuters, Total-Impact, and Wellcome Trust.

At Launch, the ORCID Registry will:

• Allow researchers and scholars to register for an ORCID identifier, create ORCID records, and manage their privacy settings
• Contain ORCID records created by universities on behalf of their researchers and scholars
• Allow researchers and scholars to link their ORCID record to external identifiers, including Scopus and ResearcherID
• Facilitate synchronization of ORCID identifier record data with external systems including Scopus
• Bi-directionally link to a number of author profile and manuscript submission systems, including the American Physical Society, Aries Systems, Hindawi Publishing Corporation, Nature Publishing Group, and Scholar One Manuscripts
• Allow researchers and scholars to search and upload publication metadata from CrossRef
• (Soon after launch) have the ability to link to grant application systems

Data Management Workflows for Open Access Journals

http://bit.ly/DVNOJS

Embed Real Data Archives in Journals

• Embed remotely managed data archive in OJS journal
• Replaces “supplemental materials”
• Adds
– Online analysis
– Independent storage
– Persistent identifiers and citation
– Data versioning
– Enhanced discoverability and interoperability
– Format normalization
– Fixity and replication

Integrated Policies, Workflow, Access

• OJS and DVN
– Support workflows
– Enforce policies
– Disseminate content

• Integrate policies for
– Access and data license
– Embargoes
– Citation

• Coordinate
– Submission
– Review
– Publication

• Link
– Content
– Subscriptions & notifications
– Usage Metrics


Wrapping Up


How will we see the geography of science when we reveal how research connects through data?

Research & Node Layout: Kevin Boyack and Dick Klavans (mapofscience.com); Data: Thomson ISI; Graphics & Typography: W. Bradford Paley (didi.com/brad); Commissioned by: Katy Börner (scimaps.org)

Seed Magazine, Mar 7, 2007
http://seedmagazine.com/content/article/scientific_method_relationships_among_scientific_paradigms/


Summary

• Principled approach to data management
– Follow information through information lifecycle
– Assess stakeholder requirements
– Track management, use, impact across lifecycle

• Data management planning goals
– Orchestrate data for current use
– Protect against disclosure
– Compliance with contracts, regulations, law, and policy
– Maximize value of information assets
– Ensure short term and long term dissemination

• Lifecycle data management tracking
– Identification – identifiers, references, citations
– Provenance – relationship of delivered data to history of inputs and modifications and actors responsible for these
– Authenticity: assertions about the provenance of the records
– Chain of custody: assertions about the ownership of the records
– Integrity: assertions about the management of the records; fixity of bits; fixity of semantics
– Auditing: verification of properties & policy compliance

• Data citation is a key leverage point
– Services: establish provenance; access; long-term preservation
– Incentives: scholarly credit; reproducible research policies; impact/usage analysis
– Data citations should be first class objects for publication: they should appear with citations to other works and be as easy to cite as other works


Additional References

• Buckheit, J., & Donoho, D. L. 1995. "Wavelab and reproducible research." In A. Antoniadis (Ed.), Wavelets and Statistics (pp. 55–81). New York, NY: Springer.

• International Council for Science (ICSU). 2004. ICSU Report of the CSPR Assessment Panel on Scientific Data and Information. Report.

• King, G. 2007. "An Introduction to the Dataverse Network as an Infrastructure for Data Sharing." Sociological Methods and Research 36.

• Moore, R. 2008. "Towards a Theory of Digital Preservation." International Journal of Digital Curation 3(1).

• National Science Board (NSB). 2005. Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century. NSF (NSB-05-40).

Discussion

Contact information:

Web: http://micahaltman.com

E-mail: [email protected]

Twitter: @drmaltman