Prepared for MIT Libraries Informatics Program Brown Bag Talk, August 2013. Emerging Data Citation Infrastructure. Dr. Micah Altman <[email protected]>, Director of Research, MIT Libraries

Emerging Data Citation Infrastructure

DESCRIPTION

Data citation supports attribution, provenance, discovery, and persistence. It is not (and should not be) sufficient for all of these things, but it is an important component. In the last two years, there have been several major efforts to standardize data citation practices, build citation infrastructure, and analyze data citation practices. This session, presented as part of the Program on Information Science seminar series, examines data citation from an information lifecycle approach: what are the use cases, requirements, and research opportunities? The session will also discuss emerging infrastructure and standardization efforts around data citation. A number of principles have emerged for citation; the most central is that data citations should be treated consistently with citations to other objects: data citations should at least provide the minimal core elements expected in other modern citations, should be included in the references section along with citations to other elements, and should be indexed in the same way. Adoption of data citation by journals can provide positive and sustainable incentives for more reproducible science and more complete attribution. This would act to brighten the dark matter of science, revealing connections among evidence bases that are not now visible through citations of articles.

Citation preview

Page 1: Emerging Data Citation Infrastructure

Prepared for

MIT Libraries Informatics Program Brown Bag Talk

August 2013

Emerging Data Citation Infrastructure

Dr. Micah Altman <[email protected]>

Director of Research, MIT Libraries

Page 2: Emerging Data Citation Infrastructure

Emerging Data Citation Practices

DISCLAIMER

These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer:

“It’s tough to make predictions, especially about the future!”

-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Page 3: Emerging Data Citation Infrastructure

Collaborators & Co-Conspirators

• Merce Crosas, IQSS, Harvard U.
• Data-PASS Steering Committee <data-pass.org>
• CODATA-ICSTI Task Group on Data Citation Standards and Practices <www.codata.org/taskgroups/TGdatacitation/>
• Research Support – Thanks to the National Academies BRDI Sponsors: Department of Energy (DOE), Institute of Museum and Library Services (IMLS), The Library of Congress (LOC), Microsoft Research, National Institute of Standards and Technology (NIST), National Institutes of Health (NIH), National Oceanic and Atmospheric Administration (NOAA), National Science Foundation (NSF), U.S. Geological Survey (USGS) & the Massachusetts Institute of Technology.

Page 4: Emerging Data Citation Infrastructure

Related Work

• CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013, “Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data”, Data Science Journal. Forthcoming.
• P. F. Uhlir (Ed.), Developing Data Attribution and Citation Practices and Standards: Report from an International Workshop. National Academies Press. Forthcoming.
• M. Altman, 2008, "A Fingerprint Method for Verification of Scientific Data", in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag.
• Altman, M., & King, G. 2007. A Proposed Standard for the Scholarly Citation of Quantitative Data. D-Lib Magazine, 13(3/4).

Most reprints available from: informatics.mit.edu

Page 5: Emerging Data Citation Infrastructure

This Talk

• What is data citation? Why Cite?
• Emerging Principles
• On the horizon

Page 6: Emerging Data Citation Infrastructure

What’s Wrong with this Picture?

“To test Benet’s (1998) theory of “politically-induced intelligence” (Benet 1999, pg 8), use a hierarchical corrected contingency model (see Altman & Smith 2010; Edgeworth 1863). We apply this model to a snowball sample (Glass 1973) of eligible voters14, to which the standard Stanford-Binet (Stanford & Binet 1766) has been applied. Our results show that adoption of Pastafarrianism can be expected to yield an increase mean intelligence by 10.3 points. ”

13 We thank Jon Sample, Director of the institute of the Pastaffarian institute for supplying this dataset, which is available upon request.

Page 7: Emerging Data Citation Infrastructure

“How much slower would scientific progress be if the near universal standards for scholarly citation of articles and books had never been developed? Suppose shortly after publication only some printed works could be reliably found by other scholars; or if researchers were only permitted to read an article if they first committed not to criticize it, or were required to coauthor with the original author any work that built on the original … [If] printed works existed in different libraries under different titles; if researchers routinely redistributed modified versions of other authors' works without changing the title or author listed; or if publishing new editions of books meant that earlier editions were destroyed?...” – Altman & King 2007

Page 8: Emerging Data Citation Infrastructure

“Citations to unpublished data and personal communications cannot be used to support claims in a published paper”

“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”

Ideal

Helping Journals Manage Data

Page 9: Emerging Data Citation Infrastructure

Reality

Helping Journals Manage Data

Compliance is low even in best examples of journals

Checking compliance manually is tedious, hard to scale

Page 10: Emerging Data Citation Infrastructure

Attribution
• Cite data as first class work
• Identify contributors to data

Discovery
• Associate a persistent id with a work
• Locate data via identifier
• Locate data integral to article
• Locate works related to data – articles, derivatives, sources

Persistence
• Reference exists as long as referring object
• Evidence persists as long as assertions based on evidence?
• Durability of data transparent?

Access
• Citation provides for mediated access
• Access to surrogate
• On-line access to object
• Machine understandability
• Long-term human understandability

Provenance
• Associate work with version of evidence used
• Verify fixity of information

Principles for Data Citation

Theory: Use Cases

Operational Constraints?
- Syntax
- Interoperability
- Technical contexts of use
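
The provenance use case "verify fixity of information" can be illustrated with a content checksum. This is a minimal sketch, assuming a plain SHA-256 digest over the raw bytes as a stand-in for a purpose-built verifier such as a UNF; the function names and the sample data are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a hex SHA-256 digest of the raw bytes (a stand-in verifier, not a UNF)."""
    return hashlib.sha256(data).hexdigest()


def verify_fixity(data: bytes, recorded: str) -> bool:
    """True if the data still matches the fingerprint recorded at citation time."""
    return fingerprint(data) == recorded


# Hypothetical dataset contents at the time the citation was created.
original = b"id,score\n1,10.3\n2,9.7\n"
recorded = fingerprint(original)  # stored alongside the citation

# The same dataset after one value was silently changed.
modified = b"id,score\n1,10.3\n2,9.8\n"

print(verify_fixity(original, recorded))  # unchanged bytes match the record
print(verify_fixity(modified, recorded))  # altered bytes do not
```

A byte-level digest like this one changes whenever the file format or encoding changes, even if the values do not, so it is only a rough approximation of a fingerprint computed over the data content itself.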

Page 11: Emerging Data Citation Infrastructure

Reference
• Formal syntax used within the text of a publication to denote a relationship to an external object. May contain additional information about the portion/subset of external object implicated. Also known as “in-text reference”, “pin-cite”.

“We applied contingency analysis to the greatest data ever. [Altman 2005]”

Citation
• Formal description of external object, used for location and attribution.

Micah Altman; Karin MacDonald; Michael P. McDonald, 2005, "Computer Use in Redistricting", hdl:1902.1/AMXGCNKCLU UNF:3:J0PkMygLPfIyT1E/8xO/EA== http://id.thedata.org/hdl%3A1902.1%2FAMXGCNKCLU

Citation Metadata
• Metadata that is systematically associated with citation through well-known public service, catalog, or protocol.

<component_list>
<component parent_relation="isPartOf">
<description><b>Figure 1:</b> This is the caption of the first figure...</description>
<format mime_type="image/jpeg">Web resolution image</format>

External Service
• Applications and services that consume, enhance, aggregate citation information.

Practice
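
The distinction between a reference and a citation can be sketched in code. This is a minimal illustration, assuming a hypothetical `Citation` record and a lookup table keyed by the in-text reference; the field values are copied from the example citation on this slide, and the rendering order is only one possible presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Formal description of an external object (field names are hypothetical)."""
    authors: tuple
    year: int
    title: str
    identifier: str  # persistent identifier, here a handle
    verifier: str    # content fingerprint, here a UNF

    def render(self) -> str:
        """Format the citation as a single human-readable string."""
        names = "; ".join(self.authors)
        return f'{names}, {self.year}, "{self.title}", {self.identifier} {self.verifier}'


# The in-text reference is only a key; the citation carries the description
# used for location and attribution.
bibliography = {
    "Altman 2005": Citation(
        authors=("Micah Altman", "Karin MacDonald", "Michael P. McDonald"),
        year=2005,
        title="Computer Use in Redistricting",
        identifier="hdl:1902.1/AMXGCNKCLU",
        verifier="UNF:3:J0PkMygLPfIyT1E/8xO/EA==",
    )
}

cited = bibliography["Altman 2005"]
print(cited.render())
```

Keeping the reference as a short key and the citation as a structured record is what lets external services index, enhance, or re-render the same citation without parsing free text.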

Page 12: Emerging Data Citation Infrastructure

Analysis Method

2 Workshops (70+ participants)
+ 1 Literature Review (400+ resources)
+ 2 Task Groups, NAS & CODATA (25+ members)
+ 60 Interviews
+ 7 authors

Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data

Page 13: Emerging Data Citation Infrastructure

Principles for Data Citation

- Separate
  - scientific principles
  - use cases
  - requirements

- Distinguish
  - syntax
  - semantics
  - presentation

- Design for
  - Ecosystem
  - Lifecycle
  - Stakeholders

- Implement
  - Incremental value for incremental effort
  - Think globally, act locally

Analysis Approach

Page 14: Emerging Data Citation Infrastructure

Principles for Data Citation

1. Status of Data: Data citations should be accorded the same importance in the scholarly record as the citation of other objects.
2. Attribution: Citations should facilitate giving scholarly credit and legal attribution to all parties responsible for those data.
3. Persistence: Citations should be as durable as the cited objects.
4. Access: Citations should facilitate access to data by humans and by machines.
5. Discovery: Citations should support the discovery of data and their documentation.
6. Provenance: Citations should facilitate the establishment of provenance of data.
7. Granularity: Citations should support the finest grained description necessary to identify the data.
8. Verifiability: Citations should contain information sufficient to identify the data unambiguously.
9. Metadata Standards: Citations should employ widely accepted metadata standards.
10. Flexibility: Citation methods should be sufficiently flexible to accommodate the variant practices among communities.

Data Citation Principles

Page 15: Emerging Data Citation Infrastructure

Principles for Data Citation

• Author.
  – The creator of the data set.
• Title.
  – As well as the name of the cited resource itself, this may also include the name of a facility and the titles of the top collection and main parent subcollection (if any) of which the data set is a part.
• Publisher.
  – The organization (or repository) either hosting the data or performing quality assurance.
• Publication date.
  – Whichever is later: the date the data set was made available, the date all quality assurance procedures were completed, or the date the embargo period (if applicable) expired. In other standards an “Access Date” field is used to document the date the data set was successfully accessed.
• Resource type.
  – Examples: “database” or “data set.”
• Edition.
  – The level or stage of processing of the data, indicating how raw or refined the data set is.
• Version.
  – A number increased when the data changes, as the result of adding more data points or rerunning a derivation process, for example.
• Feature name and URI.
  – The name of an ISO 19101:2002 “feature” (e.g., GridSeries, ProfileSeries) and the URI identifying its standard definition, used to pick out a subset of the data.
• Verifier.
  – Used to verify the identity of the content.
• Identifier.
  – A resolvable web identifier for the data, according to a persistent scheme. There are several types of persistent identifiers, but the scheme that is gaining the most traction is the Digital Object Identifier (DOI).
• Location.
  – A persistent URL or UNF from which the data set is available. Some identifier schemes provide these via an identifier resolver service.

Citation Metadata Elements
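
The element list above maps naturally onto a record type. This is a minimal sketch, assuming hypothetical field names and assuming, purely for illustration, that author, title, publisher, publication date, and identifier are treated as the required core; the slide itself does not prescribe which elements are mandatory.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CitationMetadata:
    """One field per citation metadata element (field names are hypothetical)."""
    author: Optional[str] = None
    title: Optional[str] = None
    publisher: Optional[str] = None
    publication_date: Optional[str] = None
    resource_type: Optional[str] = None
    edition: Optional[str] = None
    version: Optional[str] = None
    feature: Optional[str] = None
    verifier: Optional[str] = None
    identifier: Optional[str] = None
    location: Optional[str] = None


# Assumption for illustration only: which elements count as the core.
CORE = ("author", "title", "publisher", "publication_date", "identifier")


def missing_core(record: CitationMetadata) -> list:
    """Return the names of core elements that are empty or unset."""
    return [name for name in CORE if not getattr(record, name)]


def present_elements(record: CitationMetadata) -> list:
    """Return the names of all elements that carry a value, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name)]


record = CitationMetadata(
    author="Micah Altman; Karin MacDonald; Michael P. McDonald",
    title="Computer Use in Redistricting",
    publication_date="2005",
    identifier="hdl:1902.1/AMXGCNKCLU",
    verifier="UNF:3:J0PkMygLPfIyT1E/8xO/EA==",
)
print(missing_core(record))      # the publisher was never supplied
print(present_elements(record))
```

A check like this could run at submission time, which is one way to address the point that checking compliance manually is tedious and hard to scale.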

Page 16: Emerging Data Citation Infrastructure

Gaps

• Metadata/Structural
  – Granularity
  – Version Control
  – Microattribution
  – Contributor ID
  – Facilitation of reuse

• Practice
  – Author: use of citations to data
  – Journals: ad-hoc syntax and location
  – Infrastructure: failure to index citations and references to data, even when associated with DOI’s
  – Tools: support for datasets in reference managers, etc.

Page 17: Emerging Data Citation Infrastructure

Harmonizing Principles & Requirements

DataCite

• DOI
• Creator
• Title
• Publisher
• Publication Year

Digital Curation Centre

1. The citation itself must be able to identify uniquely the object cited, though different citations might use different methods or schemes to do so.
2. It must be able to identify subsets of the data as well as the whole dataset.
3. a. It must provide the reader with enough information to access the dataset;
   b. indeed, when expressed digitally it should provide a mechanism for accessing the dataset through the Web infrastructure.
4. a. It must be usable not only by humans but also by software tools, so that additional services may be built using these citations.
   b. In particular, there need to be services that use the citations in metrics to support the academic reward system, and services that can generate complete citations.

Force 11

• Data should be considered citable products of research.

• Such data should be held in persistent public repositories.

• If a publication is based on data not included with the article, those data should be cited in the publication.

• A data citation in a publication should resemble a bibliographic citation and be located in the publication’s reference list.

• Such a data citation should include a unique persistent identifier (a DataCite DOI recommended, or other persistent identifiers already in use within the community).

• The identifier should resolve to a page that either provides direct access to the data or information concerning its accessibility. Ideally, that landing page should be machine-actionable to promote interoperability of the data.

• If the data are available in different versions, the identifier should provide a method to access the previous or related versions.

• Data citation should facilitate attribution of credit to all contributors
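
The recommendation that a data citation carry a persistent identifier resolving to a landing page can be illustrated by turning a DOI into a resolver URL. This is a minimal sketch, assuming the public `https://doi.org/` resolver and a hypothetical helper name; no network request is made, and the DOIs are the ones listed in this talk's additional bibliography.

```python
from urllib.parse import quote


def doi_to_url(doi: str, resolver: str = "https://doi.org/") -> str:
    """Build a resolvable URL for a DOI given with or without a 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep '/' so the prefix/suffix structure stays readable; encode the rest.
    return resolver + quote(doi, safe="/")


print(doi_to_url("doi:10.1045/january2011-starr"))
print(doi_to_url("doi: 10.7287/peerj.preprints.1"))
```

Whether the resolved page then offers direct access, accessibility information, or links to earlier versions depends on the repository, not on the identifier syntax.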

Page 18: Emerging Data Citation Infrastructure

Current Infrastructure

FigShare
• Closed source
• No charge
• Archives data
• Supports DOI’s, ORCIDs
• Preserved in CLOCKSS

Data Citation Index
• Commercial Service (Thomson Reuters)
• Indexes many large repositories (e.g. Data-PASS)
• Beginning to extract citations from TR publications

Dataverse Network
• Open Source System
• Hubs run at Harvard, other universities
• Archives data
• Generates persistent identifiers (handles, DOI’s forthcoming)
• Generates resolvable citations
• Versioned
• Harvard Library Dataverse now part of DataCite, Data-PASS preservation network

DataCite
• DOI registry service (DOI provider)
• Data DOI metadata indexing service (parallel to CrossRef)
• Not-for-profit membership Organization
• Collaborating with ORCID-EU to embed ORCIDs

Page 19: Emerging Data Citation Infrastructure

Emerging Developments

Open Journal Data Publication
• Open source integration of PKP-OJS and Dataverse Network
• Uses SWORD
• Integrated data submission/citation/publication workflow for OJS open journals

Journal Developments
• NISO Recommendations on Supplementary Materials
• Sloan/ICPSR Data Citation Project
• Data-PASS Journal Outreach
• New journal types:
  – Registered Replication journals
  – Null results journals
  – Data journals/data papers

Page 20: Emerging Data Citation Infrastructure

Research Questions for Data Citation and Management

Page 21: Emerging Data Citation Infrastructure

Research Areas Building on Richer Citations

Page 22: Emerging Data Citation Infrastructure

Brightening the “Dark Matter” of Scholarly Communications

Researcher Identifiers: Developments, Opportunities & Challenges

Research & Node Layout: Kevin Boyack and Dick Klavans (mapofscience.com); Data: Thomson ISI; Graphics & Typography: W. Bradford Paley (didi.com/brad); Commissioned by Katy Börner (scimaps.org)

Seed Magazine, Mar 7, 2007 http://seedmagazine.com/content/article/scientific_method_relationships_among_scientific_paradigms/

• Bibliometric and network analysis are the “telescopes” for exploring the structure of science

• Researcher ID’s allow us to see more connections, more reliably

• Identifiers for datasets, etc. reveal the “dark matter” of science

Some potential questions:
• Are fields linked through evidence that are not linked through publications?
• How is the practice of science changing – are data scientists, statisticians, etc. making bigger contributions?
• What would be the results of:
  – Catalyzing new research collaborations among individuals, organizations?
  – Strengthening support for specific areas of interdisciplinary research?
  – Growing the evidence base in particular areas?

Questions about how network of contributors and outputs evolves over time
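
The first question on this slide, whether fields are linked through evidence but not through publications, can be phrased as a small graph computation. This is a minimal sketch over invented toy data; the article names, fields, and dataset identifiers are all hypothetical, and a real analysis would draw on citation indexes.

```python
from itertools import combinations

# Toy data (entirely hypothetical): each article has a field,
# a set of cited articles, and a set of cited datasets.
field_of = {"a1": "ecology", "a2": "economics", "a3": "ecology", "a4": "sociology"}
cites_articles = {"a1": {"a3"}, "a2": set(), "a3": set(), "a4": {"a2"}}
cites_data = {"a1": {"d1"}, "a2": {"d1"}, "a3": set(), "a4": {"d2"}}


def field_pair(x: str, y: str) -> frozenset:
    """Unordered pair of the fields of two articles."""
    return frozenset((field_of[x], field_of[y]))


# Field pairs connected by an article-to-article citation.
by_publication = {
    field_pair(src, dst)
    for src, targets in cites_articles.items()
    for dst in targets
    if field_of[src] != field_of[dst]
}

# Field pairs connected because two articles cite the same dataset.
by_evidence = {
    field_pair(x, y)
    for x, y in combinations(sorted(field_of), 2)
    if field_of[x] != field_of[y] and cites_data[x] & cites_data[y]
}

# Links that only become visible once data citations are indexed.
hidden = by_evidence - by_publication
print(sorted(sorted(pair) for pair in hidden))
```

In this toy example the ecology and economics articles never cite each other, yet they share a dataset, which is the kind of connection that stays dark when only article citations are indexed.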

Page 23: Emerging Data Citation Infrastructure

Additional Bibliography (Selected)

• Starr, J., & Gastl, A. (2011). IsCitedBy: A metadata scheme for DataCite. D-Lib Magazine, 17(1/2). doi:10.1045/january2011-starr

• Piwowar, H., Vision, T.J. (2013). Data reuse and the open data citation advantage. PeerJ PrePrints. 1:e1v1. doi: 10.7287/peerj.preprints.1

• Cronin, B. (1984). The citation process: The role and significance of citations in scientific publication. London, United Kingdom: Taylor Graham.

• Van Leunen, M. (1992). A handbook for scholars. New York, NY: Oxford University Press.

Page 24: Emerging Data Citation Infrastructure

Questions?

E-mail: [email protected]
Web: micahaltman.com
Twitter: @drmaltman
