
Giving Researchers Credit for their Data: Phase 1 Report · 2015-12-04



Giving Researchers Credit for their Data: Phase 1 Report Neil Jefferies (Bodleian Libraries), Thomas Ingraham (F1000Research), Fiona Murphy

“Researchers have too much to do, and too much of their work goes unrecognised. Tools like this will help to take the friction out of their daily lives and help them to share, improve and get credit for their work. It blends the best of recent innovations with ‘traditional’, collegiate academic culture.” Josh Brown, ORCID

Project Rationale

Institutions are increasingly developing IRs (institutional repositories) to hold the data outputs of their researchers, helping to reduce the individual burden of data archiving. However, only a subset of the data produced is associated with publications and thus reliably archived, and much important data is never published, shared or re-used. This represents a loss of scientific knowledge, may lead to the repetition of research and wastes public money.


To incentivise researchers to share unpublished data, we aim to develop a simple ‘one-click’ process whereby data, metadata and methods details are transferred from an IR to a relevant publisher platform for publication as a data paper. This can be peer reviewed, indexed to increase visibility, and recognised by the community as a formal research output. Subsequently, details of the paper can be fed back to the IR to enrich the original record.

The project builds on preliminary work undertaken by the WDS-RDA Publishing Data Workflows Working Group1, which aims to understand, standardise and evolve recommendations for best practice in research data publishing workflows. In Figure 1, the link between a Research Data Repository and a Research Data Journal is shown as a bold black line. In essence, however, the workflow is likely to be sufficiently generic that it could be repurposed to link any repository and journal publishing system.

For the WDS-RDA WG project, a set of repository and journal workflows was compiled and analysed. There was a broad disciplinary spread, including major players in the data publishing world and organizations dealing with “long tail” data. The goal was to understand the key components of the workflows, the parties responsible for specific workflow steps, and the gaps or barriers to realising benefits and efficiencies.

The project found that many repository and journal providers look beyond the workflows that gather information about the research data, and also want to make this information visible to other information providers in the field. This can add value to the data being published. If the information is exposed in a standardised fashion, data can be indexed and made discoverable by third-party providers, e.g. data aggregators (see “services” box in Figure 2).
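As a rough illustration only, the one-click workflow described above reduces to three steps: export from the IR, publication as a data paper, and feedback to the original record. The following sketch uses entirely hypothetical function names, field names and identifiers; the real interfaces would be defined in later phases of the project.

```python
# Hypothetical end-to-end sketch of the proposed one-click flow.
# Nothing here is an agreed specification: the repository exports a
# package, the publisher reviews and publishes it, and publication
# details are fed back to enrich the original repository record.
def export_from_repository(dataset_uri, metadata):
    """Step 1: the repository builds the outgoing data package."""
    return {"dataset_uri": dataset_uri, "metadata": metadata}

def publish_as_data_paper(package):
    """Step 2: the publisher peer-reviews and, if accepted, publishes."""
    return {"article_doi": "10.0000/example.1",   # hypothetical DOI
            "source_dataset": package["dataset_uri"]}

def enrich_repository_record(record, publication):
    """Step 3: details of the paper are fed back to the IR."""
    record = dict(record)
    record.setdefault("related_publications", []).append(publication["article_doi"])
    return record

package = export_from_repository("https://example.org/dataset/1", {"title": "Example"})
paper = publish_as_data_paper(package)
record = enrich_repository_record({"dataset_uri": package["dataset_uri"]}, paper)
```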
Furthermore, the WDS-RDA group concluded that such data aggregators often work beyond the original data provider’s subject or institutional focus. Some data providers enrich their metadata (e.g., with data-publication links, keywords or more granular subject matter) to enable better cross-disciplinary retrieval. Ideally, information about how others download or use the data would be fed back to the original data providers. In addition, services such as ORCID2 are being integrated to allow researchers to connect their materials across platforms. This gives more visibility to the data through the different registries, and allows for global author disambiguation. The latter is particularly important for establishing author metrics. During the investigation process, many data repository and data journal providers expressed an interest in new metrics for datasets and related objects. Tracking usage, impact and reuse of the materials shared can enrich the content on the original platforms and help in engaging users in further data sharing or

1 Bloom et al: “Workflows for Research Data Publishing: Models and Key Components”, International Journal on Digital Libraries (2015, submitted)

2 http://orcid.org/


curation activities. Furthermore, such information is certainly of interest to the funders3 of the respective infrastructures, and also to funders of the research itself.

The WDS-RDA group concluded that workflows to expose data publishing content to other providers in a standardised and strategic manner are crucial to enable the discoverability of data. Presently, data reuse is hampered by the limited discoverability of datasets. Projects addressing this important issue could serve as an accelerator for a paradigm shift towards establishing data publishing within the communities. Although visibility services have been included in the add-ons, given the importance of discoverability in facilitating future reuse, the WDS-RDA group argued that they should be incorporated into the basic service set of modern, trusted data publishing; indeed, the workflows in Figure 1 show a link from Data Publication to Services.

The WDS-RDA group also asserted that all the components of a data publishing system need to work seamlessly in an integrated environment. They therefore advocated that the implementation of existing standards be re-emphasised, and that new standards be developed where necessary, for repositories and all parts of the data publishing process. Trusted data publishing should be embedded in documented workflows. This helps to establish collaborations with potential partners and gives guidance to researchers, enabling and encouraging the deposit of reusable research data that will be persistent while preserving provenance.

A major gap exposed by the WDS-RDA survey was the number of bespoke, often manual and journal-specific solutions that have evolved to produce an output comparable with the final data paper this project aims to produce. These point-to-point solutions demonstrate that publishers perceive a need for this article type. In practical terms, however, the landscape

3 Funders have an interest in tracking Return on Investment, to assess which researchers/projects/fields are effective and whether proposed new projects consist of new or repeated work.


remains unhelpfully disparate for researchers in terms of the types and amounts of information requested, as well as the need to input the same, or very similar, fields each time they wish to expose their dataset. Given the range of platforms, formats and workflows, the barriers for researchers are ultimately too high for many. By contrast, the Jisc ‘Giving Researchers Credit for their Data’ project has the potential to deliver a quick, accurate product that can be trusted, measured and added to a researcher’s list of outputs with minimal difficulty or repetition of effort. In short, this represents an opportunity to build a scalable model for the future. Working with ORCID has also been revealed as key, both in terms of lowering the bar to entry through reduced workload and because ORCID itself enables additional exposure of research outputs, thus enabling researchers to build their impact profiles.

Overall Methodology

The project is split into three phases to match the operation of the Jisc Data Spring programme.

Phase 1: Feasibility Study. Analyse publisher workflows and repository functions and perform a simple gap analysis. Develop a simple straw-man workflow between an IR and a publisher that bridges this gap, enabling researchers to easily publish datasets as peer-reviewed data articles. Survey the community of publishers and repositories to establish demand and interest in such a workflow.

Phase 2: Definition. Formalise these processes as an API definition (or extensions to existing APIs, e.g. SWORD) that has the potential to be used by other IR platforms (Fedora, DSpace, EPrints etc.) and publisher/editorial systems. We will publish a draft API and then hold a workshop open to all stakeholders, and seek input from the Jisc Research Information Management Group.

Phase 3: Reference Implementation. Develop a simple demonstration implementation between a data repository (such as the Oxford University Research Data Archive) and a data paper publisher (F1000Research). This would demonstrate the benefits, provide a reference example for other instances and, through the use of test-driven development, provide a reusable test suite. It is anticipated that the implementation would take the form of a bridging application that could be deployed “in the cloud”, supporting many-to-many transactions.

Gap Analysis

Within the UK, DataCite has reasonable traction with data repositories and, more widely, the thinking behind DataCite is influential even where DataCite DOIs have not been adopted. As such, it is not unreasonable to assume that data repositories can provide information that complies with the DataCite Metadata Kernel. This will be verified as part of the survey process.
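To make that assumption concrete, the following sketch lists the mandatory fields of the DataCite Metadata Kernel that a repository would need to supply. The dict layout and example values are illustrative assumptions only; the canonical serialisation is DataCite's own XML schema.

```python
# Sketch of a minimal DataCite-style metadata record as a plain Python
# dict. The field names follow the DataCite Metadata Kernel's mandatory
# properties; the dict layout and example values are assumptions for
# illustration, not DataCite's actual serialisation format.
datacite_minimal = {
    "identifier": {"value": "10.5072/example.1234", "type": "DOI"},  # hypothetical DOI
    "creators": [{"name": "Author, Example A."}],
    "title": "Example dataset title",
    "publisher": "Example Institutional Repository",
    "publicationYear": "2015",
    "resourceType": "Dataset",  # recommended (not mandatory) in the kernel
}

def has_mandatory_fields(record):
    """Check that all mandatory kernel properties are present."""
    mandatory = {"identifier", "creators", "title", "publisher", "publicationYear"}
    return mandatory <= set(record)
```

A repository-side check of this kind could form part of the survey-verification step mentioned above.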


The WDS-RDA WG spreadsheet of publisher workflow behaviours was invaluable at this stage of the project.4 There were a number of examples of point solutions linking specific repositories and publishers, but no other work on a generic pattern. As a general rule, however, the inbound metadata requirements of publisher workflows were observed to be quite low.

A key assumption for such a generic API-based approach is that the source repository will already hold the necessary and sufficient domain- and experiment-specific metadata, which can be passed directly to the publisher (and associated peer reviewers) without further intervention. The API is aimed purely at addressing the metadata issues associated with the mechanics of publishing itself.

Proposed API/Application Architecture

The SWORD V2 protocol/format provides a ready basis for the transfer of repository objects although, conceivably, other protocols such as OAI-ORE, as demonstrated at Open Repositories 2010, can be used for a similar purpose. The determination of the most suitable

4 The spreadsheet can be found on Zenodo: “WDS-RDA Publishing Data Workflows Working Group Analysis sheet”, Murphy et al, http://dx.doi.org/10.5281/zenodo.19107


mechanism, and the formal specification thereof, will be carried out in the second phase of the project.

The repository is assumed to be responsible for the quality and integrity of the content that it holds, as these are likely to entail a degree of domain and/or institutional specificity. Certification such as the Data Seal of Approval would be recommended, but is outside the scope of the current specification, although the presence of such certification may be a consideration during the peer review process.

In response to clicking on a button/link, the repository will produce a package that includes the dataset of interest, DataCite-compatible metadata and additional domain/repository-specific metadata. A persistent URI for the dataset would be required whether or not DataCite is used. This initial package will be passed to a helper app which resides in the “cloud” between the repository and publisher. The precise location of the app depends on the business/service model that will be developed as part of Phase 3. The app allows the submitter to perform a small number of simple operations that augment the automatically produced data package to enable it to feed into a publisher submission workflow. These could include:

i. Amending the dataset title/description to better suit a data paper (especially if multiple papers are produced from one dataset)

ii. Amending the author list (e.g. adding authors who are not listed as creators of the dataset per se). This could usefully leverage ORCID functionality.

iii. Attaching the textual content of the data paper (e.g. from Google Docs)

iv. Attaching references for the paper (e.g. from Zotero)

v. Attaching links to other relevant datasets, in the case of a cross-corpus publication

a. These would have the same requirements for persistent URIs as the primary dataset

b. The repositories holding these datasets would also receive publication notification, as detailed below.
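A minimal sketch of this augmentation step, treating the package as a plain dict. The function name, argument names and package fields are all hypothetical; the real format would be fixed by the Phase 2 API specification.

```python
# Hypothetical sketch of the helper-app augmentation step. The package
# structure and all field names are illustrative assumptions.
def augment_package(package, *, title=None, extra_authors=(),
                    paper_files=(), references=(), linked_datasets=()):
    """Apply the optional author-driven edits before submission."""
    pkg = dict(package)
    if title:                                    # (i) retitle for the data paper
        pkg["article_title"] = title
    # (ii) authors default to the dataset creators, plus any additions
    pkg["authors"] = list(pkg.get("creators", [])) + list(extra_authors)
    pkg["files"] = list(paper_files)             # (iii) textual content of the paper
    pkg["references"] = list(references)         # (iv) e.g. exported from Zotero
    pkg["linked_datasets"] = list(linked_datasets)  # (v) each needs a persistent URI
    return pkg

submission = augment_package(
    {"dataset_uri": "https://example.org/dataset/1", "creators": ["A. Author"]},
    title="A data paper title",
    extra_authors=["B. Collaborator"],
)
```

The untouched repository fields (here `dataset_uri`) pass through unchanged, which matches the pass-through behaviour described for funding details later in this report.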

The updated package would then be fed (ideally using the same protocol/format as the repository-to-app transaction) into the publisher's submission system. Using ORCIDs would ensure that name conventions, for example, should not be an issue for the incoming data package. The publisher would then operate their standard process for review up until the point of final acceptance for publication. At this point a much simpler data package would be returned to the source repository, containing details of the data paper (DOI, persistent URI etc.) and simple metadata, along with the original dataset identifiers, so that the repository could record the relationship. This same data is likely to be passed to ORCID and potentially the Jisc Journal Router service.

At this stage, the API would not explicitly aim to address the issue of version updates to datasets and their impact on the corresponding data papers. Establishing and recording the


linkage between datasets and papers provides an enabling mechanism, but versioning raises a number of procedural issues that are not strictly soluble by purely technical means.

Feasibility Study

Two surveys, one aimed at repositories and the other at publishers, were created using SurveyMonkey (see Appendix for full questions and results) and sent out via various repository and data paper related working group mailing lists, as well as personal contacts. These surveys aimed to assess the potential uptake of the tool, the variety of platforms used, and what information could or could not be provided by the relevant party. Fifty respondents completed the surveys, with a greater proportion of responses coming from repositories (34) than from publishers (16). We received additional responses directly via email.

A primary deliverable for Phase 1 was to assess potential uptake of the tool. Almost every respondent stated that the proposed functionality would be of interest to their organization (45/49, with the remainder skipping the question). Eight of the publisher respondents already had at least one publication in mind that could be linked with this project. Only two publishers indicated (one via the survey, the other via email) that their submission systems were currently manual and could not implement the API + helper app.

Publisher and repository platforms were diverse, with almost half using custom systems, which lends additional support to the potential utility of developing a single API/protocol-based, many-to-many solution, rather than relying on multiple point-to-point solutions. Some existing point-to-point solutions were identified through the survey process, but no similar attempts to define a standardised API/protocol emerged. These findings, together with the number of prominent organizations that have offered to collaborate on this project (see below), indicate the demand from the community for implementing such a tool.

Repositories Survey Results

“This project comes at a time when researcher workflows for data are in urgent need of improvement and formalisation. A huge driver of such workflows is simplicity. By taking advantage of open APIs and building crucial middleware, this project will encourage reproducibility and transparency in research going forward.” Mark Hahnel, Figshare

Overall Reaction (Q1, Q2, Q4, Q8)

Of the 34 repositories that responded to the survey, 31 indicated that they were interested in the functionality presented in the questionnaire, with 3 respondents skipping the question. Institutional, subject and commercial repositories were all represented, with significant international interest too. 27 respondents indicated that they would like to be kept informed of further developments and provided contact details (email).


Repository Platforms (Q3)

There was significant heterogeneity in the platforms used by data repositories, perhaps indicative of the relative immaturity of the application space: EPrints, DSpace and Fedora were the most frequently used platforms, but combined they only constituted about half of the total, with most of the remainder using custom platforms.

Ability to Provide Metadata (Q5)

The main issue encountered with the metadata requested from repositories concerned author contact details (emails), which were either not collected or only collected for the depositing agent. A lesser concern was providing standardised name forms for publishing and citation. This suggests that there is utility in leveraging a service such as ORCID, which provides a simple mechanism for overcoming both issues.

Metadata Desiderata (Q6)

This question provoked a number of responses that led to modifications to the originally proposed functionality: in particular, the need to link in additional datasets/resources for broader studies, and the consequent need to potentially notify multiple repositories of successful publication. A more complex issue was raised around the rights/licences surrounding the data and/or the publication, which would need to be addressed. At the moment, data journals are open access, which simplifies this issue somewhat; however, this is not guaranteed to be the case in the future, so suitable mechanisms will need to be incorporated into the workflow. A number of respondents also noted that funding details would need to be transmitted although, in practice, the app would be expected to pass through this information unchanged and so would


not impinge on the workflow except, as noted previously, where funding has rights/licence requirements attached.

Feedback & Concerns (Q7)

Most of the concerns were around curation and review of the data. These should, ideally, be the responsibility of the repository and publishers respectively; they remain necessary even in the absence of this project, which facilitates but does not obviate the need for these processes. How institutions deal with these issues when third-party repositories are involved is outside the scope of this project.

Publisher Survey Results

“Data publishing workflows need a better integration of articles and data. Projects (as the one presented here) that facilitate better interoperability based on persistent identifiers build trust

and help engaging the research communities in data sharing - a crucial step to foster Open Science.” Sünje Dallmeier-Tiessen, Research Data Alliance

Overall Reaction

Of the 15 publishers that answered the question, all stated their interest in the proposed tool (1 skipped the question). Further, the majority of publishers stated that the tool would fit into their workflows: 10 responded ‘yes’ to this question and, of the two that answered ‘no’, one concern (regarding the type of editorial feedback provided by their journal) was outside the scope of the project, and the other, a request for integration with ScholarOne and Editorial Manager, is in fact an aim of this project (see Q6 in the Appendix). Four skipped the question.

Publisher Platforms


Publisher submission systems showed a similar degree of diversity to repository platforms. As expected, ScholarOne, Editorial Manager and OJS were the most commonly used (by 7 publishers), but almost half (5) used their own custom, journal- or publisher-specific submission systems, including Elsevier’s EES/EVIS system (3 skipped the question).

Specific publisher comments and concerns

Three pertinent comments were raised in this survey:

i. Big data: data papers have particular value in documenting very large datasets, but transferring such data via the API/protocol would not be feasible. The easiest solution would be for the link to the dataset, rather than the dataset itself, to be transferred to the helper app and then on to the publisher, in cases where the data exceeds a defined threshold.

ii. Restricted access/embargoed datasets: Although this tool would not be suitable for, or of interest to, the small number of repositories that store sensitive or identifying data, researchers may wish to embargo their data before the publication of their data paper. Many repositories support data embargoing and we would expect the repository to include this metadata, where relevant, in the package they send.

iii. Multiple datasets in multiple repositories: Some publishers stated that they often publish data papers containing data deposited in multiple repositories, and queried how the tool would handle such cases. This could easily be achieved by adding fields into the helper app where users can add linked identifiers to other datasets that they wish to include. In fact, this solution brings the additional advantage of enabling researchers to submit their data and metadata directly via the tool, rather than submitting via the data repository.
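The size-threshold rule suggested in point (i) above could be sketched as follows. The threshold value, function name and field names are illustrative assumptions, not part of any agreed specification.

```python
# Sketch of the big-data rule: below a size threshold the dataset
# travels inside the package; above it (or when no payload is supplied)
# only its persistent link does. All names and values are illustrative.
SIZE_THRESHOLD_BYTES = 2 * 1024**3  # e.g. 2 GiB; to be agreed with publishers

def attach_dataset(package, dataset_uri, size_bytes, payload=None):
    """Attach a dataset to the package inline, or by reference if too large."""
    pkg = dict(package)
    if size_bytes > SIZE_THRESHOLD_BYTES or payload is None:
        pkg["dataset"] = {"by_reference": dataset_uri}        # link only
    else:
        pkg["dataset"] = {"uri": dataset_uri, "payload": payload}  # inline copy
    return pkg

small = attach_dataset({}, "https://example.org/dataset/2", 1024, payload=b"...")
big = attach_dataset({}, "https://example.org/dataset/1", 3 * 1024**3)
```

Because the package always carries a persistent URI either way, the publisher-side workflow need not change depending on which branch was taken.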

Broader Feedback

In addition to the survey responses, we had a number of expressions of interest in further collaboration in this area:

• THOR: service transition and sustainability (mostly Phase 3)

• ORCID: enabling technologies, with particular emphasis on sign-on methodology

• Elsevier: publishes a data journal, “Data in Brief”, and has been working on interlinking articles and data for some time. Would like to contribute expertise, share experience, and help progress the thinking on interoperability between repositories and publishers.

• Resource Identification Initiative: support in providing a use case (for permanent identifiers, changing practice and policy), engagement and roll-out, plus aligning technologies if resources allow

• Open Science Framework, Center for Open Science: volunteered to be involved in the Phase 3 pilot


• Pensoft: has already developed an equivalent point-to-point solution for creating data papers from GBIF-indexed data and metadata (EML and Darwin Core) and is interested in sharing experience

• The Research Data Alliance, World Data System and Kudos have all expressed interest in the project and a wish to be kept informed of progress.

Next steps

Pending success in the next funding round, the next steps would be as follows:

i. Assemble the core team

a. As well as the current project leads (Thomas Ingraham and Neil Jefferies), to include a p/t Project Manager, partly funded developers based at the repository and publisher ends, and a lead developer working on the Helper App itself (fully funded but likely not f/t)

ii. Build partner participation: includes Elsevier, ORCID, Pensoft, RRID, OUP, F1000R, Figshare, Ubiquity, THOR, DataCite, RDA, OJS, plus other Jisc Research Data Spring projects and the Jisc Journal Data Policy Registry Project

iii. Develop a detailed spec

a. Platform

b. Three distinct pieces of work: the Helper App, plus an interface on either side to enable integration with the repository at one end and the publisher at the other

iv. Build presence to engage and leverage community support. Keep a clear, practical focus and encourage new partners to emerge. Use GitHub to enable contributions (coding and critiques) from others

a. Jisc wiki

b. blog/listserv

c. GitHub (+ code archiving in Zenodo)

d. Research Data Alliance meeting, Paris, September. This coincides with ORCID and other key partner meetings


Appendix A. Surveys

Both surveys were introduced with the following text:

JISC is supporting our group to research the case for and develop an API and ‘helper app’ that allows researchers who have deposited data in a repository to automatically submit a ‘data article’ to a journal at the click of a button.

The proposed API will transfer data and metadata from the repository to a journal’s submission system, with an intermediary ‘helper app’ allowing authors to add their manuscript, references and other necessary information before submission. Article metadata will be sent back to the repository upon publication. These tools will be designed for maximum interoperability and made freely available.

This initiative will make it much easier for authors to obtain credit, visibility and peer review for their data via data article publication. This will have knock-on benefits for other stakeholders: journals will receive more submissions and link referrals, repositories will gain more deposits and more complete metadata, and readers will have better methodological detail to help them reuse or replicate data.

At this stage of the project we want to:

• Gauge the level of interest and potential uptake from key stakeholders.

• Assess what information publishers need for data article publication and what repositories can provide, to determine what fields should be included in the helper app.

• Gather general feedback.

We would very much appreciate 10 minutes of your time to respond to these questions:


Questions For Repositories

Giving Researchers Credit for their Data: Repositories

1. What is your organisation's name and home URL?

2. What is the best contact name, job title and email address?

3. What platform does your repository use?

4. Is this functionality of interest to your organisation? Yes/No/Any comments

5. Please take a quick look at this list of publisher's needs:

1. What Repository provides to Helper App (outgoing data package):

a. Repository name

b. Dataset title

c. Dataset legend/description


d. Dataset permanent identifier (preferably DataCite DOI)

e. Dataset citation (URI/URL derived from above)

f. Dataset Author/Creator names (First, last - separate fields) NB: It is anticipated that ORCIDs will become more prevalent in the near future

g. Author emails (matching corresponding author entry)

2. Helper App needs to add to the data package en route to Publisher:

a. Article type (automatically assigned as "Data Paper")

b. Article title (default to Dataset Title)

c. Authors (checkboxes next to dataset authors imported from repository).

i. Perhaps the ability to add new authors/ORCIDs/emails

d. File uploads (textual content of the paper)

e. ‘Notes to Editorial Team’ (free text field)

f. Licence agreements (checkboxes/select from a list?)

g. Declarations (checkboxes, e.g. that the article hasn't been submitted elsewhere)

3. Information package that Publisher gives Repositories:

a. Article DOI

b. Article title

c. Publication date

Is there any information listed here that your organisation could not automatically provide? Yes/No/If 'yes', please add comments

6. Does the document include all the information you would wish to provide and receive? Yes/No/If no, please give details

7. Does any aspect of this project concern you, e.g. data quality, curation, review? Yes/No/If so, please explain:

8. Would you like to join our listserv for further information and/or involvement? Yes/No/If 'yes', then please give your email address:

9. Are there any other repositories or projects that you would recommend we contact? If so, please give (contact) details:


Questions for Publishers

1. Is this functionality of interest to your organisation?

Yes/No/Comments

2. Is there a specific publication that you would be interested in linking with this project? Yes/No/If 'yes', please specify the publication's title and/or URL

3. The diagram below shows the schema for the proposed API and helper app. Does the schematic express the relevant workflow accurately with respect to your title? Yes/No/Comments

Giving Researchers Credit for their Data: Publishers

4. Please see the publishers' needs document below:

1. What Repository provides to Helper App (outgoing data package):

a. Repository name

b. Dataset title

c. Dataset legend/description

d. Dataset permanent identifier (preferably DataCite DOI)


e. Dataset citation (URI/URL derived from above)

f. Dataset Author/Creator names (First, last - separate fields) NB: It is anticipated that ORCIDs will become more prevalent in the near future

g. Author emails (matching corresponding author entry)

2. Helper App needs to add to the data package en route to Publisher:

a. Article type (automatically assigned as "Data Paper")

b. Article title (default to Dataset Title)

c. Authors (checkboxes next to dataset authors imported from repository).

i. Perhaps the ability to add new authors/ORCIDs/emails

d. File uploads (textual content of the paper)

e. ‘Notes to Editorial Team’ (free text field)

f. Licence agreements (checkboxes/select from a list?)

g. Declarations (checkboxes, e.g. that the article hasn't been submitted elsewhere)

3. Information package that Publisher gives Repositories:

a. Article DOI

b. Article title

c. Publication date

Does the publishers' needs document include all the information requirements you would wish to see? Yes/No/Comments

5. Does any aspect of this project concern you, e.g. data quality, curation, review? Yes/No/If so, please explain:

6. What editorial submission system do you use?

a. ScholarOne

b. Manuscripts

c. Editorial Manager

d. Open Journal Systems

e. Other (please specify)


7. Would you like to join our listserv for further information and/or involvement?

Yes/No/If 'yes', then please give your email address:

8. Are there any other organisations that you would recommend we contact? If so, please give (contact) details:


Appendix C. Survey Circulation:

■ Research Data Alliance Publishing Data Interest Group (includes 4 Working Groups - Workflows, Bibliometrics, Services, Cost Recovery)

■ World Data System mailing list (included a blogpost)
■ Earth System Science Informatics
■ Force11
■ JISC Data Publication
■ JISC Repositories
■ Twitter feeds: retweets from Kudos, Oliver Clements, Cyndy Chandler, Anne Horn, Tom Demeranville, West Coast Librarian, Ann L Starkey, Sharon Lawler, ALPSP, ORCID, Richard Ackerman, ESIP Federation, World Data System, Research Data Alliance, Kooiti Masuda, DaRa Info, Laura N Campbell, Yvonne Nobis, Robin Rice
■ STM Research Data Group
■ Personal contacts