Processes and Procedures for Data Publication: a case study in the Geosciences

Sarah Callaghan, Fiona Murphy, Jonathan Tedds, Rob Allan, John Kunze, Rebecca Lawrence, Matthew S. Mayernik, Angus Whyte, and the PREPARDE project team

#preparde | [email protected] | @sorcha_ni



We’ve come a long way

• Data is the foundation of science – without it we can’t test our assertions or reproduce our results
• The Internet allows us to share and publish our data quickly and easily
• But there are still serious problems to address:
  – Persistence
  – Data and metadata quality
  – Attribution and credit
  – … and many more

Engraving of a printer using the early Gutenberg letterpress during the 15th century. Date unknown; estimated 16th–19th century. http://commons.wikimedia.org/wiki/File:Gutenberg_press.jpg

Why do we want to cite and publish data?

• Pressure from the UK government to make data from publicly funded research available for free
• Scientists want attribution and credit for their work
• The public want to know what scientists are doing
• Research funders want reassurance that they’re getting value for money
• Relies on peer review of science publications (well established) and of data (not done yet!)
• Allows the wider research community to find and use datasets, and understand the quality of the data
• Extra incentive for scientists to submit their data to data centres in appropriate formats and with full metadata

http://www.evidencebased-management.com/blog/2011/11/04/new-evidence-on-big-bonuses/

PREPARDE: Peer REview for Publication & Accreditation of Research Data in the Earth sciences

• Lead Institution: University of Leicester
• Partners:
  – British Atmospheric Data Centre (BADC)
  – US National Center for Atmospheric Research (NCAR)
  – California Digital Library (CDL)
  – Digital Curation Centre (DCC)
  – University of Reading
  – Wiley-Blackwell
  – Faculty of 1000 Ltd
• Project Lead: Dr Jonathan Tedds (University of Leicester, [email protected])
• Project Manager: Dr Sarah Callaghan (BADC, [email protected])
• Length of Project: 12 months
• Project Start Date: 1st July 2012
• Project End Date: 30th June 2013

• Partnership formed between the Royal Meteorological Society and academic publisher Wiley-Blackwell to develop a mechanism for the formal publication of data in the Open Access Geoscience Data Journal

• GDJ publishes short data articles cross-linked to, and citing, datasets that have been deposited in approved data centres and awarded DOIs (or other permanent identifiers).

• A data article describes a dataset, giving details of its collection, processing, software, file formats, etc., without the requirement of novel analyses or ground-breaking conclusions.

• In short: the when, how and why the data were collected, and what the data product is.

Geoscience Data Journal, Wiley-Blackwell and the Royal Meteorological Society

The traditional online journal model:
1) Author prepares the paper using word processing software with the journal template.
2) Author submits the paper as a PDF/Word file.
3) Reviewer reviews the PDF file against the journal’s acceptance criteria.

The overlay journal model for publishing data:
1) Author prepares the data paper using word processing software, and the dataset using appropriate tools.
2a) Author submits the data paper to the data journal (e.g. Geoscience Data Journal).
2b) Author submits the dataset to a repository (e.g. BADC, BODC).
3) Reviewer reviews the data paper, and the dataset it points to, against the journal’s acceptance criteria.

Data paper mock-up

The dataset citation is the first thing in the paper and is also included in the reference list (to take advantage of citation count systems).
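A dataset citation of the kind shown in the mock-up can be assembled from the dataset’s metadata. The sketch below follows the common “Creator (Year): Title. Version. Publisher. Identifier” pattern; the helper name, the dataset title and the DOI are hypothetical, not taken from the slides.

```python
def format_data_citation(creators, year, title, version, publisher, doi):
    """Build a dataset citation string in the
    'Creator (Year): Title. Version. Publisher. Identifier' pattern.
    A sketch only; journal and repository house styles vary."""
    author_part = "; ".join(creators)
    return (f"{author_part} ({year}): {title}. Version {version}. "
            f"{publisher}. https://doi.org/{doi}")

citation = format_data_citation(
    creators=["Callaghan, S.", "Murphy, F."],
    year=2012,
    title="Example rainfall radar dataset",   # hypothetical dataset
    version="1.0",
    publisher="British Atmospheric Data Centre",
    doi="10.5285/EXAMPLE",                    # hypothetical DOI
)
print(citation)
```

Putting the resolvable https://doi.org/… form in the reference list is what lets citation-count systems pick the dataset up alongside conventional papers.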

Example steps/workflow required for a researcher to publish a data paper

PREPARDE topics: 3 main areas of interest
1. Workflows and cross-linking between journal and repository
2. Repository accreditation
3. Scientific peer-review of data

• Division of areas of responsibility between:
  – repository-controlled processes
  – journal-controlled processes

Data repository workflows
• Data centre and journal workflows captured
• Workflows are very varied! No one-size-fits-all method
• Can have multiple workflows in the same data centre, depending on interactions with external sources (“Engaged submitter” / “Data dumper” / “Third party requester”)

Repository Workflow – NCAR Computational & Information Systems Lab Research Data Archive (RDA)

The RDA ingest workflow (shown as a flowchart in the original slide):

• Data Preparation:
  – Automated file collection
  – Check integrity of file receipts
  – Compare bytes and checksums (if available) with original data providers; if not OK, contact the data provider
• Data Ingest, then Processing:
  – Validate files – using software, read the full content of every file
  – Pull out metadata
  – Identify errors and metadata holes; if errors are found, check with the data provider for changes to files
  – Do time-series checks
  – Check metadata against internal standard/expectation
  – If necessary, filter data or fix metadata
• Populate the Metadata Database:
  – Spatial info, temporal info, Global Change Master Directory (GCMD) keywords, parameters, format table relationships
• Archive (tape-based), with remote backup and embargo where required
• Notification to provider/user community
• Distribute metadata to GCMD, NCAR CDP and the BADC
• Publish metadata via user GUIs and OAI-PMH; keep the most-demanded data online (access development phase)
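The “compare bytes and checksums with original data providers” step in the RDA workflow can be sketched with the standard library. The function name and manifest fields (size plus MD5) are assumptions for illustration; real archives use whatever manifest format and hash algorithm the provider supplies.

```python
import hashlib
import os

def verify_receipt(path, expected_size, expected_md5):
    """Check a received file against the provider's byte count and MD5.
    Returns (ok, problems). A minimal sketch of an RDA-style
    integrity check on file receipt."""
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        problems.append(f"size mismatch: {actual_size} != {expected_size}")
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large archive files don't fill memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        problems.append("checksum mismatch")
    return (not problems, problems)
```

Files that fail either check would trigger the “contact data provider” branch of the flowchart rather than proceeding to ingest.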

Journal workflow
• Work on comparisons and identification of cross-linking points between journal and data centre is continuing.
• Aim is to minimise the effort needed to submit a data paper by taking advantage of already-submitted metadata.
• Workshop planned for 2013 – if interested please email [email protected]
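One concrete route to “taking advantage of already-submitted metadata” is harvesting it from the repository over OAI-PMH, which the RDA workflow already exposes. The sketch below parses an abbreviated, hypothetical GetRecord response (Dublin Core) with the standard library; a real harvester would fetch the XML over HTTP from the repository’s OAI-PMH endpoint.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical OAI-PMH GetRecord response in oai_dc format.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Example rainfall radar dataset</dc:title>
      <dc:creator>Callaghan, S.</dc:creator>
      <dc:identifier>doi:10.5285/EXAMPLE</dc:identifier>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

NS = {"dc": "http://purl.org/dc/elements/1.1/"}

def extract_dc_fields(xml_text):
    """Pull the Dublin Core fields a data journal could pre-fill
    on submission: title, creators and identifier."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext(".//dc:title", namespaces=NS),
        "creators": [e.text for e in root.findall(".//dc:creator", NS)],
        "identifier": root.findtext(".//dc:identifier", namespaces=NS),
    }
```

Pre-filling the submission form from these fields is one way to cut duplicate typing for the author and keep the data paper consistent with the repository record.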

Repository accreditation
The link between data paper and dataset is crucial!
• How do data journal editors know a repository is trustworthy?
• How can repositories prove they’re trustworthy?

What makes a repository trustworthy? Many things: mission, processes, expertise, workflows, history, systems, documentation, ...
• Assessing trustworthiness requires assessing the entire repository workflow
• Peer review of data is implicitly peer review of the repository
• And what does “trustworthy” mean, when you get right down to it?

Are these the right requirements? Repositories should:
• Have long-term data preservation plans in place for their archive.
• Actively manage and curate the data in their archive.
• Provide landing pages giving extra information about the dataset (metadata) and information on how to access the data.
• Use persistent, actionable links (e.g., DOIs, ARKs) to cite data held in their archive.
• Resolve cited dataset links to landing pages.
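“Persistent, actionable” means the identifier can be mechanically turned into a URL that resolves to the landing page. A minimal sketch, assuming the standard doi.org and n2t.net resolvers; the regex is deliberately loose (real DOI syntax rules are more permissive) and the function name is illustrative.

```python
import re

# Loose DOI pattern: "10." + registrant code, then a non-empty suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def actionable_link(identifier):
    """Turn a bare DOI or ARK into the URL a reader can follow
    to the dataset's landing page."""
    if DOI_RE.match(identifier):
        return "https://doi.org/" + identifier
    if identifier.startswith("ark:"):
        # ARKs resolve through a resolver such as n2t.net
        return "https://n2t.net/" + identifier
    raise ValueError(f"unrecognised identifier: {identifier!r}")
```

The requirement on repositories is the other half of this contract: whatever URL the resolver redirects to must be a landing page with metadata and access information, not a bare file download or a dead link.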

Peer-review of data
Lots of issues to deal with:
• Technical
  – author guidelines for GDJ
  – NERC Data Value Checklist
  – implicit peer review of repository?
• Scientific
  – pre-publication?
  – post-publication? e.g. F1000R
  – guidelines on uncertainty, e.g. IPCC
  – discipline-specific?
  – EU INSPIRE spatial formatting
• Societal
  – contribution to human knowledge
  – reliability

http://libguides.luc.edu/content.php?pid=5464&sid=164619

Things to think about when reviewing data:

• Data article
  – describes the experiment in a reader-friendly way
  – may contain quick-look plots of the data
  – should describe how, when and why the data were created, so that the quality of the scientific method can be examined
  – should provide information on the importance, uniqueness and applicability of the data to other purposes
• Dataset metadata
  – clearly identify and describe the data
• The data themselves
  – are readable through standard tools, or have software provided
  – are accessible through the repository

Image credit: Borepatch http://borepatch.blogspot.com/2010/06/its-not-what-you-dont-know-that-hurts.html
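The “readable through standard tools” checklist item above can itself be partly automated. A sketch for the simplest case, a CSV file, using only the standard library; the function name and the specific checks (consistent column count, at least one data row) are illustrative choices, not a GDJ requirement.

```python
import csv
import io

def check_csv_readable(text, min_rows=1):
    """Reviewer-style sanity check on a CSV file's contents:
    it parses, every row has the same number of columns as the
    header, and there is at least min_rows of data.
    Returns a list of issues (empty means the checks passed)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    issues = []
    width = len(rows[0])
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            issues.append(f"line {i}: {len(row)} columns, expected {width}")
    if len(rows) - 1 < min_rows:
        issues.append("no data rows")
    return issues
```

For binary formats (NetCDF, HDF, shapefiles) the equivalent check is that the file opens cleanly in the community-standard reader for that format, which is why the checklist allows “or have software provided”.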

Tell us what you think
• Workshop on repository accreditation at the Digital Curation Conference, Thurs 17th January 2013 – fully booked! http://www.dcc.ac.uk/events/idcc13/workshops
• Workshop on peer-review of data planned for March 2013 at the British Library
• Project website: http://proj.badc.rl.ac.uk/preparde/wiki
• Project blog: http://proj.badc.rl.ac.uk/preparde/blog
• Always happy to get input from others!

Thanks! Any questions?

The project is led by the University of Leicester, and the support of JISC and NERC in funding the PREPARDE project is gratefully acknowledged.

Composite image from Flickr user bnilsen and Matt Stempeck (NOI), shared under a Creative Commons licence

PREPARDE objectives
• capture and manage the workflows required to operate the Geoscience Data Journal
  – from submission of a new data paper and dataset, through review and to publication
• develop procedures and policies for authors, reviewers and editors
  – allow the Geoscience Data Journal to accept data papers as submissions for publication
  – focus on guidelines for scientific reviewers who will review the datasets
• incorporate some technical developments at the point of submission
  – data visualisation checks
  – interface improvements
  – enhance the resulting data publications
• put in place the procedures needed for data publication in the California Digital Library
• interact with the wider scientific and data community
  – provide recommendations on accreditation requirements for data repositories
• engage the user and stakeholder community
  – promote long-term sustainability and governance of data journals

Project team, roles and responsibilities
• University of Leicester (UoL): project lead, academic liaison and administration
  – liaise with a range of academic stakeholders internationally => input to workshops and guidelines
• British Atmospheric Data Centre (BADC): project management
  – provide technical input into the cross-linking, workflows and scientific review work packages
• Wiley-Blackwell: provide publishing platform
  – work with partners to develop and implement workflows & support technical enhancements to improve the authorial and readership experiences
• California Digital Library (CDL): investigate a lightweight data paper convention
  – informal (crowd-sourced) review
  – contribute a broader international perspective from outside the Earth sciences
• US National Center for Atmospheric Research (NCAR): use cases of data review methods
  – feedback on development of peer review guidelines and workflows for the Geoscience Data Journal
  – contribute to the data repository accreditation report and workshops
• F1000: contribute a broader perspective from the biomedical sciences
  – extend the impact of PREPARDE to the biomedical community through the launch of F1000 Research
  – use of the F1000 Advisory Panel and F1000 Faculty & co-organise a workshop
• Digital Curation Centre (DCC): assess peer review models & overlap with data repository appraisal and ingest processes
  – respective roles of main stakeholders
  – workflows and interoperability with data centres and publishers
  – Trusted Digital Repository certification

GDJ Rationale and Incentives

• Publishing a dataset in a data journal will provide academic credit to data scientists, without diverting effort from their primary work of ensuring data quality.

• Funders want to get the best possible science for their money. Publication in a data journal ensures that the dataset is uploaded to a trusted repository where it will be backed up, archived and curated, and so won’t be vulnerable to bit-rot or to being lost or stored on obsolete media.

• The peer-review process also reassures the funder that the published dataset is of good quality and that the experiment was carried out appropriately.

• Data journals will be a good starting point for researchers outside the immediate field to find out what sort of data is available and how to access it.

• Data publication will help show transparency in the scientific process, improving public accountability.

• There are opportunities to form partnerships with other organisations with the same goal of data publication, to exploit common activities and achieve wider community buy-in – for example, the CODATA-ICSTI Task Group on Data Citation Standards and Practices, DataCite and others.