An Leabharlann UCD Julia Barrett UCD James Joyce Library Data Caring: Why Manage Your Research Data?



Outline

•Research data: definition

•Data management drivers

•Data management benefits

•Data management components

•Documentation practices

•File management

•Storage

•Access management

•Data sharing

•Data repositories

•Help @ UCD

What is research data?

“The data, records, files or other evidence, irrespective of their content or

form (e.g. in print, digital, physical or other forms), that comprise research

observations, findings or outcomes, including primary materials and analysed

data.” – Australian National Data Service

Examples:

•Statistics and measurements

•Results of experiments or simulations

•Laboratory notebooks

•Observations e.g. fieldwork

•Survey results – print or online

•Interview recordings and transcripts

•Images, from cameras and scientific equipment

Five Top Reasons to Protect Your Data and Practise Safe Science

https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/

1. Data output is growing rapidly

• 90% of all the data in the world has been

generated over the last 2 years. Scientific

data output is currently increasing at an

annual rate of 30%.

• Graphic
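To get a feel for the 30% figure above, a constant annual growth rate can be converted into a doubling time. This is a minimal sketch that assumes the growth compounds once a year at a fixed rate, which the slide does not state; the function name is illustrative only.

```python
import math


def doubling_time_years(annual_growth_rate: float) -> float:
    """Years needed for a quantity to double at a fixed compound annual rate."""
    if annual_growth_rate <= 0:
        raise ValueError("growth rate must be positive")
    # Solve (1 + r) ** t == 2 for t.
    return math.log(2) / math.log(1 + annual_growth_rate)


# Assumption: the quoted 30% rate stays constant and compounds annually.
print(f"{doubling_time_years(0.30):.1f} years")
```

Under that assumption, scientific data output would double roughly every 2.6 years.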

2. Despite significant investment, data is

not being managed effectively

• $1.5 trillion is the current estimated total global

spend on research and development, which could be

at risk. Much of the data generated is lost – in one

study, the odds of sourcing datasets declined by

17% each year, with 80% of datasets over 20 years

old not available.
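The 17% yearly decline quoted above applies to odds, not to probabilities, so it compounds in a way that is easy to misread. The sketch below shows the arithmetic, assuming the decline is constant and compounds annually. The starting probability of 0.9 is a hypothetical value chosen for illustration and does not come from the study, and the 80% figure on the slide is a separate observation that this calculation does not reproduce.

```python
def remaining_odds_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the original odds left after compounding a yearly decline."""
    if not 0 <= annual_decline < 1:
        raise ValueError("annual_decline must be in [0, 1)")
    return (1 - annual_decline) ** years


def probability_after(p0: float, annual_decline: float, years: int) -> float:
    """Convert a starting probability to odds, shrink the odds, convert back."""
    if not 0 < p0 < 1:
        raise ValueError("p0 must be strictly between 0 and 1")
    odds = p0 / (1 - p0) * remaining_odds_fraction(annual_decline, years)
    return odds / (1 + odds)


# After 20 years the odds are a small fraction of their starting value.
print(f"{remaining_odds_fraction(0.17, 20):.3f}")

# Hypothetical starting point: a 90% chance of obtaining a brand-new dataset.
print(f"{probability_after(0.9, 0.17, 20):.2f}")
```

With these inputs the odds fall to about 2.4% of their original value after 20 years.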

3. Much of the data remains unverifiable

• 54% of the resources used across 238

published studies could not be identified,

making verification impossible.

4. Time and money are wasted, impacting on science and society

• Since 2000, over 80,000 patients have taken

part in clinical trials based on research that

was later retracted because of error or fraud.

The number of retractions due to error has

grown over fivefold since 1990.

5. Funders increasingly require data

management and sharing policies

• Key funding bodies such as the NIH, MRC and Wellcome Trust now request that data management plans be part of applications.

Why manage data?

I want to be able to find my data two years from now

My colleague left 6 months ago and I can’t make any sense of his data

On a recent train journey my shampoo leaked into my laptop and I lost my files….

I dropped my laptop, lost about 2 months of work, yeah I know, should make backups more often, but I could never think of this happening.

Lost data

At more than 500 laundromats and dry cleaners in the UK, 17,000 USB flash drives were left behind between December 2010 and January 2011. According to the study’s researchers at Credant Technologies, that’s a 400 percent increase in lost devices compared to the year before.

• http://blog.allusb.com/2011/03/rising-trend-in-lost-usb-flash-drives/

• http://www.connachttribune.ie/galway-news/item/1372-blaze-that-destroyed-galway-hse-unit-suspicious-say-gardai

Benefits of Managing Data

• Saves time – being able to find

things

• Reduces possibility of data loss

through managed back-ups,

storage and security processes

• Reduces errors e.g. badly

described data, confusion

between file versions

• Enables you and others to find

and understand what you have

done through the provision of

descriptions, metadata, file

management etc.

• Provides evidence of work

undertaken

• Provides evidence of validity

of work undertaken

• Verifies – provides evidence

of logical processes and

methods

• Ensures retraceability and

reproducibility

Data Management Plan (DMP)

A data management plan is a formal and practical document developed at the start of a research project which outlines all aspects of the data, including:

• The nature of your data

• How it is organised and described

• How it is shared with others

• How it will be stored in the long term – https://www.admin.ox.ac.uk/rdm/dmp/plans/

Developing a data management plan helps to ensure the research data are accurate, complete, reliable, and secure both during and after completion of the research.

Funding bodies increasingly require that grant applications include data management plans. For example, the policy of the NSF (National Science Foundation, a US funding agency) states that, as of January 18, 2011, all NSF proposals must include a supplementary document of no more than two pages labelled “Data Management Plan”.

http://www.youtube.com/watch?v=Lc82pxxRkMo

Data management components

Storage

• Backup strategy; Security

File management

• File organisation; File naming; File formats

Documentation practices

• Project documentation; process documentation; data documentation

Access management

• Data sharing; publishing; archiving

Data Management Checklist

http://www.ucd.ie/t4cms/Guide121.pdf

Documentation practices

PROJECT

DESCRIPTION

• Project title

• The aim/ purpose of

the research

• Project duration

INVENTORIES

• Servers, directories,

data, lab equipment

etc.

CONTEXT

• Principal investigator

• Researchers/other project

members

• Main contact details

• Collaborators/Partner

Institutions

• Roles and responsibilities

• Funding source(s) and

requirements

• Budget

Documentation practices

• Processes

– Sometimes individual effort, sometimes collaborative

– Protocols, code commentary

– Workflow descriptions/diagrams

• Data Capture

– How will data be created?

– Any special hardware / software requirements?

– How will metadata be captured, created and managed?

Data documentation

• “A crucial part of making data user-friendly, shareable and with long-lasting usability is to ensure they can be understood and interpreted by any user. This requires clear data description, annotation, contextual information and documentation.

• Data documentation explains how data were created or digitised, what data mean, what their content and structure are, and any manipulations that may have taken place. It ensures that data can be understood during research projects, that researchers continue to understand data in the longer term and that re-users of data are able to interpret the data. Good documentation is also vital for successful data preservation.” (UK Data Archive).

• Good documentation ensures your data can be:

– Searched for and retrieved

– Understood now and in the future

– Properly interpreted, as relevant context is available.

Metadata Definition

Metadata is...data about data

“Structured information that describes,

explains, locates, or otherwise makes it easier

to retrieve, use, or manage an information

resource”.

It enables:

• resource discovery and retrieval

• data sharing and reuse – allows data to be

interpreted or analysed by others

• management of resources – records aspects of

the production and curation process, rights

information, location and access information

Understanding metadata, NISO, 2004

Metadata: Data about Data

Three broad categories of metadata are:

Descriptive - common fields such as title, author,

abstract, keywords which help users to discover

online sources through searching and browsing.

Administrative - preservation, rights management,

and technical metadata about formats.

Structural - how different components of a set of

associated data relate to one another, such as a

schema describing relations between tables in a

database.
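The three categories above can be pictured as three groups of fields in a single record. The sketch below is only an illustration: every field name and value is invented, and a real project would follow the metadata standard required by its repository or funder.

```python
import json

# Hypothetical record: every value below is invented for illustration.
record = {
    "descriptive": {
        "title": "Example survey of commuting habits",
        "author": "A. Researcher",
        "abstract": "Placeholder abstract.",
        "keywords": ["commuting", "survey"],
    },
    "administrative": {
        "rights": "Research use only",
        "file_format": "text/csv",
        "preservation_note": "Checksum recorded at deposit",
    },
    "structural": {
        # How the parts of the dataset relate to one another.
        "tables": {
            "respondents": ["id", "age_band"],
            "trips": ["respondent_id", "mode"],
        },
        "relations": [["trips.respondent_id", "respondents.id"]],
    },
}

# JSON is a plain-text, vendor-independent way to store the record.
print(json.dumps(record, indent=2))
```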

Metadata answers basic questions about data

• Who created and maintains the data? Who can access it?

• Why was the data created?

• What is the content and structure of the data? What changes have been made to it?

• When collected? When published?

• Where is the geographic location?

– Where is the data held?

• How was the data produced?

“What information would I need to understand and use this data in twenty years?”
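The who, why, what, when, where and how questions above can double as a completeness check on a draft metadata record. This is a minimal sketch: the key names and the sample draft are hypothetical, not part of any metadata standard.

```python
# Keys are illustrative shorthand for the questions listed above.
QUESTIONS = {
    "who": "Who created and maintains the data? Who can access it?",
    "why": "Why was the data created?",
    "what": "What is the content and structure of the data?",
    "when": "When collected? When published?",
    "where": "Where is the geographic location? Where is the data held?",
    "how": "How was the data produced?",
}


def unanswered(metadata: dict) -> list:
    """Return the question keys that are missing or blank in a metadata record."""
    missing = []
    for key in QUESTIONS:
        value = metadata.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# Hypothetical draft record with gaps.
draft = {"who": "J. Bloggs", "what": "Interview transcripts", "when": "  "}
for key in unanswered(draft):
    print(f"Still to document: {QUESTIONS[key]}")
```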

Considerations for choosing a standard

• The discipline, domain

• The format of the data

• Repository or funder requirements

• Recognition and/or certification of standard

• Controlled vocabularies, thesauri, and

authorities

• Available metadata tools

• Skills required and time available

Metadata creation tools

ISSDA Data Deposit Form

Metadata creation tools

Knowledge Network for Biocomplexity

Morpho User Guide

https://knb.ecoinformatics.org/software/morpho/MorphoUserGuide.pdf

Organising Data & File Formats

• File structure

• Folder structure

• File and folder naming conventions

• Versioning

• File formats

– Choose platform and vendor-independent file formats to ensure the best chance for future compatibility

• File transformation

Organising Data: Good File Management

• Research data files and folders need to be labelled and organised in a

systematic way so that they are both identifiable and accessible for current and

future users. The benefits of consistent data file labelling are:

Data files are distinguishable from each other within their containing folder

Data file naming prevents confusion when multiple people are working on shared files

Data files are easier to locate and browse

Data files can be retrieved not only by the creator but by other users

Data files can be sorted in logical sequence

Data files are not accidentally overwritten or deleted

Different versions of data files can be identified

If data files are moved to another storage platform their names will retain useful context

http://www.youtube.com/watch?v=Z_ysxiAGKC8&feature=player_embedded

• Use a combination of

different types of

information to make

the context and

content of a file

clear, e.g.

– Data source

– Measured variable

– Experiment

– Date

“AHRC_TechnicalApp_Response20120925.docx” rather than:

“what we got back from funders about the data stuff.docx”
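A naming convention is easier to keep to when file names are generated, not typed by hand. The sketch below rebuilds the first example above from its parts. The rules it enforces (letters and digits only within a part, underscores between parts, a YYYYMMDD date at the end) are an assumption read off that one example and not a UCD standard; the function name is illustrative.

```python
import re
from datetime import date

# Letters and digits only within a part; no spaces or punctuation.
_PART = re.compile(r"^[A-Za-z0-9]+$")


def build_filename(parts, when: date, extension: str) -> str:
    """Join descriptive parts with underscores and append a YYYYMMDD date."""
    if not parts:
        raise ValueError("at least one descriptive part is required")
    for part in parts:
        if not _PART.match(part):
            raise ValueError(f"invalid part: {part!r}")
    return f"{'_'.join(parts)}{when:%Y%m%d}.{extension.lstrip('.')}"


print(build_filename(["AHRC", "TechnicalApp", "Response"], date(2012, 9, 25), "docx"))
```

Writing the date as year, month, day means that files sharing the same prefix sort chronologically in a folder listing.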

Data storage and backup

• Estimated size of data; growth rate

• Where (physically) will you

store the data? Server,

pc/laptop, external storage

device…geographically

distributed…

• On what media will you store

the data?

• Whose responsibility is the

storage of the data?

• How will you transmit the data,

if required?

• How is your data backed up?

• How often is your data backed

up?

• Who is responsible for this?

• Avoid single points of failure

– Use managed networked storage whenever possible

– Move data off portable media

– Make multiple backups: Lots of Copies Keeps Stuff Safe (LOCKSS)

– Be wary of software lifespans
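Making multiple backups only helps if each copy is intact. The sketch below copies a file into several folders and compares SHA-256 checksums afterwards. It is an illustration under simple assumptions and not a replacement for managed networked storage: the function names are hypothetical, and the code cannot tell whether the destination folders really sit on separate devices.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, destinations) -> list:
    """Copy source into each destination folder and verify every copy."""
    expected = sha256_of(source)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = Path(shutil.copy2(source, folder / source.name))
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Storing the checksum alongside the data also makes it possible to detect silent corruption later on.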

Data security

• How will you ensure the security of your

data?

– How will data be shared during the project?

– How will you organise access to sensitive data?

– How will you enforce permissions, restrictions and embargoes?

– Other security issues e.g. damage, theft

• Information Security Risk Assessment for UCD Research Groups online survey

– https://docs.google.com/a/ucd.ie/spreadsheet/viewform?formkey=dGV3QUF4UGxTaTkweDFJWlhiU1g2VVE6MA#gid=0

Access management: Ethics and IP

• Are there any ethical or privacy issues that may

prohibit the sharing of some or all of the dataset/s?

• If so what possible ways might there be to resolve

these? (E.g. referral to UCD’s Ethics Committee;

anonymisation of data; formal consent

agreements; different levels of access to data, e.g.

research purposes only, no commercial)
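One of the options above, anonymisation of data, often starts with replacing direct identifiers by codes. The sketch below shows that step only, using a keyed hash. It is pseudonymisation, which is weaker than full anonymisation, and whether it is sufficient for a given dataset is a question for the ethics review; the field names, function names and key are hypothetical.

```python
import hashlib
import hmac


def pseudonym(identifier: str, secret_key: bytes) -> str:
    """Keyed hash of an identifier; the same input always maps to the same code."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def pseudonymise(rows, id_field, drop_fields, secret_key):
    """Return copies of the rows with direct identifiers removed or replaced."""
    cleaned = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in drop_fields}
        new_row[id_field] = pseudonym(str(row[id_field]), secret_key)
        cleaned.append(new_row)
    return cleaned


# Hypothetical survey rows; the key must be stored separately from the data.
rows = [
    {"participant": "P001", "name": "Jane Example", "score": 4},
    {"participant": "P002", "name": "John Example", "score": 2},
]
shared = pseudonymise(rows, "participant", {"name"}, b"keep-this-key-private")
print(shared)
```

Because the same participant always receives the same code, records can still be linked across files without exposing the original identifier.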

• Who owns the copyright and other intellectual

property?

Data sharing: drivers and benefits

• Facilitating research and discovery

• Scientific integrity

• Funders and government

• Journal publishers

• Recognition and impact

• Collaboration

• Funding application advantage

Facilitating research and discovery

• Cucumbers, E. coli and open data: the 2011 outbreak of E. coli poisoning in Germany illustrated the changes in attitudes to sharing scientific research and data. Within weeks of the outbreak, the genome of the bacterium was identified and, given the seriousness of the outbreak, the results were published on the Internet as soon as they were available.

Facilitating research and discovery

• “As research becomes more data intensive,

research datasets increase in number and size.

Re-using (combinations of) research datasets

produced by researchers in the same discipline or

from different disciplines brings about novel

approaches, such as data exploration, simulation

and modelling, system level science, and

transdisciplinary research”.

• Van der Graaf, M. and Waaijers, L. (2011). A Surfboard for Riding the Wave. Towards a

four country action programme on research data. A Knowledge Exchange Report, available

from www.knowledge-exchange.info/surfboard

Scientific integrity

• Publishing research data and citing its location in

published research papers allows others to replicate,

validate or build upon your results thus improving

the scientific record by encouraging scientific

enquiry and debate.

• Openly sharing research data also encourages the improvement and validation of research methods and minimises the need for data re-collection.

• Verify results; uncover errors

• Example: a paper that contained errors and excluded some data, which significantly undermined its results.

• The results were published in a prestigious journal, the American Economic Review, which failed to enforce its own data availability policy.

Funders and Government

• NERC (Natural Environment Research Council)

expects everyone that it funds to manage the data

they produce in an effective manner for the lifetime

of their project, and for these data to be made

available for others to use with as few restrictions as

possible, and in a timely manner.

• To protect the research process NERC will allow those

who undertake NERC-funded work a period of time to

work exclusively on, and publish the results of, the

data they have collected. This period will normally be

a maximum of two years from the end of data

collection.

• UK Funders’ Data Policies: www.dcc.ac.uk/resources/policy-and-legal/funders-data-policies
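NERC's exclusive-use period, described above as normally a maximum of two years from the end of data collection, translates into a simple date calculation. The sketch below is illustrative only: the function name is hypothetical, the two-year default is taken from the wording above, and the actual period for a project is set by the funder.

```python
from datetime import date


def latest_release_date(end_of_collection: date, max_years: int = 2) -> date:
    """Date max_years after data collection ended."""
    target_year = end_of_collection.year + max_years
    try:
        return end_of_collection.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use the 28th.
        return end_of_collection.replace(year=target_year, day=28)


# Hypothetical project whose data collection ended on 30 June 2013.
print(latest_release_date(date(2013, 6, 30)))
```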

Funders and Government

• Research data in general should be deposited

whenever this is possible, and linked to associated

publications where this is appropriate. It should be

made openly accessible, in keeping with best practice

for reproducibility of scientific results.

– European and national data protection rules must be taken into account in relation to research data, as well as concerns regarding trade secrets and intellectual property rights, confidentiality, or national security.

– At a minimum, metadata describing research data and its location and access rights should be deposited.

– It is recognized that managing access to research data may be a new approach for many research organisations. This policy is intended to encourage the improvement of discoverability and development of open access to research data over time.

National Advisory Committee on Drugs and

Alcohol: NACDA

Research Data Management Policy

• Specifies:

– Copyright (owned by NACDA)

– Data quality

– Provision of supporting material

– Data security

– Confidentiality & protection of personal data by data anonymisation

– Data sharing

– Deposit to a data archive / repository

– Informed consent must not preclude data sharing beyond the original research

– Data management plan requirement

Funders and Government

• Minister Howlin said “The public service needs to share data

to deliver services, and more data-sharing will be necessary

to deliver the joined-up services we aspire to. At the same

time data protection and privacy are concerns for all of us.

Today we agreed that a new legal framework is required to

enable the public service deliver the next generation of

services both effectively and securely.”

Journal Publishers

• An increasing number of journal publishers require the sharing of associated data

• DRYAD (www.datadryad.org/ ) – an international

repository that manages the research data

underpinning peer-reviewed articles in the

biosciences.

– Landscape of public data archives is patchy; Dryad fills a gap http://www.youtube.com/watch?v=RP33cl8tL28

• Figshare (http://figshare.com/ ) - repository where

users can make all of their research outputs available

in a citable, shareable and discoverable manner.

Partnering with Taylor & Francis to host the

supplemental data to T&F published papers.

Recognition and Impact

• Others who re-use data and cite it in

their own research help to spread the

word about the research and increase its

impact

• Increased citation rates

• Piwowar H, Vision TJ. (2013) Data reuse and the

open data citation advantage. PeerJ PrePrints

1:e1v1

http://dx.doi.org/10.7287/peerj.preprints.1v1

Collaboration

• Data sharing may lead to new collaborations between

data users and data creators. Sharing data can often

lead to improvements such as corrections in the

documentation, or combination or comparison of

datasets leading to new information.

• Collaboration drives more research

Collaboration

• NASA Landsat satellite imagery of the Earth’s surface environment, collected over the last 40 years, was sold through the US Geological Survey for US$600 per scene until 2008, when it became freely available from the Survey over the internet.

• Usage leapt from sales of 19,000 scenes per year, to

transmission of 2,100,000 scenes per year. Google Earth now

uses the images.

• There has been great scientific benefit, not least to the

Geological Survey, which has seen a huge increase in its

influence and its involvement in international collaboration.

Funding application advantage

• Professor Derek Offord, Department of Russian, was awarded £800,000 from the Arts and Humanities Research Council to conduct the first large-scale history of the French language in Russia. However, upon initial submission of the funding proposal, peer reviewers criticised the data sharing plans and suggested that the application’s Technical Appendix should be rewritten.

• While the intellectual excellence of the proposal was not in

doubt, without resubmission with appropriate changes to

data sharing plans, the application would not have been

successful.

– Data management and data sharing plans are important: they can be a competitive advantage

• http://data.bris.ac.uk/files/2013/06/data-bris-benefits-report-V2.pdf

Barriers to sharing

• A huge amount of data ends up unpublished, unshared and essentially wasted. This is another form of data loss, particularly for datasets that have clear scope for wider research use, decision-making and policy making, and that hold significant long-term value

• Tension between the pressure to make data

more open earlier on and the real fear that

researchers have that if they do that others will

reap the benefits from the hard work they’ve

done

• Culture of “my” data

Why might public access to your

research data be restricted?

• “We intend to make a patent application, and

must avoid prior disclosure.”

• “Don’t want to make locations of members of

endangered species available to poachers.”

• “The research data are confidential because of

the arrangement my research group has made

with the commercial partner sponsoring our

research.”

• “My data form part of a long-term study upon

which my research group is entirely reliant for its

on-going research publications and academic

reputation. We only share this with trusted

colleagues.”

Examples of Repositories

• Earthchem www.earthchem.org/

– geosciences, with particular emphasis on geochemical, geochronological, and petrological data

• KNB http://knb.ecoinformatics.org

– biosciences, ecology, evolutionary biology

• GenBank http://www.ncbi.nlm.nih.gov/genbank/

– DNA sequence data

• Machine Learning Data Repository http://archive.ics.uci.edu/ml/datasets.html

– a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms

• Irish Social Science Data Archive www.ucd.ie/issda

Advantages of a repository

• Provides a metadata structure / metadata form

for you to fill in

• Publishes the data for you by giving your dataset

a unique identifier, e.g. DOI

• Serves as a backup vehicle for your data

• May preserve your data for the future

• Makes sharing your data easy

• Others may cite your research more

Locating relevant datasets using

“portals”

• Databib

– http://databib.org/

• Registry of Research Data Repositories

– http://www.re3data.org/

• CalPoly’s LibGuide

– http://libguides.calpoly.edu/content.php?pid=277668&sid=2288020

Using Google to locate data

• Astronomy dataset OR "data archive" OR "data portal"

• hydrogeology OR groundwater dataset OR "data archive" OR "data portal"

• migration dataset OR "data archive" OR "data portal"
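The three searches above follow one pattern: one or more topic terms followed by the same data-related phrases. A small helper can generate the query for any topic. This is a minimal sketch; the function name is illustrative, and the phrases are simply those used in the examples above.

```python
def data_search_query(*topic_terms: str) -> str:
    """Build a search string that pairs topic terms with data-related phrases."""
    if not topic_terms:
        raise ValueError("at least one topic term is required")
    topic = " OR ".join(topic_terms)
    return f'{topic} dataset OR "data archive" OR "data portal"'


print(data_search_query("hydrogeology", "groundwater"))
```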

Help @ UCD

• Data storage, backup and security

– UCD’s Research IT support team is available to discuss the options available to you regarding data storage or any of your IT requirements. Contact [email protected] or [email protected]

• www.ucd.ie/itservices/researchit/

• Intellectual property

– For queries regarding intellectual property and support for researchers interested in commercialisation please contact Caroline Gill, Innovation Education Manager [email protected]

• www.ucd.ie/innovation/researchers/

• Research ethics

– Research Ethics Administrator. One-to-one consultations with researchers who are about to submit for either a full review or exemption. Contact Jan Stokes, Research Ethics Administrator [email protected]

• www.ucd.ie/researchethics/

• Research data management checklist

– Some assistance can be given by UCD Library. Contact Julia Barrett, Research Services Manager, UCD Library [email protected]

• www.ucd.ie/library/supporting_you/research_support/data_management/

Finally….take away with you….

• STORE – Three copies on Three Disks in Three

Locations

• ORGANISE – If you make a plan, you just might

follow it

• DOCUMENT – What would my colleagues need to

know to understand this data?

• SHARE – Data makes an impact (to you, to your

research group, to society)
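The STORE rule above (three copies on three disks in three locations) is simple enough to check mechanically against an inventory of where copies are kept. The sketch below assumes such an inventory exists; the disk and location labels are hypothetical.

```python
from typing import NamedTuple


class Copy(NamedTuple):
    disk: str
    location: str


def meets_three_three_three(copies) -> bool:
    """True for at least 3 copies on 3 distinct disks in 3 distinct locations."""
    disks = {c.disk for c in copies}
    locations = {c.location for c in copies}
    return len(copies) >= 3 and len(disks) >= 3 and len(locations) >= 3


# Hypothetical inventory: the third copy shares a building with the first.
inventory = [
    Copy(disk="laptop-ssd", location="office"),
    Copy(disk="network-share", location="campus-datacentre"),
    Copy(disk="usb-drive", location="office"),
]
print(meets_three_three_three(inventory))
```

In this example the check fails because only two distinct locations are covered, even though there are three copies on three disks.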