An Leabharlann UCD
Julia Barrett
UCD James Joyce Library
Data Caring: Why Manage
Your Research Data?
Outline
•Research data: definition
•Data management drivers
•Data management benefits
•Data management components
•Documentation practices
•File management
•Storage
•Access management
•Data sharing
•Data repositories
•Help @ UCD
What is research data?
“The data, records, files or other evidence, irrespective of their content or
form (e.g. in print, digital, physical or other forms), that comprise research
observations, findings or outcomes, including primary materials and analysed
data.” – Australian National Data Service
Examples:
•Statistics and measurements
•Results of experiments or simulations
•Laboratory notebooks
•Observations e.g. fieldwork
•Survey results – print or online
•Interview recordings and transcripts
•Images, from cameras and scientific equipment
Five Top Reasons to Protect Your Data
and Practise Safe Science
https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/
1. Data output is growing rapidly
• 90% of all the data in the world has been
generated over the last 2 years. Scientific
data output is currently increasing at an
annual rate of 30%.
2. Despite significant investment, data is
not being managed effectively
• $1.5 trillion is the current estimated total global
spend on research and development, which could be
at risk. Much of the data generated is lost – in one
study, the odds of sourcing datasets declined by
17% each year, with 80% of datasets over 20 years
old not available.
80% of datasets over 20 years old not
available
www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
3. Much of the data remains unverifiable
• 54% of the resources used across 238
published studies could not be identified,
making verification impossible.
4. Time and money are wasted, impacting
on science and society
• Since 2000, over 80,000 patients have taken
part in clinical trials based on research that
was later retracted because of error or fraud.
The number of retractions due to error has
grown over fivefold since 1990.
5. Funders increasingly require data
management and sharing policies
• Key funding bodies such as the NIH, MRC
and Wellcome Trust now request data
management plans be part of
applications.
Why manage data?
I want to be able to find my data two years from now
My colleague left 6 months ago and I can’t make any sense of his data
On a recent train journey my shampoo leaked into my laptop and I lost my files….
I dropped my laptop, lost about 2 months of work, yeah I know, should make backups more often, but I could never think of this happening.
Lost data
At more than 500 laundromats and dry cleaners in the UK, 17,000 USB flash drives were left behind between December 2010 and January 2011. According to the study’s researchers at Credant Technologies, that’s a 400 percent increase in lost devices compared to the year before.
•http://blog.allusb.com/2011/03/rising-trend-in-lost-usb-flash-drives/
• http://www.connachttribune.ie/galway-news/item/1372-blaze-that-destroyed-galway-hse-unit-suspicious-say-gardai
Benefits of Managing Data
• Saves time – being able to find
things
• Reduces possibility of data loss
through managed back-ups,
storage and security processes
• Reduces errors e.g. badly
described data, confusion
between file versions
• Enables you and others to find
and understand what you have
done through the provision of
descriptions, metadata, file
management etc.
• Provides evidence of work
undertaken
• Provides evidence of validity
of work undertaken
• Verifies – provides evidence
of logical processes and
methods
• Ensures retraceability and
reproducibility
Data Management Plan (DMP)
A data management plan is a formal and practical document developed at the start of a research project which outlines all aspects of the data, including:
• The nature of your data
• How it is organised and described
• How it is shared with others
• How it will be stored in the long-term – https://www.admin.ox.ac.uk/rdm/dmp/plans/
Developing a data management plan helps to ensure the research data are accurate, complete, reliable, and secure both during and after completion of the research.
Funding bodies increasingly require that grant applications include data management plans. For example, current NSF (National Science Foundation - American funding agency) policy states that as of January 18, 2011 all NSF proposals must have a supplementary document of no more than two pages labelled data management plan.
http://www.youtube.com/watch?v=Lc82pxxRkMo
Data management components
Storage
• Backup strategy; Security
File management
• File organisation; File naming; File formats
Documentation practices
• Project documentation; process documentation; data documentation
Access management
• Data sharing; publishing; archiving
Data Management Checklist
http://www.ucd.ie/t4cms/Guide121.pdf
Documentation practices
PROJECT
DESCRIPTION
• Project title
• The aim/ purpose of
the research
• Project duration
INVENTORIES
• Servers, directories,
data, lab equipment
etc.
CONTEXT
• Principal investigator
• Researchers/other project
members
• Main contact details
• Collaborators/Partner
Institutions
• Roles and responsibilities
• Funding source(s) and
requirements
• Budget
Documentation practices
• Processes
– Sometimes individual effort, sometimes collaborative
– Protocols, code commentary
– Workflow descriptions/diagrams
• Data Capture
– How will data be created?
– Any special hardware / software requirements?
– How will metadata be captured, created and managed?
Data documentation
• “A crucial part of making data user-friendly, shareable and
with long-lasting usability is to ensure they can be
understood and interpreted by any user. This requires
clear data description, annotation, contextual information
and documentation
• Data documentation explains how data were created or
digitised, what data mean, what their content and
structure are, and any manipulations that may have taken
place. It ensures that data can be understood during
research projects, that researchers continue to understand
data in the longer term and that re-users of data are able
to interpret the data. Good documentation is also vital for
successful data preservation.” (UK Data Archive).
• Good documentation ensures your data can be:
– Searched for and retrieved
– Understood now and in the future
– Properly interpreted, as relevant context is available.
Metadata Definition
Metadata is...data about data
“Structured information that describes,
explains, locates, or otherwise makes it easier
to retrieve, use, or manage an information
resource”.
It enables:
• resource discovery and retrieval
• data sharing and reuse – allows data to be
interpreted or analysed by others
• management of resources – records aspects of
the production and curation process, rights
information, location and access information
Understanding metadata, NISO, 2004
Metadata: Data about Data
Three broad categories of metadata are:
Descriptive - common fields such as title, author,
abstract, keywords which help users to discover
online sources through searching and browsing.
Administrative - preservation, rights management,
and technical metadata about formats.
Structural - how different components of a set of
associated data relate to one another, such as a
schema describing relations between tables in a
database.
Metadata answers basic questions about data
• Who created and maintains the data? Who can access it?
• Why was the data created?
• What is the content and structure of the data? What changes have been made to it?
• When collected? When published?
• Where is the geographic location?
– Where is the data held?
• How was the data produced?
“What information would I need to understand and use this data in twenty years?”
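The questions above can be sketched as a simple structured record. This is only an illustration: the field names, the placeholder values and the helper `missing_questions` are assumptions made for this example and do not follow any particular metadata standard.

```python
import json

# Hypothetical field names and placeholder values, for illustration only.
# A real project would follow the metadata standard of its discipline or repository.
metadata_record = {
    "who": {"creator": "<name of researcher>", "access": "<who may access the data>"},
    "why": "<purpose for which the data was created>",
    "what": {"content": "<content and structure>", "changes": "<changes made to the data>"},
    "when": {"collected": "<YYYY-MM-DD>", "published": "<YYYY-MM-DD>"},
    "where": {"geographic_location": "<place>", "held_at": "<storage location>"},
    "how": "<method used to produce the data>",
}

REQUIRED_QUESTIONS = ("who", "why", "what", "when", "where", "how")


def missing_questions(record: dict) -> list[str]:
    """Return the basic questions that the record leaves unanswered."""
    return [question for question in REQUIRED_QUESTIONS if not record.get(question)]


if __name__ == "__main__":
    # Print the record as JSON so it can be stored alongside the data files.
    print(json.dumps(metadata_record, indent=2))
    print("Unanswered questions:", missing_questions(metadata_record))
```

Keeping such a record in a plain-text file next to the data is one way to make sure the answers travel with the dataset.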
Considerations for choosing a standard
• The discipline, domain
• The format of the data
• Repository or funder requirements
• Recognition and/or certification of standard
• Controlled vocabularies, thesauri, and
authorities
• Available metadata tools
• Skills required and time available
Metadata creation tools
• http://www.earthchem.org/data/templates
Metadata creation tools
Knowledge Network for Biocomplexity
Morpho User Guide
https://knb.ecoinformatics.org/software/morpho/MorphoUserGuide.pdf
Organising Data & File Formats
• File structure
• Folder structure
• File and folder naming conventions
• Versioning
• File formats
– Choose platform and vendor-independent file formats to ensure the best chance for future compatibility
• File transformation
Organising Data: Good File Management
• Research data files and folders need to be labelled and organised in a
systematic way so that they are both identifiable and accessible for current and
future users. The benefits of consistent data file labelling are:
Data files are distinguishable from each other within their containing folder
Data file naming prevents confusion when multiple people are working on shared files
Data files are easier to locate and browse
Data files can be retrieved not only by the creator but by other users
Data files can be sorted in logical sequence
Data files are not accidentally overwritten or deleted
Different versions of data files can be identified
If data files are moved to another storage platform, their names will retain useful context
http://www.youtube.com/watch?v=Z_ysxiAGKC8&feature=player_embedded
• Use a combination of
different types of
information to make
the context and
content of a file
clear, e.g.
– Data source
– Measured variable
– Experiment
– Date
“AHRC_TechnicalApp_Response20120925.docx” rather than:
“what we got back from funders about the data stuff.docx”
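A naming convention like the one above can be applied consistently with a small helper. The sketch below is one possible approach, not a prescribed method: the function name `build_filename` and the rule that each part may contain only letters, digits and hyphens are assumptions; the underscore separators and the YYYYMMDD date follow the first example above.

```python
import re
from datetime import date

# Assumed rule: each name part may contain only letters, digits and hyphens.
_SAFE_PART = re.compile(r"^[A-Za-z0-9-]+$")


def build_filename(parts: list[str], when: date, extension: str) -> str:
    """Join descriptive parts with underscores and append a YYYYMMDD date."""
    if not parts:
        raise ValueError("at least one descriptive part is required")
    for part in parts:
        if not _SAFE_PART.match(part):
            raise ValueError(f"unsafe characters in file name part: {part!r}")
    # A YYYYMMDD date makes files sort chronologically when sorted by name.
    return "_".join(parts) + when.strftime("%Y%m%d") + "." + extension.lstrip(".")


if __name__ == "__main__":
    print(build_filename(["AHRC", "TechnicalApp", "Response"], date(2012, 9, 25), "docx"))
```

Rejecting spaces and punctuation keeps names portable between operating systems and storage platforms.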
Data storage and backup
• Estimated size of data; growth
rate
• Where (physically) will you
store the data? Server,
pc/laptop, external storage
device…geographically
distributed…
• On what media will you store
the data?
• Whose responsibility is the
storage of the data?
• How will you transmit the data,
if required?
• How is your data backed up?
• How often is your data backed
up?
• Who is responsible for this?
• Avoid single points of failure
– Use managed networked storage whenever possible
– Move data off portable media
– Make multiple backups: Lots of Copies Keep Stuff Safe (LOCKSS)
– Be wary of software lifespans
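The advice to make multiple backups can be illustrated with a short script that copies a file to several locations and checks each copy against a checksum. This is a minimal sketch using only the Python standard library; the function names `sha256_of` and `backup_to_many` are assumptions for this example, and it does not replace managed networked storage or an institutional backup service.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_to_many(source: Path, destination_dirs: list[Path]) -> list[Path]:
    """Copy source into every destination directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for directory in destination_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        # copy2 also tries to preserve timestamps and other file metadata.
        target = Path(shutil.copy2(source, directory / source.name))
        if sha256_of(target) != expected:
            raise IOError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies
```

For the copies to protect against loss, the destination directories should be on different devices or in different locations, for example a network drive and an external disk.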
Data security
• How will you ensure the security of your
data?
– How will data be shared during the project?
– How will you organise access to sensitive data?
– How will you enforce permissions, restrictions and embargoes?
– Other security issues e.g. damage, theft
• Information Security Risk Assessment for UCD Research Groups online survey
– https://docs.google.com/a/ucd.ie/spreadsheet/viewform?formkey=dGV3QUF4UGxTaTkweDFJWlhiU1g2VVE6MA#gid=0
Access management: Ethics and IP
• Are there any ethical or privacy issues that may
prohibit the sharing of some or all of the dataset/s?
• If so what possible ways might there be to resolve
these? (E.g. referral to UCD’s Ethics Committee;
anonymisation of data; formal consent
agreements; different levels of access to data, e.g.
research purposes only, no commercial)
• Who owns the copyright and other intellectual
property?
Data sharing: drivers and benefits
• Facilitating research and discovery
• Scientific integrity
• Funders and government
• Journal publishers
• Recognition and impact
• Collaboration
• Funding application advantage
Facilitating research and discovery
• Cucumbers, E-coli
and open data: The
2011 outbreak of E. coli
poisoning in Germany
illustrated the changes in
attitudes to sharing scientific
research and data; within
weeks of the outbreak, the
genome of the bacteria was
identified, and given the
seriousness of the outbreak,
the results were published on
the Internet as soon as they
were available.
Facilitating research and discovery
• “As research becomes more data intensive,
research datasets increase in number and size.
Re-using (combinations of) research datasets
produced by researchers in the same discipline or
from different disciplines brings about novel
approaches, such as data exploration, simulation
and modelling, system level science, and
transdisciplinary research”.
• Van der Graaf, M. and Waaijers, L. (2011). A Surfboard for Riding the Wave. Towards a
four country action programme on research data. A Knowledge Exchange Report, available
from www.knowledge-exchange.info/surfboard
Scientific integrity
• Publishing research data and citing its location in
published research papers allows others to replicate,
validate or build upon your results thus improving
the scientific record by encouraging scientific
enquiry and debate.
• Openly sharing research data also encourages the
improvement and validation of research
methods and minimises the need for data re-
collection.
• Verify results; uncover errors
• Example: a study published in a prestigious journal, the American Economic Review, was found to contain errors and to exclude some data, which significantly undermined its results.
• The journal had failed to enforce its own data availability policy.
Funders and Government
• NERC (Natural Environment Research Council)
expects everyone that it funds to manage the data
they produce in an effective manner for the lifetime
of their project, and for these data to be made
available for others to use with as few restrictions as
possible, and in a timely manner.
• To protect the research process NERC will allow those
who undertake NERC-funded work a period of time to
work exclusively on, and publish the results of, the
data they have collected. This period will normally be
a maximum of two years from the end of data
collection.
• UK Funders’ Data Policies: www.dcc.ac.uk/resources/policy-and-
legal/funders-data-policies
Funders and Government
• Research data in general should be deposited
whenever this is possible, and linked to associated
publications where this is appropriate. It should be
made openly accessible, in keeping with best practice
for reproducibility of scientific results.
– European and national data protection rules must be taken into account in relation to research data, as well as concerns regarding trade secrets and intellectual property rights, confidentiality, or national security.
– At a minimum, metadata describing research data and its location and access rights should be deposited.
– It is recognized that managing access to research data may be a new approach for many research organisations. This policy is intended to encourage the improvement of discoverability and development of open access to research data over time.
National Advisory Committee on Drugs and
Alcohol: NACDA
Research Data Management Policy
• Specifies:
– Copyright (owned by NACDA)
– Data quality
– Provision of supporting material
– Data security
– Confidentiality & protection of personal data by data anonymisation
– Data sharing
– Deposit to a data archive / repository
– Informed consent must not preclude data sharing beyond the original research
– Data management plan requirement
Funders and Government
• Minister Howlin said “The public service needs to share data
to deliver services, and more data-sharing will be necessary
to deliver the joined-up services we aspire to. At the same
time data protection and privacy are concerns for all of us.
Today we agreed that a new legal framework is required to
enable the public service deliver the next generation of
services both effectively and securely.”
Journal Publishers
• An increasing number of journal publishers require the
sharing of associated data
• DRYAD (www.datadryad.org/) – an international
repository that manages the research data
underpinning peer-reviewed articles in the
biosciences.
– Landscape of public data archives is patchy; Dryad fills a gap http://www.youtube.com/watch?v=RP33cl8tL28
• Figshare (http://figshare.com/) – repository where
users can make all of their research outputs available
in a citable, shareable and discoverable manner.
Partnering with Taylor & Francis to host the
supplemental data to T&F published papers.
Recognition and Impact
• Others who re-use data and cite it in
their own research help to spread the
word about the research and increase its
impact
• Increased citation rates
• Piwowar H, Vision TJ. (2013) Data reuse and the
open data citation advantage. PeerJ PrePrints
1:e1v1
http://dx.doi.org/10.7287/peerj.preprints.1v1
Collaboration
• Data sharing may lead to new collaborations between
data users and data creators. Sharing data can often
lead to improvements such as corrections in the
documentation, or combination or comparison of
datasets leading to new information.
• Collaboration drives more research
Collaboration
• NASA Landsat satellite imagery of Earth surface environment,
collected over the last 40 years was sold through the US
Geological Survey for US$600 per scene until 2008, when it
became freely available from the Survey over the internet.
• Usage leapt from sales of 19,000 scenes per year, to
transmission of 2,100,000 scenes per year. Google Earth now
uses the images.
• There has been great scientific benefit, not least to the
Geological Survey, which has seen a huge increase in its
influence and its involvement in international collaboration.
Funding application advantage
• Professor Derek Offord, Department of Russian, was
awarded £800,000 from the Arts and Humanities Research
Council to conduct the first large-scale history of the
French language in Russia. However, upon initial
submission of the funding proposal, peer reviewers
criticised data sharing plans and suggested that the
application’s Technical Appendix should be rewritten.
• While the intellectual excellence of the proposal was not in
doubt, without resubmission with appropriate changes to
data sharing plans, the application would not have been
successful.
– Data management and data sharing plans are important: they offer a competitive advantage
• http://data.bris.ac.uk/files/2013/06/data-bris-benefits-report-V2.pdf
Barriers to sharing
• A huge amount of data ends up unpublished,
unshared and essentially wasted – another form
of data loss particularly for datasets that have
clear scope for wider research use, decision-
making, policy making and hold significant long-
term value
• Tension between the pressure to make data
more open earlier on and the real fear that
researchers have that if they do that others will
reap the benefits from the hard work they’ve
done
• Culture of “my” data
Why might public access to your
research data be restricted?
• “We intend to make a patent application, and
must avoid prior disclosure.”
• “Don’t want to make locations of members of
endangered species available to poachers.”
• “The research data are confidential because of
the arrangement my research group has made
with the commercial partner sponsoring our
research.”
• “My data form part of a long-term study upon
which my research group is entirely reliant for its
on-going research publications and academic
reputation. We only share this with trusted
colleagues.”
Examples of Repositories
• Earthchem www.earthchem.org/
– geosciences, with particular emphasis on geochemical, geochronological, and petrological data
• KNB http://knb.ecoinformatics.org
– biosciences, ecology, evolutionary biology
• GenBank http://www.ncbi.nlm.nih.gov/genbank/
– DNA sequence data
• Machine Learning Data Repository http://archive.ics.uci.edu/ml/datasets.html
– a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms
• Irish Social Science Data Archive www.ucd.ie/issda
Advantages of a repository
• Provides a metadata structure / metadata form
for you to fill in
• Publishes the data for you by giving your dataset
a unique identifier, e.g. DOI
• Serves as a backup vehicle for your data
• May preserve your data for the future
• Makes sharing your data easy
• Others may cite your research more
Locating relevant datasets using
“portals”
• Databib
– http://databib.org/
• Registry of Research Data Repositories
– http://www.re3data.org/
• CalPoly’s LibGuide
– http://libguides.calpoly.edu/content.php?pid=277668&sid=2288020
Using Google to locate data
• Astronomy dataset OR "data archive" OR "data portal"
• hydrogeology OR groundwater dataset OR "data archive" OR "data portal"
• migration dataset OR "data archive" OR "data portal"
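The search pattern used in the examples above can be generated for any topic. The helper below is a small sketch; the function name `build_dataset_query` is an assumption for this example, and the resulting string is meant to be pasted into the search box.

```python
def build_dataset_query(topic: str) -> str:
    """Build a search string following the pattern: topic, then dataset terms joined by OR."""
    return f'{topic} dataset OR "data archive" OR "data portal"'


if __name__ == "__main__":
    for topic in ("astronomy", "hydrogeology OR groundwater", "migration"):
        print(build_dataset_query(topic))
```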
Help @ UCD
• Data storage, backup and security
– UCD’s Research IT support team is available to discuss the options available to you regarding data storage or any of your IT requirements. Contact [email protected] or [email protected]
• www.ucd.ie/itservices/researchit/
• Intellectual property
– For queries regarding intellectual property and support for researchers interested in commercialisation please contact Caroline Gill, Innovation Education Manager [email protected]
• www.ucd.ie/innovation/researchers/
• Research ethics
– Research Ethics Administrator. One-to-one consultations with researchers who are about to submit for either a full review or exemption. Contact Jan Stokes, Research Ethics Administrator [email protected]
• www.ucd.ie/researchethics/
• Research data management checklist
– Some assistance can be given by UCD Library. Contact Julia Barrett, Research Services Manager, UCD Library [email protected]
• www.ucd.ie/library/supporting_you/research_support/data_management/