Preparing Data for Sharing: The FAIR Principles



Gareth Knight

London School of Hygiene & Tropical Medicine

gareth.knight@lshtm.ac.uk

ADMIT Network Meeting

01 December 2015

FAIR Principles

Findable

• Descriptive metadata

• Persistent Identifiers

Accessible

• Determining what to share

• Participant consent and risk management

• Access status

Interoperable

• XML standards

• Data Documentation Initiative

• CDISC

Reusable

• Rights and licence models

• Permitted and non-permitted use

http://datafairport.org/

Make your data:

• Findable

• Accessible

• Interoperable

• Reusable

Data Sharing in the sciences

• Data sharing has always taken place in some form

• The Enlightenment of the 17th and 18th centuries was built upon open debate and sharing of knowledge

• Science depends on openness and transparency to advance

– Replicate results

– Correct errors & address bias

• Negative as well as positive findings need to be in the public domain

“Systematic Dictionary of the Sciences, Arts, and Crafts”, Diderot & d'Alembert (1751 onwards)

Data Sharing in the News

“To make progress in science, we need to be open and share.” Neelie Kroes (2012), Vice-President of the European Commission

http://europa.eu/rapid/press-release_SPEECH-12-258_en.htm


Key Motivators

• Research / policy development

• Ensure validity

• Funder requirements

• Publisher requirements

Data reuse improves citation rate

• Studies that made data available in a public repository received 9% more citations than similar studies where data was not available

• Creators tend to cite their own data for up to 2 years

• Third party use grew over time: for 100 datasets deposited in year 0,

– 40 reuse papers in PubMed in year 2

– 100 by year 4

– 150+ by year 5.

Piwowar, H.A. & Vision, T.J. (2013). Data reuse and the open data citation advantage. https://peerj.com/articles/175/

Study of 10,557 articles published between 2001 and 2009 that collected gene expression microarray data

Plan for Sharing

Data Management Plan

• Data to be produced

• Management approach

• Sharing approach

– In what form?

– When will it take place?

– How will it be shared?

Stages of the data workflow:

• Planning

• Data Collection

• Database Setup

• Data Capture

• Data Processing & curation

• Archiving & sharing

https://globalhealthdatamanagement.tghn.org/data-dudes/tools-templates/

DATA DISCOVERY

Is your data findable?

Discovery Metadata

• Descriptive metadata created to describe key attributes of data:

– Title

– Creator

– Content description

• Data repositories/journals capture and publish discovery metadata in several formats (DC, DataCite, DDI)

• Metadata ‘harvested’ by research data catalogues & search engines

• Metadata available to all, even if data is not
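The three attributes listed above (title, creator, content description) map onto Dublin Core elements. A minimal sketch of such a record, assuming a bare `record` wrapper element and made-up example values; a real repository would add identifiers, dates, rights and more, and would follow its own schema:

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_discovery_record(title, creators, description):
    """Return a simple Dublin Core style record as an XML string."""
    record = ET.Element("record")  # wrapper name is an assumption
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    for name in creators:
        ET.SubElement(record, f"{{{DC_NS}}}creator").text = name
    ET.SubElement(record, f"{{{DC_NS}}}description").text = description
    return ET.tostring(record, encoding="unicode")


# Hypothetical values, for illustration only
xml_record = build_discovery_record(
    title="Example household survey dataset",
    creators=["Researcher, A."],
    description="Anonymised responses from an illustrative survey.",
)
print(xml_record)
```

A record like this can be published and harvested even when the data files themselves are under controlled access.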

Registry of Research Data Repositories: http://service.re3data.org


Citing Data

• Research data are a citable resource, same as papers & books

• 44-75 days is the estimated average lifespan of web URLs

• A unique, long-term identifier is necessary to enable citation

• Many persistent ID systems have been developed to solve this problem

– DOI, Handle, ARK, etc.

• Data citation in reports and publications
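As an illustration of how a persistent identifier fits into a data citation, here is a small sketch that assembles a citation string ending in a resolvable DOI link. The element order (creators, year, title, publisher, identifier) is one common pattern and an assumption here; repositories and journals prescribe their own styles. The names and the DOI are made up.

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Assemble a data citation that ends with a resolvable DOI URL."""
    authors = "; ".join(creators)
    # https://doi.org/ is the DOI resolver; the DOI itself is the persistent part
    return f"{authors} ({year}). {title} [Data set]. {publisher}. https://doi.org/{doi}"


# Hypothetical dataset details, for illustration only
citation = format_data_citation(
    creators=["Researcher, A.", "Analyst, B."],
    year=2015,
    title="Example household survey dataset",
    publisher="Example Data Repository",
    doi="10.1234/example-doi",
)
print(citation)
```

Because the DOI is resolved by a registry rather than pointing at a fixed web address, the citation keeps working if the repository reorganises its website.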

UK Data Service: Citing Data, https://www.ukdataservice.ac.uk/use-data/citing-data


DATA ACCESS

Do you have permission to share? If so, what?

Data Selection

Motivation

• Meet funder / journal obligations

• Encourage research use

• Higher citation rate

• Reproduce & validate results

Constraints

• Concern that sharing will attract a lower response rate or that people will be less honest

• Intellectual Property Rights issues

• Participant consent doesn’t address sharing

• Data Protection legislation

Data sharing decisions built upon recognition of all influencing factors

Information Commissioner's Office. Data Sharing Code of Practice, http://www.ico.org.uk/for_organisations/data_protection/topic_guides/data_sharing/


Handling individual level data

• Collected and analysed for specific purpose

• Stored no longer than is necessary

• Kept securely and safely to prevent unauthorised or unlawful access, processing, loss, or destruction

EU Data Protection Directive 95/46/EC establishes limitations on how information on living individuals is held and used

Reform of the data protection legal framework in the EU: http://ec.europa.eu/justice/data-protection/reform/index_en.htm

Informed Consent

Covered data:

• Variables

• Anonymised / identifiable

Allowed activities:

• Use in current project, e.g. topics

• Preserve and archive with 3rd party

• Future research – access & use

Communication method:

• Information Sheet

• F2f discussion

Time period for decision:

• Prior to capture

• Following capture & review

https://globalhealthtrainingcentre.tghn.org/articles/informed-consent/

http://retractionwatch.com/2014/02/05/journal-and-authors-apologize-unreservedly-for-distress-caused-to-deceased-childs-family-by-case-report/

Data Sharing as a barrier

Investigation of influence of open data policies on consent rate:

• No participants declined to participate, regardless of condition

• Rates of drop-out vs completion did not vary between open/non-open policies

• No significant change in potential consent rates when participants openly asked about the influence of open data policies on their likelihood of consent.

Some researchers consider sharing obligations to be a barrier to research participation

Risk Management

Assess likelihood that data can be used to:

• Identify a person directly

• Infer information about a person

• Link records relating to person to other info

Determine action to address issue:

• Randomisation - noise addition, permutation

• Generalisation - aggregating results, limiting geographic details

• Pseudonymisation - hash functions
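The three kinds of action above can be sketched in a few lines. This is an illustration under stated assumptions, not a vetted anonymisation routine: the keyed hash (HMAC-SHA256), the 10-year age bands, the uniform noise and all field names are choices made for the example, and the secret key is hypothetical.

```python
import hashlib
import hmac
import random


def pseudonymise(identifier, secret_key):
    """Pseudonymisation: replace an ID with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def generalise_age(age, band_width=10):
    """Generalisation: report an age band instead of the exact age."""
    lower = (age // band_width) * band_width
    return f"{lower}-{lower + band_width - 1}"


def add_noise(value, scale, rng):
    """Randomisation (noise addition): shift a value by up to +/- scale."""
    return value + rng.uniform(-scale, scale)


def permute(values, rng):
    """Randomisation (permutation): shuffle a column so rows no longer line up."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(42)           # fixed seed only to make the example repeatable
key = b"project-secret-key"       # hypothetical; keep the real key out of the dataset
record = {"participant_id": "P-0001", "age": 37, "weight_kg": 71.4}

safe = {
    "pseudonym": pseudonymise(record["participant_id"], key),
    "age_band": generalise_age(record["age"]),
    "weight_kg": round(add_noise(record["weight_kg"], 0.5, rng), 1),
}
print(safe)
```

A keyed hash is used rather than a plain one because a plain hash of a short identifier can be reversed by hashing every possible value. Whether any of these steps is sufficient depends on what other data an attacker could link to.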

Is there a risk of sharing personal or sensitive information?

UK Information Commissioner's Office: Anonymisation Code of Practice, http://www.ico.org.uk/for_organisations/data_protection/topic_guides/anonymisation


https://www.flickr.com/photos/estherase/2190068148

When anonymisation goes wrong

New York City Taxi & Limousine Commission released an anonymised 20 GB file on 173 million journeys under FOI

Drivers' Hack License & Medallion numbers were re-generated, identifying drivers' annual income

Identify home address and destinations of residents

Identify journeys made by celebrities?

http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

http://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/
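The identifiers in this release were reportedly replaced with unsalted MD5 hashes. The sketch below shows why that is weak when the space of possible identifiers is small: every candidate can be hashed in advance and looked up. The 5-digit numeric ID format is an assumption chosen to keep the example fast, not the real licence or medallion format.

```python
import hashlib


def md5_hex(text):
    """Unsalted MD5 of a string, as a hex digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_lookup(digits):
    """Hash every possible numeric ID of the given length (hypothetical format)."""
    lookup = {}
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"   # zero-padded, e.g. "00042"
        lookup[md5_hex(candidate)] = candidate
    return lookup


lookup = build_lookup(digits=5)         # 100,000 candidates

# A "published" hash of a made-up ID is reversed by a dictionary lookup
published_hash = md5_hex("04821")
recovered = lookup[published_hash]
print(recovered)
```

A secret key or a random per-record pseudonym removes this shortcut, because an outsider can no longer recompute the mapping.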

Access Status

Control method

• Data Transfer Agreement

• Access controls

Application process:

• Request form

• Review process

Access criteria:

• Permitted users – how do you identify?

• Permitted use – topic, academic use,

• Other criteria: encryption, time period

Open Vs. controlled access

https://www.flickr.com/photos/toruokada/16958186672/

DATA INTEROPERABILITY

Can data be analysed and harmonized?

Data Standards

Data exchange is dependent upon:

• Open formats

• Common standards

• Documented metadata specification

• Consistent vocabulary

• Documented workflows

https://biosharing.org/

Clinical Data Interchange Standards Consortium

Standards intended to improve consistency across the clinical trial lifecycle

• Protocol: Protocol Representation Model

• Data Collection: Clinical Data Acquisition Standards Harmonization (CDASH)

• Data Tabulation: Study Data Tabulation Model (SDTM)

• Data Analysis: Analysis Data Model (ADaM)

• Archiving and exchange: Operational Data Model (ODM) and Define-XML

Data Documentation Initiative

• Maintained & developed by DDI Alliance

• Supported by data archives, producers, research data centers, university data libraries, statistics organizations, etc.

• Two versions:

– DDI2 / Codebook: An archived instance of a study

– DDI3 / DDI Lifecycle: Suitable for longitudinal and repeated surveys

An XML-based metadata standard developed for social science and economic statistics

http://www.ddialliance.org/

Survey Data Model

[Diagram] A Study measures Concepts about Universes, using Survey Instruments made up of Questions, which collect Responses, resulting in Data Files comprised of Variables with values of Categories/Codes or Numbers.

Slide source: https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.33/2011/mtg2/WP_1_Arofan.ppt

DDI Codebook

A codeBook consists of:

1. docDscr: describes the DDI document

2. stdyDscr: Title, abstract, methodologies, agencies, access policy

3. fileDscr: a description of files in the dataset

4. dataDscr: variables (name, code, etc.), variable groups, cubes

5. othMat: other related materials, e.g. document citation

3 levels - Study, dataset, variable

Preserves the collection of files associated with an archival copy of a survey
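The five codeBook sections listed above can be sketched as an XML skeleton. This is a simplified illustration, not a schema-valid DDI document: it omits the namespace and most required content, the child element names (`citation`, `titlStmt`, `titl`, `fileTxt`, `fileName`, `var`, `labl`) should be checked against the DDI Codebook schema before use, and all values are made up.

```python
import xml.etree.ElementTree as ET


def build_codebook_skeleton(study_title, file_name, variables):
    """Build a bare-bones codeBook tree with the five top-level sections."""
    code_book = ET.Element("codeBook")

    # 1. docDscr: describes the DDI document itself
    doc_titl = ET.SubElement(
        ET.SubElement(ET.SubElement(ET.SubElement(code_book, "docDscr"), "citation"), "titlStmt"),
        "titl",
    )
    doc_titl.text = f"DDI documentation for: {study_title}"

    # 2. stdyDscr: study level (title, abstract, methodology, access policy)
    stdy_titl = ET.SubElement(
        ET.SubElement(ET.SubElement(ET.SubElement(code_book, "stdyDscr"), "citation"), "titlStmt"),
        "titl",
    )
    stdy_titl.text = study_title

    # 3. fileDscr: dataset (file) level
    file_txt = ET.SubElement(ET.SubElement(code_book, "fileDscr"), "fileTxt")
    ET.SubElement(file_txt, "fileName").text = file_name

    # 4. dataDscr: variable level
    data_dscr = ET.SubElement(code_book, "dataDscr")
    for name, label in variables:
        var = ET.SubElement(data_dscr, "var", {"name": name})
        ET.SubElement(var, "labl").text = label

    # 5. othMat: other related materials
    ET.SubElement(ET.SubElement(code_book, "othMat"), "labl").text = "Questionnaire (PDF)"
    return code_book


# Hypothetical study, file and variables
code_book = build_codebook_skeleton(
    "Example household survey",
    "survey_wave1.csv",
    [("V101", "Age at last birthday"), ("V102", "Sex of respondent")],
)
print(ET.tostring(code_book, encoding="unicode"))
```

The three levels named on the slide show up directly: `stdyDscr` for the study, `fileDscr` for the dataset and `var` elements for the variables.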

DDI Lifecycle

http://www.ddialliance.org/what

• Data collector

• Data analyst

• Data curator

• Secondary user

Each stage may be performed by different groups

DDI Metadata reuse

Basic metadata can be reused during study life:

• Concepts, questions, responses, variables, categories, codes, survey instruments, etc. may be adopted from earlier waves

Referencing earlier iterations:

• Unique identifier

• Version number - control over time

Common metadata ‘groups’ maintained by specific agencies:

• Schemes: lists of items of a single type

• Modules: metadata for a specific purpose or lifecycle stage

• All maintainable metadata has a known owner or agency

Unique ID example

urn="urn:ddi:3_0:VariableScheme.Variable=pop.umn.edu:STUDY0145_VarSch01(1_0).V101(1_1)"

• urn: this is a URN

• ddi:3_0: from DDI Version 3.0

• VariableScheme.Variable: for a variable

• pop.umn.edu: the scheme agency

• STUDY0145_VarSch01: the scheme identifier

• (1_0): scheme version 1.0

• V101: the variable ID

• (1_1): variable version 1.1

http://www.iza.org/conference_files/eddi09/ppt/thomas_wendy_course.pdf
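To make the parts of the example URN concrete, the sketch below splits it with a regular expression. The pattern is written only for the shape of this one DDI 3.0 example and is an assumption, not an implementation of the DDI URN specification; the field names are invented for the illustration.

```python
import re

# Pattern for URNs shaped like the slide's example only (assumption)
URN_PATTERN = re.compile(
    r"^urn:ddi:(?P<ddi_version>[^:]+):"
    r"(?P<scheme_type>\w+)\.(?P<item_type>\w+)="
    r"(?P<agency>[^:]+):"
    r"(?P<scheme_id>[^()]+)\((?P<scheme_version>[^()]+)\)\."
    r"(?P<item_id>[^()]+)\((?P<item_version>[^()]+)\)$"
)


def parse_ddi_urn(urn):
    """Split an example-style DDI 3.0 URN into named parts."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"Not in the expected form: {urn}")
    parts = match.groupdict()
    # Versions are written with underscores in the URN (1_0 means 1.0)
    for key in ("ddi_version", "scheme_version", "item_version"):
        parts[key] = parts[key].replace("_", ".")
    return parts


parts = parse_ddi_urn(
    "urn:ddi:3_0:VariableScheme.Variable=pop.umn.edu:STUDY0145_VarSch01(1_0).V101(1_1)"
)
for name, value in parts.items():
    print(f"{name}: {value}")
```

Agency, identifier and version together are what allow a later wave to reference exactly the variable definition it reuses.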

DDI Cross-study comparison

Variables are comparable if they possess same properties:

• Age is comparable if it has:

– Same concept (e.g., age at last birthday)

– Same top-level universe (people)

– Same representation (i.e., an integer from 0-99)

DDI Comparison module:

• Place similar items in same group and perform tailored comparison

• Mappings are context-dependent, i.e. sufficient for purposes of particular research
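The comparability rule for variables can be expressed as a simple check on the three properties named above. This is a sketch of the idea only, assuming plain string descriptions for each property; it is not how the DDI Comparison module is implemented, and the variable names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    name: str
    concept: str          # e.g. "age at last birthday"
    universe: str         # e.g. "people"
    representation: str   # e.g. "integer 0-99"


def are_comparable(a, b):
    """Strict rule: same concept, same universe and same representation."""
    return (a.concept, a.universe, a.representation) == (
        b.concept,
        b.universe,
        b.representation,
    )


# Hypothetical variables from two studies
age_study_a = Variable("AGE", "age at last birthday", "people", "integer 0-99")
age_study_b = Variable("Q12_AGE", "age at last birthday", "people", "integer 0-99")
age_banded = Variable("AGEGRP", "age at last birthday", "people", "10-year bands")

print(are_comparable(age_study_a, age_study_b))  # names differ, properties match
print(are_comparable(age_study_a, age_banded))   # representation differs
```

Where the strict rule fails, as with the banded variable, a context-dependent mapping (for example, recoding exact ages into bands) may still be sufficient for a particular analysis.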

DDI Tools

DDI Codebook:

• Nesstar Publisher & Server

• IHSN Microdata Management Toolkit

• Colectica

• NADA

• UKDA - DExT, ODaF DeXtris

DDI Lifecycle

• Colectica Designer, Colectica for Excel, Portal

• Sledgehammer

DDI Tools: http://www.ddialliance.org/resources/tools


DATA REUSE

Can data be used for further research?

Data Rights

• Many rights apply to data

– Copyright

– Moral

– Database

– Patents & trade secrets

• Rights issues vary between countries

• Ensure your project has clarified rights issues before sharing

https://www.flickr.com/photos/riekhavoc/4813140176/

Rights issues influence how data can be shared, used and cited

Data Licence Models

Many licence models exist, which can be applied at different granularity

• Creative Commons

• Open Data Commons

• GNU GPL, BSD and others for software

Do you have a standard Data Sharing Agreement within your institution?

A data licence outlines permitted & prohibited use

What secondary use is allowed?

http://www.bbc.co.uk/news/uk-scotland-tayside-central-14744240

http://www.theguardian.com/society/2011/sep/01/cigarette-university-smoking-research-information

FAIR data

Findable

• Describe your data in a data repository

• Apply a persistent identifier

Accessible

• Consider what will be shared

• Obtain participant consent & perform risk management

Interoperable

• Use open formats

• Consistent vocabulary

• Common metadata standards

Reusable

• Consider permitted use

• Apply appropriate licence

Thank You for your attention!

Questions
