Data management: Responsible Conduct of Research Seminar Series, UC Berkeley, April 16, 2012

Scientific data management


Page 1: Scientific data management

Data management

Responsible Conduct of Research Seminar Series

UC Berkeley

April 16, 2012

Page 2: Scientific data management

Who are you?

Jeffery Loo, PhD

“Flying books”, installation by J. Ignacio Diaz de Rabago

UC Berkeley Library

Page 3: Scientific data management

NSF data management plan

Requirement as of January 18, 2011

Your plans to organize, store, and share data

http://www.nsf.gov/bfa/dias/policy/dmp.jsp

Page 4: Scientific data management

“My Data Management Plan – a satire”

Dr. C. Titus Brown, Assistant Professor, Michigan State University

Source

Page 5: Scientific data management

Dear NSF,

I am happy to respond to your request for a 2-page Data Management Plan.

First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993, […] I have seen the need for a systematic plan to manage the data. It is nice to see NSF stepping up to the plate in such a timely manner, and I am happy to comply.

Now, as to my actual data management plan, here is how I plan to deal with research data in the future.

I will store all data on at least one, and possibly up to 50, hard drives in my lab.

The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.

Page 6: Scientific data management

Backups will rarely, if ever, be done.

When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.

[….]

Note, we didn't use a version control system, either. […] And our repository is not publicly available - you have to beg for permission. Note, I only answer e-mail on every other Tuesday.

Page 7: Scientific data management

Any design notes on the data analysis are in our private e-mail, and we will fight to the death -- up to and including ignoring FOIA requests -- to prevent you from obtaining them.

Meanwhile we will continue publishing exciting sounding (but irreproducible) analyses, and submitting grants based on them, because that's the only thing that the reviewers care about.

sincerely yours,

--titus

(representing every computational scientist in the world.)

Page 8: Scientific data management

Data challenges

Distributed, uncoordinated effort

Concerns about data re-use

Data management may be ad lib

“Can’t you ever relax?”

Informal data management practices

Page 9: Scientific data management

Lots to do!

Ensure long-term access

Facilitate sharing

Prepare for future re-use

Page 10: Scientific data management

Data activities in the research workflow

Source: http://www2.lib.virginia.edu/brown/data/lifecycle.html

Page 11: Scientific data management

Lots of different research products

• Models and computational simulations
• Images, photographs, audio, and video
• Instrument readings
• Maps
• Software
• Artifacts and samples
• Physical collections
• And more …

Page 12: Scientific data management

Goal for this lunch hour

Review “first steps” in data management

• Saving data
• Describing/documenting data
• Sharing data
• Data management planning
• Data ethics

Page 13: Scientific data management

Common sense versus common practice

Page 14: Scientific data management

Saving data

Page 15: Scientific data management
Page 16: Scientific data management

Hall of fame anecdote

http://www.youtube.com/watch?v=J6HtRWyiL98

Page 17: Scientific data management

Where do you store data safely?

Traditional storage not always sufficient
• Personal computers
• Departmental/university servers

Two additional types of storage
• Archives and repositories
• Cloud storage (storing files in an online site)

Page 18: Scientific data management

Archives and repositories

Special types of online storage sites

Long-term storage, management, and preservation

Search, download, and analytic functionalities

Page 19: Scientific data management

Institutional archives and repositories

Merritt: http://merritt.cdlib.org/

Data repository management services at UCB: http://ist.berkeley.edu/ds

Page 20: Scientific data management

Public archive and repository

Long-term access, open to the public

GenBank: http://www.ncbi.nlm.nih.gov/genbank/

Page 21: Scientific data management

3rd party cloud storage

• Amazon S3
• Google Docs
• Dropbox

Beware of posting sensitive data/files

Page 22: Scientific data management

Deciding on storage

Consider:
• Permanence
• Oversight
• Security

Page 23: Scientific data management

Save for long-term access

Recommended file formats
• Non-proprietary
• Uncompressed and unencrypted (okay to encrypt sensitive data)
• Common usage by your research community
• Standard representation (e.g., ASCII text, Unicode)

Page 24: Scientific data management

Backup 3 copies

1. Original master
2. Local external storage
3. Remote external storage
• UC Berkeley IST backup services
• 3rd party services (Amazon S3, Elephant Drive, Jungle Disk, Mozy, Carbonite Free, Dropbox)
• Email a copy to yourself
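One way to keep three copies honest is to verify each backup against the original master. The sketch below is only an illustration: the folder names, the file name, and the choice of SHA-256 checksums are assumptions, not part of the seminar. Temporary folders stand in for the local and remote drives.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(master: Path, folders: list[Path]) -> list[Path]:
    """Copy the master file into each folder and verify every copy."""
    expected = sha256_of(master)
    copies = []
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(master, folder / master.name))
        if sha256_of(copy) != expected:
            raise OSError(f"Checksum mismatch for {copy}")
        copies.append(copy)
    return copies


if __name__ == "__main__":
    # Hypothetical layout: temporary folders stand in for real drives.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        master = root / "readings.csv"
        master.write_text("trial,temperature\n1,75\n", encoding="utf-8")
        copies = back_up(master, [root / "local_drive", root / "remote_drive"])
        print(f"1 master + {len(copies)} verified copies")
```

In practice the second folder would point at a mounted remote or synced location; the checksum comparison is what catches a silently corrupted copy.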

Page 25: Scientific data management

Describing and documenting data (metadata)

Page 26: Scientific data management

What countries have a five-pointed star on their national flag?

Page 27: Scientific data management

DOI: 10.1126/science.1207745

“outsourcing” our memory

“we don’t remember information as well, when we expect to find it on a computer later”

Page 28: Scientific data management

If we outsource our memory to computers …

We need good organization structures to
• Find data from the past quickly and completely
• Understand data from the past

It helps to
• Document and describe data
• “Assign metadata”

Page 29: Scientific data management

What do you document?

Descriptive metadata elements

Administrative metadata elements

Structural metadata elements

• Title
• Creator or contact
• Date
• Experimental conditions
• Methodology
• Version

Dictionary or codebook to explain the data variables

Tools and software needed for processing or visualizing the data

File formats

File names

Page 30: Scientific data management

How to record metadata

Option 1

• Write metadata
• Save as readme.txt
• Store in file folder with data
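Option 1 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed format: the "field: value" layout, the folder name, and every metadata value except the title are hypothetical.

```python
import tempfile
from pathlib import Path


def write_readme(folder: Path, metadata: dict[str, str]) -> Path:
    """Write metadata as 'field: value' lines to readme.txt inside the data folder."""
    folder.mkdir(parents=True, exist_ok=True)
    readme = folder / "readme.txt"
    lines = [f"{field}: {value}" for field, value in metadata.items()]
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme


if __name__ == "__main__":
    # Hypothetical example values; a temporary folder stands in for the data folder.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_readme(
            Path(tmp) / "ice-cream-study",
            {
                "Title": "Effect of salt on ice cream production efficiency",
                "Creator": "J. Doe",
                "Date": "2012-04-16",
                "File formats": "CSV (UTF-8)",
            },
        )
        print(path.read_text(encoding="utf-8"))
```

Because the readme lives in the same folder as the data, the description travels with the files whenever the folder is copied or archived.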

Page 31: Scientific data management

Option 2

Metadata form/file in an archive/repository

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

Page 32: Scientific data management

Option 3

Annotate

<title>Effect of salt on ice cream production efficiency</title> <temperature>0</temperature>

XML, a popular system for annotating data: http://www.w3schools.com/xml/
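Annotations like the title and temperature tags above can be generated programmatically. The sketch below uses Python's standard xml.etree.ElementTree module; the wrapping "record" element is an assumption added here so the output is a single well-formed XML document.

```python
import xml.etree.ElementTree as ET


def annotate(title: str, temperature: float) -> str:
    """Return one XML record with title and temperature annotations."""
    record = ET.Element("record")  # hypothetical wrapper element
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "temperature").text = str(temperature)
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    print(annotate("Effect of salt on ice cream production efficiency", 0))
```

Reading the record back with ET.fromstring gives the same values, which is the point of annotation: software can find each field by name.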

Page 33: Scientific data management

Assign descriptive names
• Descriptive file names
• Descriptive folder names

Consider these elements:
• project title
• experimental conditions and group
• trial numbers
• file version number indicating data modifications
• date or time stamps
• author initials

data1.csv 75-celsius-trial_control_ver002.csv

Data > 1 > raw    >> part A    >> 110904 > readings

Project-title > Trial 1    >> Experimental    >> Control > Trial 2 > Trial 3
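A small helper can make descriptive file names consistent across a team. This is a sketch under assumptions: the function name, the underscore and hyphen separators, and the three-digit version padding are simply modeled on the 75-celsius-trial_control_ver002.csv example, not on any stated rule.

```python
def slug(text: str) -> str:
    """Lower-case the text and join its words with hyphens."""
    return "-".join(text.lower().split())


def descriptive_name(condition: str, group: str, version: int,
                     extension: str = "csv") -> str:
    """Build a name from experimental condition, group, and file version."""
    return f"{slug(condition)}_{slug(group)}_ver{version:03d}.{extension}"


if __name__ == "__main__":
    print(descriptive_name("75 Celsius trial", "Control", 2))
    # prints 75-celsius-trial_control_ver002.csv
```

Other elements from the list, such as date stamps or author initials, could be added as further underscore-separated parts in the same way.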

Page 34: Scientific data management

Australia

Brazil

Cape Verde

Ethiopia

United States of America

Page 35: Scientific data management

Sharing data

Page 36: Scientific data management

Historic data sharing

Anagrams to secure discoveries

Versus the “open science revolution” of journals today

Galileo Newton Huygens Hooke

Page 37: Scientific data management

Open science

Share research data, products, and communications openly

Potential benefits
• Protects unique data that cannot be readily replicated
• Reinforces open scientific inquiry
• Encourages diversity of analysis and opinion
• Promotes new lines of research
• Makes possible the testing of new or alternative hypotheses and methods of analysis
• Supports studies on data collection methods and measurement
• Facilitates the education of new researchers
• Enables the exploration of topics not envisioned by the initial investigators
• Permits the creation of new datasets when data from multiple sources are combined
• Provides content for scientific education

Page 38: Scientific data management

Data sharing examples

Crystal structure of M-PMV retroviral protease

Page 39: Scientific data management

Private sector too!

Cross-sector data sharing for Alzheimer’s research: http://www.adni-info.org

(News story)

Page 40: Scientific data management

Increased citation rate

Page 41: Scientific data management

Funding agency policies

NIH Data Sharing Policy

NSF Data Sharing Policy

Data management plan for grant applications

Page 42: Scientific data management

Journal expectations

Data sharing as a term of publication

Page 43: Scientific data management

How do I share?

Personal sharing

Share-upon-request: Email me for a copy!

Self-archive: Download from my personal website!

Page 44: Scientific data management

Journal publishing

Page 45: Scientific data management

Institutional archive or repository

UC3 Merritt repository

Page 46: Scientific data management

Public archive or repository

The Ancient Agora of Athens

Ideal characteristics
• Popular with national/global coverage
• Specific to your discipline
• Offers long-term preservation

Find an archive/repository
• Ask colleagues
• Search http://databib.lib.purdue.edu/

Page 47: Scientific data management

Public versus institutional archives and repositories

Institutional archives/repositories
• May restrict to a smaller audience
• May offer greater control of your data

Public archives/repositories
• Create comprehensive dataset for a larger research problem space
• Domain-specific archives/repositories may provide better support

Page 48: Scientific data management

Help others find your data

Berkeley: www.berkeley.edu/mystuff/super-data.csv

file moves to

Stanford: www.stanford.edu/mystuff/super-data.csv

old URL is kaput

Page 49: Scientific data management

Try permanent identifiers

DOI: Digital object identifier

Resolve a DOI by visiting http://dx.doi.org/ followed by the DOI

File can move, but DOI remains the same

The DOI record stores location details
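Resolving a DOI is just string construction: the resolver address followed by the DOI. A minimal sketch, assuming the resolver address given on the slide; the function name and the handling of a leading "doi:" prefix are illustrative additions. The example DOI is the one cited earlier in this talk.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address as given on the slide


def doi_url(doi: str) -> str:
    """Return the resolver URL for a DOI, dropping an optional 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Percent-encode unusual characters but keep the slash between prefix and suffix.
    return RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    print(doi_url("10.1126/science.1207745"))
    # prints http://dx.doi.org/10.1126/science.1207745
```

Citing this URL instead of a file's current web address means the link keeps working after the file moves.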

Page 50: Scientific data management

Generate permanent identifiers

request your free account, by emailing [email protected]

http://n2t.net/ezid

Subscription through the UCB Library

Page 51: Scientific data management

Final tips for sharing

Be selective

Recognize restrictions (privacy and confidentiality)

Online services for sharing among your team
• Research Hub
• 3rd party services

Page 52: Scientific data management

Data management planning

Page 53: Scientific data management

What is a data management plan?

A plan for organizing, storing, and sharing data

Page 54: Scientific data management

Planning associated with greater self-control for
• exercise
• medical adherence
• self-health exams
• sunscreen use
• schoolwork
• refraining from a negative behavior

Source: Townsend and Liu, 2012

Perhaps planning helps for data management

Page 55: Scientific data management

Why have a plan?

Prepare for efficient and quality data collection that is safe and shareable

NSF and NIH requirements

Page 56: Scientific data management

NSF requirements

Data management plan
• ≤ 2 pages
• describes how data will be managed, disseminated, and shared

Plan undergoes peer review

Page 57: Scientific data management

Writing an NSF data management plan

Specific requirements vary by NSF divisions

In general, describe:
• Types of research data and materials produced
• Standards for data format, content, and metadata
• Policies for access and sharing
• Policies for re-use, re-distribution, and derivatives
• Plans for archiving and preserving

You can explain why data will not be shared

Examples 1 and 2

Page 58: Scientific data management

NIH requirements

Timely data sharing encouraged

If requesting ≥ $500k per year, a plan is required

Describe how data will be shared or why sharing is not possible

In the final progress report, describe data sharing actions taken

Page 59: Scientific data management

Writing an NIH data sharing plan

A brief paragraph

Suggested topics
• Schedule for sharing
• Format of the data
• Documentation of the data
• Analytic tools provided
• Data-sharing agreements (criteria and conditions)
• Mode of data sharing


Page 60: Scientific data management

NIH plan example 1

The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects.

Therefore, we are not planning to share the data.

Page 61: Scientific data management

NIH plan example 2

This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years.

Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/

User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource.

Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.

Page 62: Scientific data management

Library guidance

Guides, templates, examples: http://www.lib.berkeley.edu/sciences/data/guide

Page 63: Scientific data management

Online service for building data plans

https://dmp.cdlib.org/

Step-by-step instructions for meeting funding agency requirements

Page 64: Scientific data management

Data ethics

Page 65: Scientific data management

Study by Martinson et al., 2005

Source: doi:10.1038/435737a

Motivated by increasing pressure to publish papers and win grants?

3247 respondents

0.3% admitted to falsification or “cooking” research data

About 1 in 3 confessed to committing at least one of 10 serious misbehaviors

Page 66: Scientific data management

Citing data

Page 67: Scientific data management

Prevent distortions and manipulations

Keep raw original data

Log all changes made
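Logging changes can be as simple as appending one line per modification to a text file kept next to the data. The sketch below is one possible layout, not a standard: the tab-separated fields, the file names, and the author initials are all hypothetical.

```python
import datetime
import tempfile
from pathlib import Path


def log_change(log_file: Path, data_file: str, description: str, author: str) -> str:
    """Append a timestamped, tab-separated entry describing one change."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    entry = f"{stamp}\t{author}\t{data_file}\t{description}"
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
    return entry


if __name__ == "__main__":
    # Hypothetical example; a temporary folder stands in for the project folder.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "changelog.txt"
        log_change(log, "readings_ver002.csv", "Removed duplicate rows from trial 3", "JL")
        print(log.read_text(encoding="utf-8"))
```

The raw original file stays untouched; each edit is made on a new version and recorded here, so any published number can be traced back to the raw data.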

Page 68: Scientific data management

Data licensing

Restrictions on data use, for example

• No for-profit use
• No re-sharing
• Give attribution

Check for license/terms of use

Page 69: Scientific data management

Stay current with data requirements

Review for changes to policies by
• Funding agencies
• University regulations
• Federal and state governments

Page 70: Scientific data management

Haiku summary

Data is precious
Safely store and share widely
Good for your career