Data Stewardship
Carly Strasser California Digital Library [email protected] SPATIAL / IsoCamp
June 2014
Tips & Tools
From Flickr via librarianinsta.tumblr.com
I am not a librarian. But I do work at a library.
Enable data sharing
Encourage new incentives
Think about code sharing
Work with libraries, publishers and researchers
Explore new tools to help change system
Build tools
Why are you here?
Science: you’re (probably) doing it wrong
Back in the day…
Da Vinci
Curie Newton
classicalschool.blogspot.com
Darwin
Research has changed
Better
From wikimedia
Such Internet!
So many tools!
From Flickr by John Jobby
So much data!
Research has changed Worse
Digital data
From Flickr by Flickmor
From Flickr by US Army Environmental Command
From Flickr by DW0825
C. Strasser
Courtesy of WHOI
From Flickr by deltaMike
Digital data +
Complex workflows
Scientists are bad at data management.
An embarrassing example…
From Flickr by lincolnblues
?
From Flickr by ransomtech
• Didn’t share the data
• Didn’t document the data (metadata)
• Didn’t document provenance/workflow
From Flickr by ransomtech
No reproducibility, transparency, or reuse
From Flickr by johntrainor
Why should I care?
Because reproducibility* is one of the fundamental tenets of science.
*Reproducibility here means being able to go from data to figures/results, not independent verification by following the same techniques.
Because we need to be credible.
Because Fox News, creationism, and the war on science.
“Help us identify grants that are wasteful or that you don’t think are a good use of taxpayer dollars.”
Rep. Adrian Smith (R-Nebraska), a member of the House Committee on Science and Technology
Because it means faster progress.
Because you are a good person.
From Flickr by Redden-McAllister
From Flickr by Ken Cowell
From Flickr Brandi Jordan
Open Science: making data, research, and dissemination available to all
flowingdata.com
Map of Scientific Collaborations
Because you have to.
Journals, Institutions, Funders
From Flickr by Eva Rinaldi Celebrity and Live Music Photographer
… “Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.”
Feb 2013
1. Maximize free public access
2. Ensure researchers create data management plans
3. Allow costs for data preservation and access in proposal budgets
4. Ensure evaluation of data management plan merits
5. Ensure researchers comply with their data management plans
6. Promote data deposition into public repositories
7. Develop approaches for identification and attribution of datasets
8. Educate folks about data stewardship
From Flickr by Joe Crimmings Photography
From Flickr by Michael Tinkler
data management
From Flickr by Big Swede Guy
Best Practices
From Flickr by Mark Sardella
Plan before data collection
• Create a key (data dictionary)
• Make sure names are unique
• Define codes
From Flickr by zebbie
Planning Design sample naming scheme
PhDcomics.com
Planning Design file naming scheme
Use descriptive file names
• Unique
• Reflect contents
From R Cook, ESA Best Practices Workshop 2010
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls
(Eaffinis = study organism, nanaimo = site name, 2010 = year, counts = what was measured)
*Not for everyone
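A naming scheme like the "Better" example can be checked by a short script rather than by eye. The sketch below is only an illustration: the function names, the field order (organism, site, year, what was measured), and the letters-only pattern are assumptions, not something prescribed in these slides.

```python
import re

# Assumed scheme: organism_site_year_measured.extension, with no spaces.
# The field order mirrors the "Better" example; adapt it to your own project.
NAME_PATTERN = re.compile(
    r"^(?P<organism>[A-Za-z]+)_(?P<site>[A-Za-z]+)_(?P<year>\d{4})"
    r"_(?P<measured>[A-Za-z]+)\.(?P<ext>[A-Za-z0-9]+)$"
)


def build_file_name(organism, site, year, measured, ext="csv"):
    """Assemble a descriptive file name from its parts (hypothetical helper)."""
    name = f"{organism}_{site}_{year}_{measured}.{ext}"
    if NAME_PATTERN.match(name) is None:
        raise ValueError(f"File name does not follow the scheme: {name!r}")
    return name


def parse_file_name(name):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None
```

Running `parse_file_name` over a project folder is one quick way to list files that drifted from the agreed scheme.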
Planning Design file naming scheme
Biodiversity
Lake
Experiments
Field work
Grassland
Biodiv_H20_heatExp_2005to2008.csv
Biodiv_H20_predatorExp_2001to2003.csv
…
Biodiv_H20_PlanktonCount_2001toActive.csv
Biodiv_H20_ChlAprofiles_2003.csv
…
From S. Hampton
Planning Design file organization
Consider…
• Dependencies?
• File formats?
• Time of collection?
• Order of analysis?
Planning
Design your spreadsheet
• Constrain entries
• Atomize
• Break down spreadsheets
A relational database is
• A set of tables
• Relationships among the tables
• A language to specify & query the tables
An RDB provides
• Scalability: millions+ records
• Features for sub-setting, querying, sorting
• Reduced redundancy & entry errors
From Mark Schildhauer
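To make "a set of tables, relationships among the tables, and a language to query them" concrete, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. The table names, column names, and values are invented for illustration.

```python
import sqlite3

# In-memory database for illustration; all names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE site (
        site_id   TEXT PRIMARY KEY,
        lake_name TEXT NOT NULL
    );
    CREATE TABLE plankton_count (
        count_id      INTEGER PRIMARY KEY,
        site_id       TEXT NOT NULL REFERENCES site(site_id),
        sample_year   INTEGER NOT NULL,
        n_individuals INTEGER NOT NULL CHECK (n_individuals >= 0)
    );
    """
)

# Site details are stored once; counts point back to them by site_id.
conn.executemany("INSERT INTO site VALUES (?, ?)", [("A", "Lake A"), ("B", "Lake B")])
conn.executemany(
    "INSERT INTO plankton_count (site_id, sample_year, n_individuals) VALUES (?, ?, ?)",
    [("A", 2001, 12), ("A", 2002, 30), ("B", 2001, 7)],
)

# The query language does the sub-setting, joining, and sorting.
rows = conn.execute(
    """
    SELECT s.lake_name, SUM(p.n_individuals)
    FROM plankton_count AS p
    JOIN site AS s ON s.site_id = p.site_id
    GROUP BY s.lake_name
    ORDER BY s.lake_name
    """
).fetchall()
print(rows)
```

The `CHECK` constraint is a small example of how a database reduces entry errors: a negative count is rejected at insert time instead of being discovered during analysis.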
Planning Consider a database
You should invest time in learning databases if
• your data sets are large or complex
Consider investing time in learning databases if
• your data are small and humble
• you ever intend to share your data
• you are < 30 years old
Planning
From Mark Schildhauer
Consider a database
Store your data in a repository Institutional archive
Discipline/specialty archive
Pick a data repository
From Flickr by torkildr
Ask a librarian
Repos of repos: databib.org re3data.org
Planning
From Flickr by sepa synod
From Flickr by taberandrew
From Flickr by withassociates
What software? What hardware? What personnel?
How often? Set up reminders!
Test system
Planning Decide on preservation/backup
…document that describes what you will do with your data throughout the research project
From Flickr by Barbies Land
Write a data management plan!
Planning
DMP components
But they all have different requirements and express them in different ways
• What will be collected
• Methods
• Standards
• Metadata
• Sharing/access
• Long-term storage
Planning
From Flickr by Barbies Land
Step-by-step wizard for generating DMP
create | edit | re-use | share
Free & open to community
dmptool.org Planning
During Data Collection & Entry
From Flickr by Julia Manzerova
Realistically:
• Archive .csv version of raw data
• Make a “raw” tab in working data file
• Do all work on other tabs
During collection Keep raw data raw
Raw data as .csv
R script for processing & analysis
During collection
Ideally:
• Use scripts to process data
• Save them with data
Keep raw data raw
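One way to keep raw data raw with a script is to read the archived .csv, write cleaned output to a different file, and confirm by checksum that the raw file was not touched. The slides mention R for this step; the sketch below uses Python only as an illustration. The function names, file names, and the whitespace-stripping "cleaning" step are assumptions.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, used to confirm raw data are unchanged."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def process_raw(raw_path, derived_path):
    """Read the raw .csv and write a cleaned copy elsewhere; never write to the raw file.

    Assumes every row has the same number of fields as the header row.
    """
    raw_path, derived_path = Path(raw_path), Path(derived_path)
    if raw_path.resolve() == derived_path.resolve():
        raise ValueError("Derived output must not overwrite the raw data file")
    checksum_before = sha256_of(raw_path)
    with raw_path.open(newline="") as src, derived_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Example cleaning step (an assumption): strip stray whitespace from each cell.
            writer.writerow({key: (value or "").strip() for key, value in row.items()})
    if sha256_of(raw_path) != checksum_before:
        raise RuntimeError("Raw data file changed during processing")
    return derived_path


# Demonstration with a throwaway file (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "raw_counts.csv"
    raw_file.write_text("site,count\n A , 12 \nB,7\n")
    checksum = sha256_of(raw_file)
    derived_file = process_raw(raw_file, Path(tmp) / "clean_counts.csv")
    with derived_file.open(newline="") as handle:
        cleaned_rows = list(csv.reader(handle))
    raw_unchanged = sha256_of(raw_file) == checksum
```

Saving a script like this next to the data means the cleaned file can always be regenerated from the raw one.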
During collection Document your workflow
[Flow chart: Temperature data and Salinity data → Data import into Excel → Data in spreadsheet → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production]
Workflow: how you get from the raw data to the final products of your research
Simple workflow: flow chart
During collection
Simple workflow: commented script
• R, SAS, MATLAB…
• Well-documented code is: easier to review, easier to share, easier to use for repeat analysis
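A commented script can be very short and still document the whole workflow. The sketch below follows the temperature and salinity example: quality control, then mean and SD. It is written in Python rather than R, SAS, or MATLAB purely for illustration; the readings and the plausible-range limits are made up.

```python
import statistics

# Hypothetical raw readings; None marks a missing value recorded in the field.
temperature_c = [12.1, 12.4, None, 12.9, 45.0, 12.6]
salinity_psu = [31.2, 31.0, 30.8, None, 31.1, 30.9]


def clean(values, low, high):
    """Quality control: drop missing values and readings outside a plausible range."""
    return [v for v in values if v is not None and low <= v <= high]


def summarize(values):
    """Analysis: sample size, mean, and sample standard deviation."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


# Step 1: quality control and data cleaning (range limits are assumptions).
clean_temperature = clean(temperature_c, low=-2.0, high=35.0)
clean_salinity = clean(salinity_psu, low=0.0, high=40.0)

# Step 2: summary statistics for each cleaned variable.
temperature_summary = summarize(clean_temperature)
salinity_summary = summarize(clean_salinity)

# Step 3: report results; graph production would follow here.
print("Temperature:", temperature_summary)
print("Salinity:", salinity_summary)
```

Each comment corresponds to one box in a flow chart, so the script doubles as the workflow record.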
Document your workflow
Fancy schmancy workflows Resulting output
https://kepler-project.org
During collection Document your workflow
Workflows enable
• Reproducibility
• Transparency
• Reuse
From Flickr by merlinprincesse
During collection Document your workflow
Constrain data entries
• Excel lists
• Data validation
• Google docs forms
Modified from K. Vanderbilt
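The same idea behind Excel lists and data validation can be applied in a script: check each entry against a fixed set of allowed values before it is accepted. This is a minimal sketch; the field names, allowed values, and the `validate_entry` helper are hypothetical.

```python
# Hypothetical controlled vocabularies for a field data sheet.
ALLOWED_SITES = {"nanaimo", "victoria"}
ALLOWED_SEXES = {"F", "M", "unknown"}


def validate_entry(entry):
    """Return a list of problems found in one data-entry record (empty if valid)."""
    problems = []
    if entry.get("site") not in ALLOWED_SITES:
        problems.append(f"site must be one of {sorted(ALLOWED_SITES)}")
    if entry.get("sex") not in ALLOWED_SEXES:
        problems.append(f"sex must be one of {sorted(ALLOWED_SEXES)}")
    count = entry.get("count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(count, int) or isinstance(count, bool) or count < 0:
        problems.append("count must be a non-negative integer")
    return problems
```

For example, `validate_entry({"site": "Nanaimo ", "sex": "female", "count": -1})` reports three problems, each of which would otherwise surface later as a mystery category or outlier.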
During collection
During collection Atomize
One piece of information per cell
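As a small illustration of atomizing, the sketch below splits two compound cells (date plus time, value plus unit) into one field each. The input layout and the field names are assumptions chosen for the example.

```python
def atomize(record):
    """Split compound cells so that each output field holds one piece of information.

    Assumes "when" looks like "YYYY-MM-DD HH:MM" and "temperature" like "12.5 C".
    """
    date, time = record["when"].split(" ", 1)
    value, unit = record["temperature"].split(" ", 1)
    return {
        "date": date,
        "time": time,
        "temperature_value": float(value),
        "temperature_unit": unit,
    }


atomized = atomize({"when": "2010-06-14 09:30", "temperature": "12.5 C"})
print(atomized)
```

Once the value and its unit live in separate fields, the value can be sorted, averaged, and validated as a number.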
Create parameter table
From doi:10.3334/ORNLDAAC/777
From R Cook, ESA Best Practices Workshop 2010
During collection Break down spreadsheets
Fake a relational database
Create a site table
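Creating a site table means storing each site's details once and letting the observations refer to the site by an identifier. The sketch below shows that split on a few made-up rows; the column names and the `split_site_table` helper are hypothetical.

```python
# Hypothetical flat spreadsheet: site details are repeated on every row.
flat_rows = [
    {"site_id": "S1", "site_name": "North dock", "latitude": 49.17, "date": "2010-06-14", "count": 12},
    {"site_id": "S1", "site_name": "North dock", "latitude": 49.17, "date": "2010-06-21", "count": 9},
    {"site_id": "S2", "site_name": "South dock", "latitude": 49.15, "date": "2010-06-14", "count": 4},
]


def split_site_table(rows, site_fields=("site_name", "latitude")):
    """Split flat rows into a site table (one entry per site) and an observation table."""
    sites = {}
    observations = []
    for row in rows:
        site_id = row["site_id"]
        site_info = {field: row[field] for field in site_fields}
        # Flag inconsistent repeats instead of silently keeping one version.
        if sites.setdefault(site_id, site_info) != site_info:
            raise ValueError(f"Conflicting site details for {site_id}")
        observations.append({k: v for k, v in row.items() if k not in site_fields})
    return sites, observations


sites, observations = split_site_table(flat_rows)
```

In a spreadsheet the two results would be separate tabs or files, linked only by the `site_id` column.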
Why are you promoting Excel?
During collection Create metadata
Metadata: data reporting
• WHO created the data?
• WHAT is the content of the data set?
• WHEN was it created?
• WHERE was it collected?
• HOW was it developed?
• WHY was it developed?
From Flickr by /\/\ichael Patric|{
During collection Create metadata
Digital context
• Name of the data set
• The name(s) of the data file(s) in the data set
• Date the data set was last modified
• Example data file records for each data type file
• Pertinent companion files
• List of related or ancillary data sets
• Software (including version number) used to prepare/read the data set
• Data processing that was performed
Personnel & stakeholders
• Who collected
• Who to contact with questions
• Funders
Scientific context
• Scientific reason why the data were collected
• What data were collected
• What instruments (including model & serial number) were used
• Environmental conditions during collection
• Temporal & spatial resolution
• Standards or calibrations used
Information about parameters
• How each was measured or produced
• Units of measure
• Format used in the data set
• Precision & accuracy if known
Information about data
• Definitions of codes used
• Quality assurance & control measures
• Known problems that limit data use (e.g. uncertainty, sampling problems)
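Even before adopting a formal standard, the who/what/when/where/how/why questions can be captured in a small machine-readable file saved alongside the data. The sketch below is one possible layout, not a metadata standard: the field names, the `missing_fields` helper, and every value are placeholders.

```python
import json

# The six reporting questions, used here as required fields (an assumed convention).
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def missing_fields(metadata):
    """List required metadata fields that are absent or left blank."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(metadata.get(field) or "").strip()
    ]


# All values below are placeholders for illustration only.
metadata = {
    "who": "Jane Researcher (hypothetical)",
    "what": "Copepod counts per sample",
    "when": "2010-06-14 to 2010-08-30",
    "where": "Placeholder field site",
    "how": "Net tows, counted under a dissecting microscope",
    "why": "Seasonal abundance survey",
    "units": {"count": "individuals per sample"},
    "software": "Placeholder spreadsheet program, version unknown",
}

# Serialize so the record can be stored next to the data file.
metadata_json = json.dumps(metadata, indent=2, sort_keys=True)
print(missing_fields(metadata))
```

A check like `missing_fields` can run before a data set is archived, so blanks are caught while the person who knows the answer is still around.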
During collection Create metadata
What is metadata?
Metadata standards…
• Provide structure to describe data: common terms | definitions | language | structure
• Come in many flavors: EML, FGDC, ISO19115, DarwinCore, …
• Can be met using software tools: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
During collection
Standard < Create metadata
During collection Back up daily
From Flickr by lippo
From Flickr by see phar
Original, Near, Far
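The original/near/far pattern can be scripted so that every copy is verified rather than assumed. In this sketch two temporary folders stand in for a nearby drive and an off-site location; the `back_up` helper, the folder names, and the file contents are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(original, destinations):
    """Copy a file into each destination folder and verify every copy by checksum."""
    original = Path(original)
    expected = file_digest(original)
    copies = []
    for directory in destinations:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        copy_path = Path(shutil.copy2(original, directory))
        if file_digest(copy_path) != expected:
            raise IOError(f"Backup verification failed for {copy_path}")
        copies.append(copy_path)
    return copies


# Demonstration: "near" and "far" are stand-ins for real backup locations.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original" / "counts.csv"
    original.parent.mkdir()
    original.write_text("site,count\nA,12\n")
    copies = back_up(original, [Path(tmp) / "near", Path(tmp) / "far"])
    backup_names = [copy.parent.name for copy in copies]
    all_verified = all(file_digest(copy) == file_digest(original) for copy in copies)
```

Scheduling a script like this is one way to make "back up daily" happen without relying on memory.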
During collection
From Flickr by Barbies Land
Remember that data management plan?
Revisit Review Revise
During collection
Schedule a time each week or month
Revisit Review Revise
From Flickr by purplemattfish
From Flickr by celikins
Where to start?
From Flickr by Andy Graulund
Make a resolution
• Triage on current projects
• Get advisor, lab mates, collaborators on board
• Do better next time
Start working online
From Flickr by karindalziel
http://datapub.cdlib.org
Reproducibility, E-notebooks, Online science
Step-by-step wizard for generating DMP
create | edit | re-use | share
Free & open to community
dmptool.org Write a DMP
databib.org
Where should I put my data?
Find a repository
Get help
From Flickr by thewmatt
From Flickr by North Carolina Digital Heritage Center
From Flickr by Madison Guy
Get help from your library
Learn new skills: Software Carpentry
www.software-carpentry.org
From Flickr by Micah Taylor
Other Fun Stuff
Altmetrics?
Impact Factors
+ Citation Counts
Credit in academia…
Altmetrics Article-level metrics Altmetrics for alt-products
Data Code Slides Blogs
Downloads Tweets
Mentions Views
From Flickr by Skakerman
Researcher Identification
BIG initiatives…
NSF funded DataNet Project Office of Cyberinfrastructure
www.dataone.org
New partners…
Better methods…
Science is changing.
Embrace it.
From Flickr by dotpolka
Manage & share your data!
Website: carlystrasser.net
Email: [email protected]
Twitter: @carlystrasser
Slides: slideshare.net/carlystrasser