Treating Data Like Software:
A Case for Production Quality Data
Jennifer M. Schopf
WHOI Ocean Informatics Working Group
(Also NSF – GEO/OAD) (Soon to be IEEE Computer Society!)
July, 2012
The Hard Problem of Data
• Amount of data generated by scientists is growing exponentially
• And yet we still don’t know how to
  • Collect data sets sustainably
  • Tag data sets in ways that others will agree
  • Discover data sets others have created
  • Make our own data sets accessible to a broad audience
Data handling is really hard…
…but maybe we can leverage what we know about building software:
• 45% of scientists say they spend more time now developing software than they did 5 years ago
• 38% spend at least 1/5th of their time developing software
http://www.nature.com/news/2010/101013/full/467775a.html
Today’s Question
Can we leverage the (slightly more) formalized process of producing software to help us produce data?
“Personal” use (Pre-Prototype)
• Used by me
• I do all the coding
  • My server for “repository”
  • My coding “standards”
• I’m the end user
“Personal” use (Pre-Prototype)
• No testing besides use
• No documentation
  • (~code comments)
• No “release” - Goes straight from code to compile to use (might have versioning)
Prototype
• Used by my “group”
  • ~5-10 people?
• Coding
  • I do most, but they might add
• No real testing, documentation, etc.
• People might pick up new source once a day, or not bother
Moving Toward Production
• Used by someone I don’t know
• Coding by several folks
  • Might have a common repository
  • Coding “standards” depending
• Some testing
• Readme for doc
• Might have a “release” if there’s a repo.
Production Software (for Academics)
• Used by a lot of people I don’t know
Production Software (for Academics)
• Coding by larger group
  • Common repository with check-in procedures
  • Agreed on coding standards
    • Real sw architecture, naming, spacing, etc.
Production Software (for Academics)
• Formal testing
  • Unit tests, test harness, etc.
• Documentation (and a bug fixing process)
• Formal release process
• License
Production Software Features
Production Software
• End User Considerations
• Multiple coders
  • Repository with check-in procedures
  • Coding conventions
• Formal testing
• Bug Fixes
• Documentation
  • Commenting, readme
• Formal release process
• License
So how does this relate to data?
Production Software
• End User Considerations
• Multiple coders
  • Repository with check-in procedures
  • Coding conventions
• Formal testing
• Bug Fixes
• Documentation
  • Commenting, readme
• Formal release process
• License
Production Data
• End User Considerations
• Mult. producers/collectors
  • (Local) archive with check-in procedures
  • Collection conventions
• Formal testing
• QA/QC, Bug fixes
• Documentation
  • Metadata, workflow compat
• Formal release process to external archive
• License and Citation
Bottom Line
• As more people use your “stuff”, you need to formalize how you approach it to keep it useful
• The more people you collaborate with to create your “stuff”, the more process you need to make sure things are coordinated
What is “data”?
• Observations?
• Data analysis results?
• Modeling results?
• Software?
• Metadata? (One person’s metadata is another person’s data…)
“Data” refers to everything needed to have reproducible science
Who’s Using Your Data Sets
• This is all about sharing
  • If no one else has access to your data/code, then it doesn’t matter
• Collaborative science
  • Approach to science is fundamentally changing
  • Your noise is someone else’s signal
• Reproducible science
Local Archive Check-in
• In SW-world this involves some kind of code check-in to a repository
  • Get a sanity check
• When data comes off an instrument or out of a notebook, there needs to be a (very basic) correctness check
  • Columns in the right order
  • Fields fully propagated
  • Boundary conditions
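A minimal sketch of such a check-in gate, assuming the data arrives as CSV text. The column names, the bounds, and the function name `check_in` are hypothetical and chosen only for illustration; “fully propagated” is read here as “no empty fields”.

```python
import csv
import io

# Hypothetical schema: expected column order and plausible value bounds.
EXPECTED_COLUMNS = ["timestamp", "depth_m", "temp_c"]
BOUNDS = {"depth_m": (0.0, 11000.0), "temp_c": (-5.0, 40.0)}


def check_in(csv_text):
    """Return a list of problems; an empty list means the file may be archived."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    # Columns in the right order
    if header != EXPECTED_COLUMNS:
        return [f"unexpected header: {header}"]
    for line_no, row in enumerate(reader, start=2):
        # Fields fully populated
        if len(row) != len(EXPECTED_COLUMNS) or any(c.strip() == "" for c in row):
            problems.append(f"line {line_no}: missing field")
            continue
        # Boundary conditions
        for name, (low, high) in BOUNDS.items():
            raw = row[EXPECTED_COLUMNS.index(name)]
            try:
                value = float(raw)
            except ValueError:
                problems.append(f"line {line_no}: {name} is not a number: {raw!r}")
                continue
            if not low <= value <= high:
                problems.append(f"line {line_no}: {name}={value} outside [{low}, {high}]")
    return problems
```

The check only reports problems and never edits the data, so the person doing the check-in decides whether to fix the file or archive it as is.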
Testing, QA/QC, Bug fixes
• Make it reliable, make it useful
• Quality assurance echoes running a test suite
  • Check data ranges
  • Correct for known instrument error
  • Sometimes first derived data products
• One difference from SW
  • Some people want the data pre-QA/QC
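A rough sketch of a QA/QC pass that keeps the raw value next to the corrected one, so users who want pre-QA/QC data can still get it. The valid range, the instrument offset, and all names are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical values for illustration only.
VALID_RANGE = (-5.0, 40.0)   # plausible temperature range, degrees C
KNOWN_OFFSET = 0.25          # assumed known instrument bias, degrees C


@dataclass
class QCRecord:
    raw: float                  # value as it came off the instrument
    corrected: Optional[float]  # bias-corrected value, or None if rejected
    flag: str                   # "good" or "out_of_range"


def run_qc(readings: List[float]) -> List[QCRecord]:
    """Correct for the known bias, then range-check; the raw value is always kept."""
    low, high = VALID_RANGE
    records = []
    for raw in readings:
        corrected = raw - KNOWN_OFFSET
        if low <= corrected <= high:
            records.append(QCRecord(raw, corrected, "good"))
        else:
            records.append(QCRecord(raw, None, "out_of_range"))
    return records
```

Flagging instead of deleting is a design choice here: rejected readings stay visible in the record rather than silently disappearing.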
Bug Fixes
• One of the fatal flaws with the “publish” approach to data
  • Sometimes data needs to be updated!
• You may find this, or someone else may
  • Any fix should become a step in the QA/QC process
• Sometimes bug fixes are actually suggestions for new features
  • Needed as well for the next time you collect data
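One way to picture “any fix should become a step in the QA/QC process” is a list of cleaning steps that grows with each fix. This is only a sketch: the step functions, the -999 sentinel, and the pipeline names are hypothetical.

```python
from typing import Callable, List

# Each QA/QC step takes the values and returns the cleaned values.
Step = Callable[[List[float]], List[float]]


def drop_out_of_range(values: List[float]) -> List[float]:
    """Original step: keep only values inside a hypothetical valid range."""
    return [v for v in values if -1000.0 <= v <= 1000.0]


def drop_sentinels(values: List[float]) -> List[float]:
    """Step added after a (hypothetical) bug report: -999 marked missing data."""
    return [v for v in values if v != -999.0]


def run_pipeline(values: List[float], steps: List[Step]) -> List[float]:
    """Apply every registered step in order."""
    for step in steps:
        values = step(values)
    return values


pipeline_v1 = [drop_out_of_range]
# The fix is recorded as a permanent step, so the next collection gets it too.
pipeline_v2 = pipeline_v1 + [drop_sentinels]
```

Running the same input through `pipeline_v1` and `pipeline_v2` shows exactly what the fix changed, which is useful when an updated data set has to be explained to existing users.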
Documentation
• Make it usable
• Need more than just metadata over time
• How was the data collected
  • Details on instruments, QA/QC, etc.
• How can the data be used
  • And how should the data NOT be used
• Where can someone find out more about your science?
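The questions above could be captured in a small record shipped alongside the data. The sketch below is one possible shape, not a standard: the class name and every field name are assumptions.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class DatasetDoc:
    """Hypothetical documentation record shipped alongside a data set."""
    title: str
    how_collected: str
    instruments: List[str] = field(default_factory=list)
    qc_steps: List[str] = field(default_factory=list)
    intended_uses: List[str] = field(default_factory=list)
    not_suitable_for: List[str] = field(default_factory=list)
    more_info: str = ""

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the data files."""
        return json.dumps(asdict(self), indent=2)
```

`missing_sections()` could be run before a release to catch documentation that was never filled in, including the “how NOT to use it” part.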
Formal Release Process (to external archive)
• Note “Release” – Not “publication”
• “The data publication metaphor can be misleading and may even countermand aspects of good data stewardship.”
  • Mark Parsons and Peter Fox, “Is Data Publication the Right Metaphor?”
    http://mp-datamatters.blogspot.com/2011/12/seeking-open-review-of-provocative-data.html
• Similar to software release – formal and planned for production quality
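As with a software release, a data release can be pinned down by a version number and a checksum for every file. The following is a sketch under that assumption; the manifest layout and function names are hypothetical and not a format required by any archive.

```python
import hashlib
import json
from typing import Dict


def build_release_manifest(version: str, files: Dict[str, bytes]) -> str:
    """Return a JSON manifest with a SHA-256 checksum for every file in the release."""
    checksums = {
        name: hashlib.sha256(content).hexdigest()
        for name, content in sorted(files.items())
    }
    return json.dumps({"version": version, "files": checksums}, indent=2)


def verify_release(manifest_json: str, files: Dict[str, bytes]) -> bool:
    """True if the files match the checksums recorded at release time."""
    manifest = json.loads(manifest_json)
    current = {
        name: hashlib.sha256(content).hexdigest()
        for name, content in files.items()
    }
    return manifest["files"] == current
```

A later bug fix then produces a new version with a new manifest, instead of silently changing the files behind an existing release.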
License
• Get credit for your work
• Creative Commons license
  • You keep your copyright but allow people to copy and distribute your work provided they give you credit, and only on the conditions you specify
• Every data set should come with citation information
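Citation information is easier to ship with every data set if it is generated from a few stored fields. This is a sketch only: the fields and the output layout are assumptions, not an established citation style.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    """Hypothetical citation fields stored with a data set."""
    authors: List[str]
    year: int
    title: str
    version: str
    archive: str
    identifier: str  # e.g. a DOI or other persistent identifier

    def format(self) -> str:
        """Render a one-line citation string."""
        return (
            f"{', '.join(self.authors)} ({self.year}). {self.title}, "
            f"version {self.version}. {self.archive}. {self.identifier}"
        )
```

Including the version in the citation ties the credit to a specific release, so a paper that used version 1.0.0 still points at the right data after a fix.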
Open Source Software is Like a Free Puppy
• Long term costs
• Needs love and attention
• May lose charm after growing up
• Occasional clean-ups required
• Many left abandoned by their owners
• May not be quite what you think
Recap on building production data
• Local archive – get a sanity check
• Testing – make it reliable
• QA/QC, Bug fixes – make it useful
• Documentation – make it usable
• Metadata – make it understandable
• Formal release – make it stable
• Citation – get some credit
Today’s Question
Can we leverage the (slightly more) formalized process of producing software to help us produce data?
Managing Data Like Software
Production Software
• End User Considerations
• Multiple coders
  • Repository with check-in procedures
  • Coding conventions
• Formal testing
• Bug Fixes
• Documentation
  • Commenting, readme
• Formal release process
• License
Production Data
• End User Considerations
• Mult. producers/collectors
  • (Local) archive with check-in procedures
  • Collection conventions
• Formal testing
• QA/QC, Bug fixes
• Documentation
  • Metadata, workflow compat
• Formal release process to external archive
• License and Citation
Contact Points
• Jennifer Schopf
  • [email protected]
This talk is based on content written up in:
“Treating Data Like Software: A Case for Production Quality Data”, Proceedings of the Joint Conference on Digital Libraries, June 2012.
http://delivery.acm.org/10.1145/2240000/2232846/p153-schopf.pdf