Treating Data Like Software: A Case for Production Quality Data Jennifer M. Schopf WHOI Ocean Informatics Working Group (Also NSF – GEO/OAD) (Soon to be IEEE Computer Society!) July, 2012


The Hard Problem of Data

• The amount of data generated by scientists is growing exponentially
• And yet we still don’t know how to:
  • Collect data sets sustainably
  • Tag data sets in ways that others will agree on
  • Discover data sets others have created
  • Make our own data sets accessible to a broad audience

Data handling is really hard…

…but maybe we can leverage what we know about building software:

• 45% of scientists say they spend more time developing software now than they did 5 years ago
• 38% spend at least one fifth of their time developing software

http://www.nature.com/news/2010/101013/full/467775a.html


Today’s Question

Can we leverage the (slightly more) formalized process of producing software to help us produce data?

“Personal” use (Pre-Prototype)

• Used by me
• I do all the coding
  • My server for “repository”
  • My coding “standards”
• I’m the end user

“Personal” use (Pre-Prototype)

• No testing besides use
• No documentation
  • (~code comments)
• No “release”: goes straight from code to compile to use (might have versioning)

Prototype

• Used by my “group”
  • ~5-10 people?
• Coding
  • I do most, but they might add
• No real testing, documentation, etc.
• People might pick up new source once a day, or not bother

Moving Toward Production

• Used by someone I don’t know
• Coding by several folks
  • Might have a common repository
  • Coding “standards” depending
• Some testing
• Readme for doc
• Might have a “release” if there’s a repo

Production Software (for Academics)

• Used by a lot of people I don’t know

Production Software (for Academics)

• Coding by larger group
  • Common repository with check-in procedures
  • Agreed-on coding standards
• Real software architecture, naming, spacing, etc.

Production Software (for Academics)

• Formal testing
  • Unit tests, test harness, etc.
• Documentation (and a bug fixing process)
• Formal release process
• License

Production Software Features

Production Software
• End User Considerations
• Multiple coders
  • Repository with check-in procedures
  • Coding conventions
• Formal testing
• Bug fixes
• Documentation
  • Commenting, readme
• Formal release process
• License

So how does this relate to data?

Production Software                         Production Data
• End User Considerations                   • End User Considerations
• Multiple coders                           • Multiple producers/collectors
  • Repository with check-in procedures       • (Local) archive with check-in procedures
  • Coding conventions                        • Collection conventions
• Formal testing                            • Formal testing
• Bug fixes                                 • QA/QC, bug fixes
• Documentation                             • Documentation
  • Commenting, readme                        • Metadata, workflow compatibility
• Formal release process                    • Formal release process to external archive
• License                                   • License and citation

Bottom Line

• As more people use your “stuff”, you need to formalize how you approach it so that it stays useful
• The more people you collaborate with to create your “stuff”, the more process you need to make sure things are coordinated

What is “data”?

• Observations?
• Data analysis results?
• Modeling results?
• Software?
• Metadata? (One person’s metadata is another person’s data…)

“Data” refers to everything needed to have reproducible science

Who’s Using Your Data Sets

• This is all about sharing
  • If no one else has access to your data/code, then it doesn’t matter
• Collaborative science
  • Approach to science is fundamentally changing
  • Your noise is someone else’s signal
• Reproducible science

Local Archive Check-in

• In SW-world this involves some kind of code check-in to a repository
  • Get a sanity check
• When data comes off an instrument or out of a notebook, there needs to be a (very basic) correctness check
  • Columns in the right order
  • Fields fully populated
  • Boundary conditions
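The three check-in tests listed for data (column order, fully populated fields, boundary conditions) can be sketched in a few lines of Python. This is only a minimal illustration: the column names, the bounds, and the function name `sanity_check` are hypothetical and not taken from the talk.

```python
import csv
import io

# Hypothetical column layout and physical bounds, for illustration only.
EXPECTED_COLUMNS = ["timestamp", "depth_m", "temperature_c"]
BOUNDS = {"depth_m": (0.0, 11000.0), "temperature_c": (-5.0, 40.0)}


def sanity_check(csv_text, expected_columns=EXPECTED_COLUMNS, bounds=BOUNDS):
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    # Check 1: columns present and in the right order.
    if header != expected_columns:
        problems.append(f"header {header} != expected {expected_columns}")
        return problems
    for line_no, row in enumerate(reader, start=2):
        # Check 2: every field is populated.
        if len(row) != len(expected_columns) or any(c.strip() == "" for c in row):
            problems.append(f"line {line_no}: missing field(s)")
            continue
        # Check 3: numeric fields fall inside their boundary conditions.
        for name, (low, high) in bounds.items():
            raw = row[expected_columns.index(name)]
            try:
                value = float(raw)
            except ValueError:
                problems.append(f"line {line_no}: {name} is not a number: {raw!r}")
                continue
            if not low <= value <= high:
                problems.append(f"line {line_no}: {name}={value} outside [{low}, {high}]")
    return problems
```

A check like this would run at the moment data enters the local archive, much as a pre-commit check runs when code enters a repository.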

Testing, QA/QC, Bug fixes

• Make it reliable, make it useful
• Quality assurance echoes running a test suite
  • Check data ranges
  • Correct for known instrument error
  • Sometimes first derived data products
• One difference from SW
  • Some people want the data pre-QA/QC
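One way to picture a QA/QC pass that still serves people who want the pre-QA/QC data is to keep the raw value next to the corrected one and attach a range flag. The sketch below assumes the known instrument error is a constant offset. That assumption, along with the names `Reading` and `run_qc`, is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    raw_value: float        # value exactly as it came off the instrument
    corrected_value: float  # value after removing the known instrument error
    in_range: bool          # QC flag: corrected value lies within the valid range


def run_qc(raw_values, offset, valid_range):
    """Correct each raw value for a known constant offset and flag out-of-range results.

    The raw value is preserved so pre-QA/QC users can still get at it.
    """
    low, high = valid_range
    readings = []
    for raw in raw_values:
        corrected = raw - offset
        readings.append(Reading(raw, corrected, low <= corrected <= high))
    return readings
```

Flagging rather than deleting out-of-range values is a deliberate choice here: one person's noise may be someone else's signal.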

Bug Fixes

• One of the fatal flaws with the “publish” approach to data
  • Sometimes data needs to be updated!
• You may find this, or someone else may
  • Any fix should become a step in the QA/QC process
• Sometimes bug fixes are actually suggestions for new features
  • Needed as well for the next time you collect data
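The idea that any fix should become a step in the QA/QC process can be sketched as a registry of steps that grows each time a bug is found. The two fixes below (a sentinel value and a sign error) are invented examples, as are the names `qc_step` and `apply_qc`.

```python
# Ordered list of QC steps; each fix is registered once and then runs on every data set.
QC_STEPS = []


def qc_step(func):
    """Decorator that appends a fix to the QC pipeline."""
    QC_STEPS.append(func)
    return func


@qc_step
def drop_sentinel_values(values):
    # Hypothetical fix: the logger writes -999.0 when a sensor drops out.
    return [v for v in values if v != -999.0]


@qc_step
def fix_depth_sign(values):
    # Hypothetical fix reported by a user: depths were recorded as negative numbers.
    return [abs(v) for v in values]


def apply_qc(values):
    """Run every registered step, in the order the fixes were added."""
    for step in QC_STEPS:
        values = step(values)
    return values
```

Because the steps are kept in one place, the same fixes apply automatically the next time data is collected.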

Documentation

• Make it usable
• Need more than just metadata over time
• How was the data collected?
  • Details on instruments, QA/QC, etc.
• How can the data be used?
  • And how should the data NOT be used?
• Where can someone find out more about your science?

Formal Release Process (to external archive)

• Note “Release”, not “publication”
• “The data publication metaphor can be misleading and may even countermand aspects of good data stewardship.”
  • Mark Parsons and Peter Fox, Is Data Publication the Right Metaphor?
    http://mp-datamatters.blogspot.com/2011/12/seeking-open-review-of-provocative-data.html
• Similar to software release: formal and planned for production quality
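By analogy with a software release, a data release could carry a version label and a checksum for each file, so that anyone holding a copy can tell whether it still matches what was released. The sketch below is one possible shape for this. The manifest layout and the function names are assumptions, not something the talk prescribes.

```python
import hashlib


def build_release_manifest(version, files):
    """Build a manifest for a release; `files` maps a file name to its bytes."""
    return {
        "version": version,
        "files": {
            name: hashlib.sha256(content).hexdigest()
            for name, content in sorted(files.items())
        },
    }


def verify_release(manifest, files):
    """Return the names of files that are missing or no longer match the manifest."""
    return sorted(
        name
        for name, digest in manifest["files"].items()
        if name not in files or hashlib.sha256(files[name]).hexdigest() != digest
    )
```

A later bug fix would then produce a new manifest with a new version label rather than silently changing the released files.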

License

• Get credit for your work
• Creative Commons license
  • You keep your copyright but allow people to copy and distribute your work provided they give you credit, and only on the conditions you specify
• Every data set should come with citation information
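Shipping citation information with every data set can be as simple as generating one citation string from the release metadata. The field order below is one plausible layout, not a citation standard, and every value in the usage example is a placeholder.

```python
def format_citation(authors, year, title, version, archive, identifier):
    """Assemble a single-line citation from release metadata (illustrative layout)."""
    author_text = "; ".join(authors)
    return f"{author_text} ({year}). {title}, version {version}. {archive}. {identifier}"


# Placeholder values only; none of these refer to a real data set.
example = format_citation(
    authors=["Lastname, A.", "Othername, B."],
    year=2012,
    title="Example Mooring Temperature Records",
    version="1.0.0",
    archive="Example Data Archive",
    identifier="doi:10.0000/example",
)
```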

Open Source Software is Like a Free Puppy

• Long term costs
• Needs love and attention
• May lose charm after growing up
• Occasional clean-ups required
• Many left abandoned by their owners
• May not be quite what you think


Recap on building production data

• Local archive: get a sanity check
• Testing: make it reliable
• QA/QC, bug fixes: make it useful
• Documentation: make it usable
• Metadata: make it understandable
• Formal release: make it stable
• Citation: get some credit

Today’s Question

Can we leverage the (slightly more) formalized process of producing software to help us produce data?

Managing Data Like Software

Production Software                         Production Data
• End User Considerations                   • End User Considerations
• Multiple coders                           • Multiple producers/collectors
  • Repository with check-in procedures       • (Local) archive with check-in procedures
  • Coding conventions                        • Collection conventions
• Formal testing                            • Formal testing
• Bug fixes                                 • QA/QC, bug fixes
• Documentation                             • Documentation
  • Commenting, readme                        • Metadata, workflow compatibility
• Formal release process                    • Formal release process to external archive
• License                                   • License and citation

Contact Points

• Jennifer Schopf
  • [email protected]

This talk is based on content written up in:

“Treating Data Like Software: A Case for Production Quality Data”, Proceedings of the Joint Conference on Digital Libraries, June 2012.

http://delivery.acm.org/10.1145/2240000/2232846/p153-schopf.pdf