a centre of expertise in data curation and preservation
Digital Curation 101, October 6th-10th, 2008, NeSC, Edinburgh
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
Create or Receive Scientific data
Dr. Frank Gibson and Dr. Phillip Lord (Newcastle University)
“In the standard model, one collects data, publishes a paper or papers and then gradually loses the original dataset.”
- Geoffrey Bowker
[Four image-only slides. Slides by Cameron Neylon http://www.slideshare.net/CameronNeylon]
[Image: http://flickr.com/photos/nicmcphee/2756494307/]
If we have a paper, who cares about the data?
A paper = a claim (or claims)
The full record that supports that claim should be available for detailed examination and critique
Slide by Cameron Neylon http://www.slideshare.net/CameronNeylon
Funders
http://flickr.com/photos/luismimunoznajar/2093185804/
Curation aims
• Amenable
• Preservable
• Ownable
• Accessible
• Citable
Significant Properties of Data
• Content
• Syntax
• Semantics
Title
Creator
Type
Source
Date
Identifier
Publisher
Rights
Simple Dublin Core
Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
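As a minimal sketch of how the fifteen Simple Dublin Core elements might be recorded for a dataset, the Python below builds a small XML record with the standard library. The namespace http://purl.org/dc/elements/1.1/ is the Dublin Core element namespace; the `record` wrapper element, the `build_dc_record` helper and the example values are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen Simple Dublin Core elements (lower-case element names).
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_dc_record(fields):
    """Build an XML record from a dict of Dublin Core element -> value.

    The <record> wrapper is a hypothetical container, not part of Dublin Core.
    """
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Simple Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    # Emit elements in a fixed order so the output is reproducible.
    for name in DC_ELEMENTS:
        if name in fields:
            child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            child.text = fields[name]
    return record


# Hypothetical example values.
example = build_dc_record({
    "title": "Example gel electrophoresis dataset",
    "creator": "A. Researcher",
    "date": "2008-10-06",
    "format": "text/xml",
    "rights": "Creative Commons Attribution",
})
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

All fifteen elements are optional and repeatable in Simple Dublin Core; this sketch allows at most one value per element to keep the example short.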
Choosing a Syntax
• Openness
  - Is there an open, publicly available specification for the format? Are its specifications in the public domain? Is it unencrypted?
• Portability
  - Is the format independent of hardware, operating system and other software? Is it independent of particular institutions, groups, or events? Is it in widespread current use? Does it contain little or no built-in functionality?
• Quality
  - Is it robust, simple, highly tested and loss-free?
Semantics can be complex
One semantic = many words
Many words = one semantic
• Excel data example – do I need it?
• Zeeberg et al. BMC Bioinformatics 2004 5:80 doi:10.1186/1471-2105-5-80
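Zeeberg et al. report that Excel's default conversions can silently turn gene symbols into dates (for example SEPT2 becoming 2-Sep) and some identifiers into floating-point numbers. The sketch below shows one way such damage might be spotted in a column of identifiers after the fact. The regular expressions, the `suspicious_identifiers` helper and the sample column are assumptions for illustration, not the method used in the paper.

```python
import re

_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Values such as "2-Sep" or "1-DEC": what a gene symbol can look like
# once a spreadsheet has reinterpreted it as a date.
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({_MONTHS})$", re.IGNORECASE)

# Values such as "2.31E+13": an identifier reinterpreted as a number.
FLOAT_LIKE = re.compile(r"^\d+(\.\d+)?E\+\d+$", re.IGNORECASE)


def suspicious_identifiers(values):
    """Return the values that look like spreadsheet-converted identifiers."""
    flagged = []
    for value in values:
        text = value.strip()
        if DATE_LIKE.match(text) or FLOAT_LIKE.match(text):
            flagged.append(value)
    return flagged


# Hypothetical column of identifiers exported from a spreadsheet.
column = ["TP53", "2-Sep", "1-DEC", "BRCA1", "2.31E+13"]
flagged = suspicious_identifiers(column)
print(flagged)
```

A check like this only detects the damage; the original symbol cannot always be recovered from the converted value, which is one reason to choose the syntax before the data are created.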
What is fly?
• Fly: http://en.wikipedia.org/wiki/Image:Air_india_b747-400_vt-esn_arp.jpg
• Fly: http://en.wikipedia.org/wiki/Image:MuscuDomestica.jpg
• Fly: http://en.wikipedia.org/wiki/Image:Green_Highlander_salmon_fly.jpg
• Fly: http://en.wikipedia.org/wiki/Image:Fly_poster.jpg
Ontology
• A controlled vocabulary is an association between formal names (identifiers) and their definitions.
• An ontology is a controlled vocabulary augmented with logical constraints that describe the interrelationships between its terms.
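The distinction between the two definitions above can be sketched in a few lines of Python: a controlled vocabulary is a mapping from identifiers to names and definitions, and an ontology adds relationships that software can reason over. The `EX:` identifiers, the terms and the single `is_a` relation are hypothetical; real bio-ontologies use richer relations and dedicated formats.

```python
# Controlled vocabulary: identifier -> (name, definition).
VOCABULARY = {
    "EX:0001": ("organism", "A living individual."),
    "EX:0002": ("insect", "An organism of the class Insecta."),
    "EX:0003": ("house fly", "An insect of the species Musca domestica."),
    "EX:0004": ("fishing fly", "An artificial lure used in angling."),
}

# Ontology: the vocabulary plus logical relationships between terms.
# Here only "is_a" (child -> set of direct parents).
IS_A = {
    "EX:0002": {"EX:0001"},
    "EX:0003": {"EX:0002"},
}


def ancestors(term_id):
    """Return every term reachable from term_id by following is_a links."""
    seen = set()
    stack = [term_id]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(term_id, ancestor_id):
    """True if term_id is, directly or transitively, a kind of ancestor_id."""
    return ancestor_id in ancestors(term_id)


print(is_a("EX:0003", "EX:0001"))  # a house fly is an organism
print(is_a("EX:0004", "EX:0001"))  # a fishing fly is not
```

Both kinds of "fly" share a word but have distinct identifiers, so data annotated with `EX:0003` cannot be confused with data annotated with `EX:0004`.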
Ontologies for Life science
• Emergence has occurred for two reasons
  - Consistent annotation of data
  - To add meaning and understanding that can be interpreted computationally
• Bio-ontologies registered on the OBO Foundry
Application of Significant Properties in Proteomics
Minimum Information about a Proteomics Experiment (MIAPE)
• Sufficiency.
  - The MIAPE guidelines should require sufficient information about a dataset and its experimental context to allow a reader to understand and critically evaluate the interpretation and conclusions, and to support their experimental corroboration.
• Practicability.
  - Achieving compliance with MIAPE should not be so burdensome as to prohibit its widespread use.
Minimum reporting guidelines
• Describe content
• Implementation independent
• Impacts
  - Publication
  - Syntax
  - Semantics
Syntax for proteomics
• The content in MIAPE GE needs to be structured to facilitate
  - dissemination
  - transfer
  - storage
• A community development process to agree on a syntax
  - building upon the FuGE data model
  - A pre-existing community developed representation of scientific experiments
  - Interoperable
FuGE
• Model of common components in science investigations, such as materials, data, protocols, equipment and software.
• Provides a framework for capturing complete laboratory workflows, enabling the integration of pre-existing data formats.
UML/XML/RDBMS
• UML gives structure (but not syntax)
  - Very abstract, very general
• XML provides a concrete syntax
  - Meta language is interoperable, checkable, viable and has basic metadata support (language, character encoding and so on).
  - Tends toward the verbose. Not (very) searchable in itself.
  - Therefore, a transfer and archive format.
• RDBMS
  - SQL is (sort of) a standard
  - Highly computationally amenable form; very good for searching
  - Conversion from XML is possible, but in a number of ways.
  - Hard work – nice to have an off-the-shelf implementation.
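To make the division of labour above concrete, the sketch below treats XML as the transfer format and loads it into a relational table for searching, using only Python's standard library. The `experiment` and `sample` element names, their attributes and the table layout are hypothetical; they are not taken from FuGE or GelML, and this is only one of the many possible XML-to-table mappings.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical transfer document; element and attribute names are invented.
XML_TEXT = """
<experiment id="exp1">
  <sample id="s1" organism="Mus musculus"/>
  <sample id="s2" organism="Homo sapiens"/>
  <sample id="s3" organism="Mus musculus"/>
</experiment>
"""


def load_samples(xml_text, conn):
    """Parse the XML and insert one row per <sample> element."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sample ("
        "id TEXT PRIMARY KEY, experiment_id TEXT, organism TEXT)"
    )
    rows = [
        (sample.get("id"), root.get("id"), sample.get("organism"))
        for sample in root.iter("sample")
    ]
    conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_samples(XML_TEXT, conn)

# Once in the database, searching is a single SQL query.
mouse_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM sample WHERE organism = ? ORDER BY id",
        ("Mus musculus",),
    )
]
print(loaded, mouse_ids)
```

The mapping chosen here (one table, attributes as columns) is a design decision rather than a given, which is the point made above about conversion being possible "in a number of ways".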
Curation of Gel experiments
[Diagram labels: MIAPE GE, MIAPE GI, GelML, sepCV; Laboratory, Data entry and transfer, Public repositories]
I) GelML data entry tools
II) Direct database submission
III) Automated export of GelInfoML
Discoverability and reuse
• Persistent Identifiers
• Rights management
Persistent Identifiers
• A name for a resource which will remain the same regardless of where the resource is located
• In biology, typically assigned to data upon publication
• The type of identifier depends on the publication method
• Description and Representation Information provides more information about persistent identifiers
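The key property above, a name that stays the same wherever the resource lives, can be sketched as a lookup table sitting between the identifier and the current location. The `Resolver` class, the `example:dataset/42` identifier and the example.org URLs are hypothetical; real schemes rely on a maintained resolution service rather than an in-memory dictionary.

```python
class Resolver:
    """Toy resolver: maps a persistent identifier to its current location."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Assign an identifier once, for example when data are published."""
        if identifier in self._locations:
            raise ValueError(f"{identifier} is already assigned")
        self._locations[identifier] = url

    def relocate(self, identifier, new_url):
        """Record that the resource moved; the identifier does not change."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier):
        """Return the current location for an identifier."""
        return self._locations[identifier]


resolver = Resolver()
resolver.register("example:dataset/42", "http://old.example.org/data/42.xml")
before = resolver.resolve("example:dataset/42")

# The repository reorganises its storage; citations of the identifier still work.
resolver.relocate("example:dataset/42", "http://archive.example.org/42")
after = resolver.resolve("example:dataset/42")
print(before, after)
```

Anyone citing `example:dataset/42` is unaffected by the move, whereas anyone who cited the original URL directly would now hold a broken link.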
Rights management
• Difficult to determine
• Lots of legal issues
• In biology/bioinformatics tends to be open access
• Creative Commons
Receiving data for curation
• Content
• Syntax
• Semantics
Route map
Who will receive it?
What are their policies on Content, Syntax and Semantics?
Plan your experiment to conform to Content, Syntax and Semantics
Implement the experiment to:
• Collect appropriate Content
• Structure in appropriate Syntax
• Ensure Semantics are preserved
Curate…
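The route map above amounts to checking a planned submission against the recipient's policy on content, syntax and semantics before any data are produced. A minimal sketch of such a check follows; the `Policy` fields, the `check_submission` helper and all example values are hypothetical and do not represent any particular repository's requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """A recipient's (hypothetical) requirements for incoming data."""
    required_fields: frozenset   # content: what must be reported
    accepted_formats: frozenset  # syntax: which file formats are taken
    allowed_terms: frozenset     # semantics: which vocabulary terms are known


def check_submission(metadata, file_format, annotations, policy):
    """Return a list of problems; an empty list means the plan conforms."""
    problems = []
    missing = sorted(policy.required_fields - set(metadata))
    if missing:
        problems.append(f"content: missing fields {missing}")
    if file_format not in policy.accepted_formats:
        problems.append(f"syntax: format {file_format!r} not accepted")
    unknown = sorted(set(annotations) - policy.allowed_terms)
    if unknown:
        problems.append(f"semantics: unknown terms {unknown}")
    return problems


policy = Policy(
    required_fields=frozenset({"title", "creator", "date"}),
    accepted_formats=frozenset({"xml"}),
    allowed_terms=frozenset({"EX:0001", "EX:0002"}),
)

problems = check_submission(
    metadata={"title": "Gel run 7", "creator": "A. Researcher"},
    file_format="xls",
    annotations=["EX:0001", "EX:9999"],
    policy=policy,
)
for problem in problems:
    print(problem)
```

Running the check at the planning stage, rather than at submission, is what lets the experiment be implemented to collect the right content in the right syntax from the start.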
Meta Route Map
• How to build the map if you don’t have one yet.
Appraise and Select
• Investigates the evaluation and selection of data for long-term curation and preservation
Acknowledgments
• The CARMEN project
  - www.carmen.org.uk
• The Proteomics Standards Initiative (PSI)
  - http://psidev.info
• Colleagues at Newcastle University
  - Phillip Lord, Anil Wipat, Allyson Lister