Upload
john-jones
View
216
Download
0
Embed Size (px)
Citation preview
Principles and Practice in (Encouraging) the Sharing of
Public Research Data
Chris Taylor, The MIBBI Project [email protected]
Project website: http://mibbi.org/
Well-oiled cogs meshing perfectly (would be nice)
How well are things working?—Cue the Tower of Babel analogy…—Situation is improving with respect to standards—But few tools, fewer carrots (though some
sticks)
Why do we care about that..?—Data exchange—Comprehensibility (/quality) of work—Scope for reuse (parallel or orthogonal)
“Publicly-funded research data are a public good, produced in the public interest”
“Publicly-funded research data should be openly available to the maximum extent possible.”
So what (/why) is a standards body again..?
Consider the three main ‘omics standards bodies’
— PSI (proteomics), GSC (genomics), MGED (transcriptomics)
— What defines such (candidate) standards-generating bodies?
— “A beer and an airline” (Zappa)— Requirements, formats, vocabulary— Regular full-blown open attendance meetings, lists,
etc.
Hugely dependent on their respective communities
— Requirements (What are we doing and why are we doing it?)
— Development (By the people, for the people. Mostly.)— Testing (No it isn’t finished, but yes I’d like you to use
it…)— Uptake, by all of the various kinds of stakeholder:
— Publishers, funders, vendors, tool/database developers
— The user community (capture, store, search, analyse)
MIAPE (http://psidev.info/miape)• Minimum Information About a Proteomics Experiment• Published in Nature Biotechnology (August 2007)• Recommendation for journals, repositories, funders and
others• Technology-specific modules associated with a parent
document— Users assemble modules into a bespoke reporting
requirement— Molecular interactions (MIMIx) published in NBT in 2007— Mass spec, MS informatics, gels published in NBT in
2008— Gel informatics, columns, CE published in NBT this year
Example: Minimum reporting requirements
ProteoRED’s MIAPE satisfaction survey
Spanish multi-site collaboration: provision of proteomics services MIAPE customer satisfaction survey (compiled November 2008)
— http://www.proteored.org/MIAPE_Survey_Results_Nov08.html— Responses from 31 proteomics experts representing 17 labs
Yes: 95%No: 5%
Ingredients for MI pie
Domain specialists & IT types (initial drafts, evolution)Journals
—The real issue for any MI project is getting enough people to comment on what you have (distinguishes a toy project from something to be taken seriously — community buy-in)
—Having journals help in garnering reviews is great (editorials, web site links, mail shots even). Their motive of course being that fuller reporting = better content = higher citation index.
Funders—MI projects can claim to be slightly outside of 'normal' science;
may form funding policy components (arguments about maximum value)
—Funders therefore have a motive (similar to journals) to ensure that MI guidelines, which they may endorse down the line, are representative and mature
—They can help by allocating slots at (appropriate) meetings of their award holders for you to show your stuff. Things like that.
Vendors—The cost of MIs in person-hours will be the major objection—Vendors can implement parameter export to an appropriate file
format, ideally using some helpful CV (somebody else's problems)
—Vendors also have engineers (and some sales staff) who really know their kit and make for great contributors/reviewers
—For some standards bodies (like PSI, MGED) their sponsorship has been very helpful too (believe it or not, it would seem possible to monetise a standards body)
Food / pharma—Already used to better, if rarely perfect data capture and
management; for example, 21 CFR Part 11 (MI = exec summary…)
Trainers—There is a small army of individuals training scientists,
especially in relation to IT (EBI does a lot of this but I mean commercial training providers) ‘Resource packs’
Ingredients for MI pie
Technologically-delineated views of the world A: transcriptomics B: proteomics C: metabolomics …and…
Biologically-delineated views of the world A: plant biology B: epidemiology C: microbiology …and…
Generic features (‘common core’) — Description of source biomaterial — Experimental design components
Arrays
Scanning Arrays &Scanning
Columns
GelsMS MS
FTIR
NMR
Columns
Modelling the biosciences
Modelling the biosciences (slightly differently)
Assay: Omics and miscellaneous techniques
Investigation:
Medical syndrome, environmental effect, etc.Study: Toxicology, environmental science, etc.
Investigation / Study / Assay (ISA) Infrastructurehttp://isatab.sourceforge.net/
Ontology of Biomedical Investigations (OBI)http://obi.sourceforge.net/
Functional Genomics Experiment (FuGE)http://fuge.sourceforge.net/
Rise of the Metaprojects
Reporting guidelines — a case in point
MIAME, MIAPE, MIAPA, MIACA, MIARE, MIFACE, MISFISHIE, MIGS, MIMIx, MIQAS, MIRIAM, (MIAFGE, MIAO), My Goodness…
‘MI’ checklists usually developed independently, by groups working within particular biological or technological domains
— Difficult to obtain an overview of the full range of checklists
— Tracking the evolution of single checklists is non-trivial— Checklists are inevitably partially redundant one against
another— Where they overlap arbitrary decisions on wording and
sub structuring make integration difficult
Significant difficulties for those who routinely combine information from multiple biological domains and technology platforms
— Example: An investigation looking at the impact of toxins on a sentinel species using proteomics (‘eco-toxico-proteomics’)
— What reporting standard(s) should they be using?
The MIBBI Project (mibbi.org)
International collaboration between communities developing ‘Minimum Information’ (MI) checklists
Two distinct goals (Portal and Foundry)—Raise awareness of various minimum reporting
specifications—Promote gradual integration of checklists
Lots of enthusiasm (drafters, users, funders, journals)
32 projects committed (to the portal) to date, including:—MIGS, MINSEQE & MINIMESS (genomics, sequencing) —MIAME (μarrays), MIAPE (proteomics), CIMR
(metabolomics)—MIGen & MIQAS (genotyping), MIARE (RNAi), MISFISHIE
(in situ)
The MIBBI Project (mibbi.org)
[†] Denotes that a specification is provided as a suite of related documents
CONCEPT SPECIALISATION ● C
IMR [†]
● M
IACA
● M
IAM
E
● M
IAM
E/E
nv
● M
IAM
E/N
utr
● M
IAM
E/P
lant
● M
IAM
E/T
ox
● M
IAPA
● M
IAPE [†]
● M
IARE
● M
IFlo
wCyt
● M
IGen
● M
IGS/M
IMS
● M
IMIx
● M
IMPP
● M
INI
study inputs study design ●generic organism ●
cells / microbes
plant
animal
mouse
human
population
environmental sample
environment / habitat
in silico model
study procedures organism maintenance
animal husbandry
cell / microbe culture
plant cultivation
acclimation
preconditioning / pretreatment ●organism manipulation
assay inputs generic study input
organism part ●organism state
organism trait
biomolecule
synthetic analyte ●silencing RNA reagent
Version 0.7 (2008-04-10)
Comparison of MIBBI-registered projects [21] ● Release
Granularity Coarse Medium Fine
Maturity ● Planned ● Drafting
The MIBBI Project (mibbi.org)
Interaction graph for projects (line thickness & colour saturation show similarity)
MIBBI and other standardisation efforts
Ontology support: MIBBI module schema allows for specified ontology references
Any number of terms (leaf or node) can be ‘attached’ to an element
— We expect software to offer the specified choices to users
Format support: MIBBI has no specific implementation for data exchange
formats BUT: we can achieve the same end by supporting tools
— Currently implementing ISAcreator configuration file generation
— Will allow capture of MIBBI Foundry-specified content in ISA-Tab
— Also an example of software implementing our ontology links
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project
24
Example of guiding the experimentalist to search and select a term from the EnvO ontology, to describe the habitat of a sample
Ontologies, accessed in real time via the Ontology Lookup Service and
BioPortal
The objections to fuller reporting
Why should I dedicate resources to providing data to others?
—Pro bono arguments have no impact (altruism is a myth)—‘Sticks’ from funders and publishers get the bare minimum—No traceability in most contexts (intellectual property = ?)
This is just a ‘make work’ scheme for bioinformaticians—Bioinformaticians get a buzz out of having big databases—Parasites benefitting from others’ work ( mutualism..?)
I don’t trust anyone else’s data — I’d rather repeat work—Problems of quality, which are justified to an extent—But what of people lacking resources or specific expertise?
How on earth am I supposed to do this anyway..?—Perception that there is no money to pay for this—No mature free tools — Excel sheets are no good for HT—Worries about vendor support, legacy systems (business
models)
1. Increased efficiency
Methods remain properly associated with the results generated
—Data sets generated by specific techniques or using particular materials can be reliably identified and retrieved from repositories (or excluded from results sets)
No need to repeatedly construct sets of contextualizing information
—Facilitates the sharing of data with collaborators—Avoids the risk of loss of information through staff turnover—Enables time-efficient handover of projects
For industry specifically (in the light of 21 CFR Part 11)—The relevance of data can be assessed through summaries
without wasting time wading through full data setsin diverse proprietary formats (‘business intelligence’)
—Public data can be leveraged as ‘commercial intelligence’
2. Enhanced confidence in data
Enables fully-informed assessment of results (methods used etc.)
Supports the assessment of results that may have been generated months or even years ago (e.g. for referees or regulators)
Facilitates better-informed comparisons of data sets— Increased likelihood of discovering the factors (controlled
and uncontrolled) that might differentiate those data sets
Supports the discovery of sources of systematic or random error by correlating errors with metadata features such as the date or the operator concerned
Requires sufficient information to support the design of appropriate parallel or orthogonal studies to confirm or refute a given result
3. Added value, tool development
Re-using existing data sets for a purpose significantly different to that for which the data were generated
Building aggregate data sets containing (similar) data from different sources (including standards-compliant public repositories)
Integrating data from different domains —For example, correlating changes in mRNA abundances,
protein turnover and metabolic fluxes in response to a stimulus
Design requirements become both explicit and stable— MIAPE modules as driving use cases (tools, formats, CV, DBs)— Promotes the development of sophisticated analysis algorithms— Presentation of information can be ‘tuned’ appropriately— Makes for a more uniform experience
Credit where credit’s due
Data sharing is more or less a given now, and tools are emerging—Lots of sticks, but they only get the bare minimum—How to get the best out of data generators?—Need standards- and user-friendly tools, and meaningful
credit
Central registries of data sets that can record reuse—Well-presented, detailed papers get cited more frequently—The same principle should apply to data sets—ISNIs for people, DOIs for data: http://www.datacite.org/
Side-benefits, challenges—Would also clear up problems around paper authorship—Would enable other kinds of credit (training, curation, etc.)—Community policing — researchers ‘own’ their credit
portfolio (enforcement body useful, more likely through review)
—Problem of ‘micro data sets’ and legacy data