RightField: Semantic Enrichment of Systems Biology Data using Spreadsheets
Katy Wolstencroft
myGrid, SysMO-DB
University of Manchester
Outline
What RightField does
Origins: the SysMO-DB project and data sharing in Systems Biology
How RightField works
Evaluation: how successfully it works
Extensions and future directions
RightField
A tool for embedding ranges of ontology terms into spreadsheets to allow the users of those spreadsheets to semantically annotate their data from simple drop-down lists
A tool for automatically extracting semantically annotated metadata from spreadsheets and producing RDF
RightField
Annotation benefits:
  Makes annotation quicker and more efficient
  Standardises annotation
  Hides the ontology complexity from the users
RDF production benefits:
  Querying over heterogeneous data files
  Semantic searching and reasoning
  Standard format for interoperability
  Hides semantic web tools from end users
Spreadsheets and web browsers
SEEK: Systems Biology Data Sharing
The SEEK
Systems Biology of MicroOrganisms
Pan-European
>100 research groups
>320 scientists
Distributed, interdisciplinary projects
Expected to pool data and results and disseminate
Microbiologists, molecular biologists, biochemists, mathematicians....not many informaticians
SysMO Consortium
A platform for Systems Biology data and models sharing
Web based environment for sharing within a consortium and disseminating to the community (an eLaboratory)
Standards compliant
Fitting in with laboratory practices
~1900 assets:
  People: 350
  Investigations: 35
  Studies: 87
  Assays: 167
  Data sets: 930
  Models: 60
  SOPs: 140
  Publications: 165
Types of data
Multiple omics: genomics, transcriptomics, proteomics, metabolomics, fluxomics, reactomics
Images
Molecular biology
Reaction kinetics
Models: metabolic, gene network, kinetic
Relationships between data sets/experiments
  Procedures, experiments, data, results and models
Analysis of data
Minimum Information Model
What is the least amount of information required to:
  Find
  Interpret
  Understand
  Reuse
Different for different data sets
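A minimal sketch of what checking a data set against a minimum information checklist could look like. The checklist names and field names below are hypothetical and are not taken from MIAME, MIAPE or any other published model; the point is only that each data type has its own set of required elements.

```python
# Hypothetical checklists: the field names are illustrative only and are not
# taken from MIAME, MIAPE or any other published minimum information model.
CHECKLISTS = {
    "microarray": {"organism", "array_design", "sample_treatment"},
    "growth_curve": {"organism", "medium", "time_unit"},
}


def missing_fields(data_type, metadata):
    """Return the checklist fields that are absent or empty in metadata."""
    required = CHECKLISTS[data_type]
    present = {key for key, value in metadata.items() if value not in (None, "")}
    return sorted(required - present)


# Example record with one empty field (values are made up for illustration).
record = {"organism": "Bacillus subtilis", "medium": "", "time_unit": "h"}
print(missing_fields("growth_curve", record))  # ['medium']
```

Keeping one checklist per data type reflects the point above that the minimum differs between data sets.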
CIMR: Core Information for Metabolomics Reporting
MIABE: Minimal Information About a Bioactive Entity
MIACA: Minimal Information About a Cellular Assay
MIAME: Minimum Information About a Microarray Experiment
MIAME/Env: MIAME / Environmental transcriptomic experiment
MIAME/Nutr: MIAME / Nutrigenomics
MIAME/Plant: MIAME / Plant transcriptomics
MIAME/Tox: MIAME / Toxicogenomics
MIAPA: Minimum Information About a Phylogenetic Analysis
MIAPAR: Minimum Information About a Protein Affinity Reagent
MIAPE: Minimum Information About a Proteomics Experiment
MIARE: Minimum Information About a RNAi Experiment
MIASE: Minimum Information About a Simulation Experiment
Not quite available “off the shelf”
Loose guidelines or checklists
Specific formats (generally in XML)
Specific formats with associated ontologies
Remaining questions for the scientists:
  How do we generate standards compliant data?
  Which vocabularies/ontologies should I use?
  How do I know which ontology terms to use where?
Data | MIBBI Model | Ontologies
Microarray | MIAME: Minimum Information about a Microarray Experiment | MGED
Proteomics | MIAPE: Minimum Information about a Proteomics Experiment | PSI-MI, PSI-MS, PSI-MOD
Interaction experiments | MIMIX: Minimum Information about a Molecular Interaction Experiment | PSI-MI Protein-Protein Interaction
Systems Biology Models | MIRIAM: Minimal Information Required In the Annotation of biochemical Models | SBO: Systems Biology Ontology
Systems Biology Model Simulation | MIASE: Minimum Information About a Simulation Experiment | KISAO: Kinetic Simulation Algorithm Ontology
[Figure: Data Templates and Vocabularies. Labels: SOP; Construction, Validation; Metabolomics, Mass Spec, Transcriptomics, Proteomics, Fluxomics; Investigations, Studies, Assays]
Fitting in with Laboratory practices
Scientists can continue to do what they have always done
Scientists remain in control
Embedding semantics into the tools already in use
Excel, Excel, Excel...
RightField Architecture
Java: platform independent
OWL API: loading ontologies and reasoning
Apache POI HSSF libraries: loading and saving of Excel spreadsheets
Excel workbook
Ontology: “portion” of ontology terms
Terms embedded into Excel workbook
RightField Client
How RightField Works – part 1
Marked-up workbook: saved in plain Excel
Informaticians/ontologists
End users
Loading Ontologies from BioPortal
Published ontologies
Multiple versions
You can also load local ontologies from file or URL
JERM = “Just Enough Results Model”
What type of data is it: microarray, growth curve, enzyme activity…
What was measured: gene expression, OD, metabolite concentration…
What do the values in the datasets mean: units, time series, repeats…
Selected parent term from the ontology
Methods for specifying ontology terms
Term lists for selected cells
Value Type and Property
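The idea of selecting a parent term and offering the terms beneath it as the permitted values for a range of cells can be sketched as follows. This is an illustrative model in plain Python, not RightField's implementation: the hierarchy, the term labels and the cell range are all hypothetical, and a real workbook would hold the list as an Excel drop-down rather than a Python dictionary.

```python
# Hypothetical mini-hierarchy: parent term -> direct subclasses.
# The labels are illustrative and are not copied from JERM or any real ontology.
SUBCLASSES = {
    "assay type": ["omics assay", "kinetics assay"],
    "omics assay": ["transcriptomics", "proteomics", "metabolomics"],
    "kinetics assay": ["enzyme activity"],
}


def descendants(parent):
    """Collect every term below `parent`, depth first, in a stable order."""
    found = []
    for child in SUBCLASSES.get(parent, []):
        found.append(child)
        found.extend(descendants(child))
    return found


# A cell range mapped to the terms its drop-down list would offer.
allowed = {"B2:B20": descendants("omics assay")}


def is_valid(cell_range, value):
    """True when `value` is one of the terms offered for `cell_range`."""
    return value in allowed.get(cell_range, [])


print(allowed["B2:B20"])
print(is_valid("B2:B20", "proteomics"), is_valid("B2:B20", "protein stuff"))
```

Restricting a range to the descendants of one parent is what lets the end user pick from a short, relevant list without seeing the rest of the ontology.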
Provenance and Identifiers
Term Label: the human-readable term label
Term IRI: the (unique) term identifier
Ontology IRI: the ontology that defines the term
Ontology Version: the version of the ontology
Physical Location: the (web) location of the ontology
Ontology Information
Ontologies encapsulated
  Scientists can work offline
  Ensures same versions of ontologies used for a series of experiments
  No special macros or plugins required, just Excel or Open Office
Versions and URIs captured in hidden worksheets
  Provenance
  Comparisons between sheets
  Linking back to the vocabularies
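The provenance fields listed above (term label, term IRI, ontology IRI, ontology version, physical location) can be pictured as one record per annotated term. The sketch below is an assumption about how such a record might be modelled; the slides do not describe the actual layout of the hidden worksheets, and every value shown is a placeholder on example.org.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TermProvenance:
    term_label: str         # human-readable label shown to the user
    term_iri: str           # unique identifier of the term
    ontology_iri: str       # ontology that defines the term
    ontology_version: str   # version of that ontology
    physical_location: str  # (web) location the ontology was loaded from


# All values below are placeholders on example.org, not real identifiers.
example = TermProvenance(
    term_label="transcriptomics",
    term_iri="http://example.org/onto#Transcriptomics",
    ontology_iri="http://example.org/onto",
    ontology_version="1.0",
    physical_location="http://example.org/downloads/onto.owl",
)

# One row per annotated term, as it might be written to a hidden sheet
# (hypothetical layout).
hidden_sheet_row = list(asdict(example).values())
print(hidden_sheet_row)
```

Storing the version and location alongside each term is what makes later comparison between sheets, and linking back to the vocabulary, possible without network access.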
Store / Reuse
RDF Graph
Populate
Extract
Metadata Extraction and Querying
Generates RDF triples for each marked-up cell
Simple RDF, or conforming to ontology models
Storage and querying solutions
  Virtuoso triple store
  Linked data compliance
HTML and XML interfaces and a REST API are already available
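Generating one triple per marked-up cell can be sketched as below, writing N-Triples by hand with no RDF library. The choice of the data set as the subject, and all IRIs and property names, are assumptions made for illustration; the slides do not specify the triple pattern RightField emits.

```python
def to_ntriples(dataset_iri, annotated_cells):
    """Emit one N-Triples line per marked-up cell.

    annotated_cells maps a cell address to (property IRI, term IRI).
    Using the data set as the subject is an assumption of this sketch.
    """
    lines = []
    for address, (property_iri, term_iri) in sorted(annotated_cells.items()):
        lines.append(f"<{dataset_iri}> <{property_iri}> <{term_iri}> .")
    return lines


# Hypothetical annotations: every IRI here is a placeholder on example.org.
cells = {
    "B2": ("http://example.org/prop#assayType",
           "http://example.org/onto#Transcriptomics"),
    "B3": ("http://example.org/prop#organism",
           "http://example.org/onto#BacillusSubtilis"),
}

for line in to_ntriples("http://example.org/data/sheet1", cells):
    print(line)
```

Lines in this form could then be loaded into a triple store for querying.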
RightField Annotation Evaluation
Does RightField improve the quantity and consistency of data annotation?
Improvements in annotation consistency:
  Assay type
  Technology type
  Experimental conditions
  Factors studied
  Organism and strains
RightField Annotation Evaluation
JERM Metadata Element Scores
Dataset ID | RightField Template | Pre-RightField Template
598 | 616 | 244
599 | 319 | 402
72 | 119 | 85
868 | 203 | 62
69 | 127 | 88
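The scores in the table can be compared per data set with a few lines of arithmetic. This sketch only subtracts the two columns as given; how the JERM metadata element scores were calculated is not described in the slides.

```python
# (dataset ID, score with RightField template, score with pre-RightField template)
scores = [
    (598, 616, 244),
    (599, 319, 402),
    (72, 119, 85),
    (868, 203, 62),
    (69, 127, 88),
]

# Difference per data set: positive means the RightField template scored higher.
for dataset_id, with_rf, before_rf in scores:
    print(dataset_id, with_rf - before_rf)

higher = sum(1 for _, with_rf, before_rf in scores if with_rf > before_rf)
print(f"{higher} of {len(scores)} datasets score higher with the RightField template")
```

On these figures, four of the five data sets score higher with the RightField template, and data set 599 scores lower.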
Metadata Extraction and Querying
No current ‘standard’ RDF format for MIBBI models (although it is in progress)
RDF vs ‘traditional’ relational approaches
  RDF more flexible in dealing with optional and changing metadata elements
  RDF allows aggregation between different types of experimental data
    E.g. biological samples, experimental conditions
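A small in-memory illustration of the two points above: triples accommodate optional or newly added metadata elements without a schema change, and data sets of different types can be aggregated through a shared value such as a biological sample. The data set names, properties and values are hypothetical, and a plain Python list stands in for a triple store.

```python
# Each triple is (subject, property, value). All names are hypothetical.
triples = [
    ("dataset1", "type", "microarray"),
    ("dataset1", "sample", "sampleA"),
    ("dataset1", "condition", "glucose limitation"),
    ("dataset2", "type", "metabolomics"),
    ("dataset2", "sample", "sampleA"),
    # dataset2 has no "condition" and adds a property dataset1 lacks;
    # no schema change is needed to store it.
    ("dataset2", "instrument", "mass spectrometer"),
]


def subjects_with(prop, value):
    """Return subjects that have the given property/value pair."""
    return sorted({s for s, p, v in triples if p == prop and v == value})


def properties_of(subject):
    """Return all property/value pairs recorded for one subject."""
    return {p: v for s, p, v in triples if s == subject}


# Aggregate different experiment types that share a biological sample.
for dataset in subjects_with("sample", "sampleA"):
    print(dataset, properties_of(dataset))
```

In a relational design, the extra `instrument` element would typically need a new column or table, whereas here it is just one more triple.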
Future Work
Visualising nodes with large numbers of terms
Ontology label ambiguities
Linked Data output for SysMO SEEK and related resources
Other Work Using RightField
KupKB – Kidney and Urinary Pathway knowledge base (http://www.kupkb.org)
Knowledge bases for inflammatory bowel disease and Chagas disease
BioBanking sample annotation
Annotation of historical samples
  ‘Patient records’ for Egyptian mummies, Manchester Museum
RightField Extension: Populous
Generic tool for populating ontology templates
Supports validation at the point of data entry
Expressive pattern language for OWL ontology generation
Helps biologists with ontology design patterns
http://www.e-lico.eu/populous
Simon Jupp, Robert Stevens, University of Manchester
Summary
RightField-enabled spreadsheets show a marked increase in the consistency of annotation when compared with free text annotation or other template approaches.
Success from embedding and hiding semantics and complexity