Presentation at NETTAB 2013, Venice Lido, Italy. October 2013. http://nettab.org/2013/progr.php
Bio-GraphIIn: a graph-based, integrative and semantically enabled
repository for life science experimental data
Alejandra González-Beltrán, PhD
Oxford e-Research Centre, University of Oxford
[email protected] @alegonbel
NETTAB 2013 October 16-18, 2013 Venice Lido, Italy
Experimental workflow
Figure: the data scientist's experimental workflow, with stages Planning, Perform new experiment / Use existing data, Data Collection, Data Management, Analysis, Visualization and Publication. Successive builds of the slide add metadata + data to the workflow and highlight Science Reproducibility and Data Reusability.
Outline
• Motivation for an integrative and semantically-enabled metadata repository in life sciences
  • retrospective data submissions
  • heterogeneous experimental data
  • fragmentation of formats and databases
  • semantic queries leading to integrative analysis
• Context: the ISA infrastructure
• Bio-GraphIIn requirements
• Bio-GraphIIn design & architecture
• Bio-GraphIIn graph queries
• Bio-GraphIIn prototype
• Summary and future work
Motivation 1/4: retrospective data submissions
Figure: the experimental workflow with retrospective metadata.
Metadata edits to repositories are not straightforward, often requiring deleting the submission and re-submitting the data.
Figure: the experimental workflow with prospective metadata attached throughout.
Support incremental data deposition + metadata edits
Motivation 2/4: heterogeneous experimental data (workflow stage: Data Collection)
Motivation 3/4: fragmentation of formats and databases (workflow stage: Publication)
Motivation 4/4: semantic queries leading to integrative analysis (workflow stages: Analysis, Visualization)
• support for a rich and uniform query interface across studies, enabling integrative data analysis to provide new insights at the systems biology level
  • e.g. find all data files associated with samples from a particular organism (e.g. Homo sapiens) and a particular tissue type (e.g. liver)
• allow users to select a set of samples/data files through browsing and semantic filtering
• provide links to analysis and visualisation platforms
Figure: life science experiments repo linked to the Analysis and Visualization stages.
The Investigation/Study/Assay (ISA) infrastructure
generic format for experimental description and data exchange
open source software tools
community engagement
Figure: ISA hierarchy of investigation, study and assay(s); each assay points to data, held as external files in native or other formats, through pointers to data file names/location.
investigation: high level concept to link related studies
study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.
assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)
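As a rough illustration of the hierarchy described above, the following Python sketch models an investigation that links studies, each with its assays and pointers to data files. The class names, field names and example values (study identifier, measurement, technology) are assumptions made for illustration; they are not the ISA-TAB specification or the API of the ISA tools.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assay:
    """A test performed on material taken from the subject, or on the whole subject."""
    measurement: str  # hypothetical field: what is measured
    technology: str   # hypothetical field: how it is measured
    data_files: List[str] = field(default_factory=list)  # pointers to external files


@dataclass
class Study:
    """The central unit: the subject under study, its characteristics and its assays."""
    identifier: str
    characteristics: Dict[str, str] = field(default_factory=dict)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High level concept that links related studies."""
    identifier: str
    studies: List[Study] = field(default_factory=list)

    def data_files(self) -> List[str]:
        """Collect every data file pointer across all studies and assays."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


# Illustrative values only; the file names reuse those shown later in the slides.
assay = Assay("transcription profiling", "DNA microarray",
              ["h1-s1.cel", "h1-s2.cel", "h2-s1.cel"])
study = Study("S1", {"organism": "Homo sapiens"}, [assay])
investigation = Investigation("I1", [study])
print(investigation.data_files())
```

Keeping only file pointers on the assay mirrors the slide's point that the data themselves stay in external files, in native or other formats.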
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• communities
working to build a library of cellular signatures
Experimental workflow - graph representation
Spreadsheets for end-users: vocabulary for the description of the experimental workflow
Syntactic interoperability across biological experiments of different types
Figure: an ISA-TAB spreadsheet and the equivalent workflow graph. Sources H1 (H. sapiens, 35 years) and H2 (H. sapiens, 33 years) yield samples H1.sample1, H1.sample2 and H2.sample1; the samples go through Labeling (e.g. H1.sample1.labeled, H2.sample1.labeled) and Scanning, producing the data files h1-s1.cel, h1-s2.cel and h2-s1.cel.
Semantic interoperability across biological experiments of different types
Machine-readable representation: Graph + Semantics
Figure: the H1 branch of the workflow graph annotated with ontology terms.
H1: obi:material entity, tax:homosapiens
H1.sample1, H1.sample2: obi:material sample, bfo:derives_from H1
labeling1, labeling2: obi:material processing; isa:executes the labeling protocol (obi:protocol)
H1.sample1.labeled, H1.sample2.labeled: obi:processed material
scanning1, scanning2: obi:planned process
h1-s1.cel, h1-s2.cel: isa:raw data file
Each material or file is linked to the process that consumes it by obi:is_specified_input_of and to the process that produces it by obi:is_specified_output_of, e.g. H1.sample1 obi:is_specified_input_of labeling1, H1.sample1.labeled obi:is_specified_output_of labeling1, and h1-s1.cel obi:is_specified_output_of scanning1.
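The graph on this slide can be sketched as a list of subject-predicate-object triples and traversed to find the raw data files that descend from a given source. This is a minimal sketch, assuming the edges read off the figure; the function name `raw_data_files` and the plain-tuple encoding are hypothetical and are not how ISA2OWL or Bio-GraphIIn store the graph.

```python
from collections import defaultdict, deque

# Edges read off the figure (H1 branch only), as (subject, predicate, object).
TRIPLES = [
    ("H1.sample1", "bfo:derives_from", "H1"),
    ("H1.sample2", "bfo:derives_from", "H1"),
    ("H1.sample1", "obi:is_specified_input_of", "labeling1"),
    ("H1.sample1.labeled", "obi:is_specified_output_of", "labeling1"),
    ("H1.sample1.labeled", "obi:is_specified_input_of", "scanning1"),
    ("h1-s1.cel", "obi:is_specified_output_of", "scanning1"),
    ("H1.sample2", "obi:is_specified_input_of", "labeling2"),
    ("H1.sample2.labeled", "obi:is_specified_output_of", "labeling2"),
    ("H1.sample2.labeled", "obi:is_specified_input_of", "scanning2"),
    ("h1-s2.cel", "obi:is_specified_output_of", "scanning2"),
]
TYPES = {"h1-s1.cel": "isa:raw data file", "h1-s2.cel": "isa:raw data file"}


def raw_data_files(start):
    """Follow the workflow downstream from `start` and return the raw data files reached."""
    downstream = defaultdict(list)
    for subject, predicate, obj in TRIPLES:
        if predicate == "obi:is_specified_input_of":
            downstream[subject].append(obj)   # material -> process that consumes it
        else:
            downstream[obj].append(subject)   # source -> derived sample, process -> output
    found, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        if TYPES.get(node) == "isa:raw data file":
            found.append(node)
        for nxt in downstream[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(found)


print(raw_data_files("H1"))
print(raw_data_files("H1.sample2"))
```

The same traversal is what a semantic query expresses declaratively: starting from a source, walk the derivation and input/output relations until data files are reached.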
Architecture
Figure: components of the conversion pipeline: ISA-TAB parser, isa2owl mapping parser, graph analysis and configuration file, producing Resource Description Framework (RDF) output.
mappings between the ISA-TAB syntax and ontologies
ISA-OBI mapping: Ontology for Biomedical Investigations
Bio-GraphIIn Requirements
Bio-GraphIIn (pronounced “bio-graphene”) stands for Biological Graph Investigation Index
BioInvestigation Index (BII)
• support prospective annotation of experiments
• support Create Read Update Delete (CRUD) operations
• manage heterogeneous biological and biomedical metadata
  • relying on ISA-TAB
• support data integration & semantic queries
  • relying on ISA2OWL
• links to analysis and visualisation platforms
• take advantage of experimental design information to improve metadata, e.g. by including study groups
Repository | Data types | Format | Browsing/Searching | Programmatic submission | Programmatic access | CRUD operations | Community curation | RDF
BioSample DB | sample info | Sample-TAB | browse/search | X (email submission) | REST API | X | X | YES
ArrayExpress/GEO | sequencing | MAGE-TAB | browse/filter/search/advanced search | MAGE-TAB spreadsheet/MIAMExpress | REST API | X | X | X*
SRA/ENA | next generation sequencing | SRA-XML | browse/text/sequence/advanced search | Webin, REST | REST API | X | X | X
PRIDE | mass spectrometry | PRIDE-ML | PRIDE inspector/PRIDE Biomart | X (FTP upload) | Java API | X | X | X
BII | All | ISA-TAB | browse/text search/filtering | X | SOAP web services | X | X | X
Bio-GraphIIn | All | ISA-TAB | browse/filter/search/advanced search | YES (upload, REST) | REST API | YES | YES | YES
*We are referring to the ArrayExpress repository, not to the Expression Atlas, which is available in RDF
Functionality provided by existing repositories & Bio-GraphIIn requirements
Bio-GraphIIn design & architecture
semantic representation of the graph, rich queries over a common semantic framework enabling integration with other repositories
independence from underlying graph technology: property graphs, http://www.tinkerpop.com/
R SPARQL package: http://www.r-bloggers.com/sparql-with-r-in-less-than-5-minutes/
Refinery (http://refinery-platform.org/): Django-based analysis and visualisation platform, relies on ISA-TAB metadata
SPARQL queries
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bfo: <http://purl.obolibrary.org/obo/BFO_>
PREFIX iao: <http://purl.obolibrary.org/obo/IAO_>
PREFIX obi: <http://purl.obolibrary.org/obo/OBI_>
PREFIX tax: <http://purl.obolibrary.org/obo/NCBITaxon_>
PREFIX isa: <http://purl.org/isa-tools/ISA_>
PREFIX ro: <http://purl.obolibrary.org/obo/RO_>

SELECT DISTINCT ?i_id ?s_id ?s_title ?organism
WHERE {
  ?study rdf:type obi:0000066 .
  ?study rdfs:label ?s_id .
  ?s_title_iri rdf:type obi:0001622 .
  ?s_title_iri iao:0000219 ?study .
  ?s_title_iri isa:00000089 ?s_title .
  ?source rdf:type bfo:0000040 .
  ?source obi:0000295 ?study .
  OPTIONAL {
    ?study bfo:0000050 ?investigation .
    ?investigation rdf:type obi:0000011 .
    ?investigation rdfs:label ?i_id .
  }
  OPTIONAL {
    ?source rdf:type ?organism_iri .
    ?organism_iri rdf:type obi:0100026 .
    ?organism_iri rdfs:label ?organism .
  }
  OPTIONAL {
    ?source bfo:0000053 ?characteristic .
    ?characteristic rdf:type bfo:0000005 .
    ?characteristic rdfs:comment ?comment .
    ?characteristic rdfs:label ?organism .
    FILTER regex(str(?comment), "organism")
  }
}
Term labels for identifiers in the query: obi:0000066 = obi:investigation; obi:0001622 = obi:investigation_title; iao:0000219 = iao:denotes
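One way to submit a query like the one above is the SPARQL 1.1 Protocol, in which the query text travels as the `query` parameter of an HTTP GET request. The sketch below only builds such a request with the Python standard library and does not send it. The endpoint URL is hypothetical, the query is a shortened variant of the slide's query, and nothing here is taken from the Bio-GraphIIn prototype itself.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL, for illustration only.
ENDPOINT = "http://example.org/sparql"

# Shortened variant of the slide's query: list the identifiers of all studies.
QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obi: <http://purl.obolibrary.org/obo/OBI_>
SELECT DISTINCT ?s_id
WHERE {
  ?study rdf:type obi:0000066 .
  ?study rdfs:label ?s_id .
}
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build, but do not send, a SPARQL 1.1 Protocol 'query via GET' request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialisation of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url[:40] + "...")
```

Passing the returned object to `urllib.request.urlopen` would send it, provided a real endpoint is substituted for the placeholder URL.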
Considering theoretical results on SPARQL to improve query performance, such as AND-OPT well-designed graph patterns
Pérez et al., Semantics and complexity of SPARQL, ACM Trans. Database Syst., 2009
Letelier et al., Static analysis and optimization of semantic web queries, PODS 2012
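For readers unfamiliar with the term, an AND-OPT pattern is called well designed (Pérez et al.) when, for every subpattern (P1 OPT P2), any variable that occurs in P2 and also outside that subpattern occurs in P1 as well. The following is a minimal sketch of that check, assuming a hypothetical encoding of patterns as nested Python tuples and ignoring FILTER and UNION. It is an illustration of the definition, not the optimisation used in Bio-GraphIIn.

```python
def is_triple(pattern):
    """A triple pattern is a 3-tuple of strings; compound patterns start with an operator."""
    return pattern[0] not in ("AND", "OPT")


def variables(pattern):
    """Return the set of variables (terms starting with '?') in a pattern."""
    if is_triple(pattern):
        return {term for term in pattern if term.startswith("?")}
    return variables(pattern[1]) | variables(pattern[2])


def well_designed(pattern, outside=frozenset()):
    """Check the well-designedness condition; `outside` holds variables used elsewhere."""
    if is_triple(pattern):
        return True
    op, left, right = pattern
    left_vars, right_vars = variables(left), variables(right)
    if op == "OPT" and any(v in outside for v in right_vars - left_vars):
        return False  # variable of the optional part leaks outside without being in P1
    return (well_designed(left, outside | right_vars)
            and well_designed(right, outside | left_vars))


# A single OPTIONAL whose extra variable is used nowhere else.
GOOD = ("OPT", ("?study", "rdfs:label", "?s_id"),
        ("?study", "bfo:0000050", "?investigation"))
# ?x occurs in the optional part and outside it, but not in the mandatory part.
BAD = ("AND", ("?x", "name", "paul"),
       ("OPT", ("?y", "name", "george"), ("?x", "email", "?z")))
# Two OPTIONAL parts binding the same variable, simplified for illustration.
SHARED = ("OPT",
          ("OPT", ("?source", "obi:0000295", "?study"),
           ("?source", "rdf:type", "?organism")),
          ("?source", "bfo:0000053", "?organism"))

print(well_designed(GOOD), well_designed(BAD), well_designed(SHARED))
```

Under this definition, two optional parts that bind the same variable, as in the simplified `SHARED` pattern, are not well designed, which is the kind of structure such static analysis is meant to flag.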
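Other term labels used in the query: bfo:0000040 = bfo:material_entity; obi:0000295 = obi:is_specified_input_of; bfo:0000050 = bfo:part_of; obi:0000011 = obi:planned_process; obi:0100026 = obi:organism; bfo:0000005 = bfo:dependent continuant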
Bio-GraphIIn prototype
Figure: prototype screenshots listing investigations, studies and assays, with measurement and technology.
Summary and future work
• Bio-GraphIIn - the new integrative and semantically-enabled repository for the ISA infrastructure: motivation, requirements, design & architecture, prototype
• Support for data integration, uniform semantic queries across experiments enabled by a common semantic framework (ISA2OWL)
• More work required on
  • Querying: performance analysis, support for ad hoc queries
  • Extension/improvement of prototype
  • Interfaces to services (e.g. BioPortal) and analysis/visualisation platforms (e.g. R/Bioconductor & Refinery)
Questions?
You can email [email protected]
View our blog: http://isatools.wordpress.com
Follow us on Twitter: @isatools
View our website: http://www.isa-tools.org
View our Git repo & contribute: http://github.com/ISA-tools
Thanks for your attention!