Bio-GraphIIn: a graph-based, integrative and semantically enabled repository for life science experimental data

Alejandra González-Beltrán, PhD
Oxford e-Research Centre, University of Oxford
[email protected] @alegonbel

NETTAB 2013, October 16-18, 2013, Venice Lido, Italy

Presentation at NETTAB 2013, Venice Lido, Italy. October 2013. http://nettab.org/2013/progr.php

Page 2: NETTAB 2013

[Figure: experimental workflow diagram. Labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Use existing data; Perform new experiment]

Experimental workflow

Page 3: NETTAB 2013

[Figure: the experimental workflow diagram from the previous slide, annotated with "metadata + data"]

Experimental workflow

Page 4: NETTAB 2013

[Figure: the experimental workflow diagram, annotated with "metadata + data" and "Science Reproducibility"]

Experimental workflow

Page 5: NETTAB 2013

[Figure: the experimental workflow diagram shown twice, annotated with "Data Reusability"]

Experimental workflow

Page 6: NETTAB 2013

Outline

• Motivation for an integrative and semantically-enabled metadata repository in life sciences
  • retrospective data submissions
  • heterogeneous experimental data
  • fragmentation of formats and databases
  • semantic queries leading to integrative analysis

• Context: the ISA infrastructure

• Bio-GraphIIn requirements

• Bio-GraphIIn design & architecture

• Bio-GraphIIn graph queries

• Bio-GraphIIn prototype

• Summary and future work

Page 8: NETTAB 2013

Motivation 1/4: retrospective data submissions

[Figure: experimental workflow diagram]

Page 9: NETTAB 2013

Motivation 1/4: retrospective data submissions

[Figure: experimental workflow diagram, with "retrospective metadata" marked]

Page 10: NETTAB 2013

Motivation 1/4: retrospective data submissions

[Figure: experimental workflow diagram, with "retrospective metadata" marked]

Metadata edits to repositories are not straightforward, often requiring deleting the submission and re-submitting the data.

Page 11: NETTAB 2013

Motivation 1/4: retrospective data submissions

[Figure: experimental workflow diagram, with "prospective metadata" marked and "metadata" labels attached at several points in the workflow]

Page 12: NETTAB 2013

Motivation 1/4: retrospective data submissions

[Figure: experimental workflow diagram, with "prospective metadata" marked and "metadata" labels attached at several points in the workflow]

Support incremental data deposition + metadata edits

Page 13: NETTAB 2013

Motivation 2/4: heterogeneous experimental data (Data Collection)

Page 14: NETTAB 2013

Motivation 3/4: fragmentation of formats and databases (Publication)

Page 15: NETTAB 2013

Motivation 4/4: semantic queries leading to integrative analysis (Analysis, Visualization)

• support for a rich and uniform query interface across studies, enabling integrative data analysis to provide new insights at the systems biology level
  • e.g. find all data files associated with samples from a particular organism (e.g. Homo sapiens) and a particular tissue type (e.g. liver)

• allow users to select a set of samples/data files through browsing and semantic filtering

• provide links to analysis and visualisation platforms

[Figure label: life science experiments repo]
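
The example query on this slide (data files for samples of a given organism and tissue type) can be sketched in a few lines of Python. This is only an illustration over an in-memory list: the `SampleRecord` type, its field names and the sample values are assumptions made for the sketch, not part of Bio-GraphIIn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical flattened view of one sample and its data file."""
    sample: str
    organism: str
    tissue: str
    data_file: str


def find_data_files(records, organism, tissue):
    """Return the data files whose sample matches both organism and tissue."""
    return [r.data_file for r in records
            if r.organism == organism and r.tissue == tissue]


# Made-up records, for illustration only.
RECORDS = [
    SampleRecord("s1", "Homo sapiens", "liver", "s1.cel"),
    SampleRecord("s2", "Homo sapiens", "kidney", "s2.cel"),
    SampleRecord("s3", "Mus musculus", "liver", "s3.cel"),
]

if __name__ == "__main__":
    print(find_data_files(RECORDS, "Homo sapiens", "liver"))
```

A uniform query interface would let the same filter run across studies of different experiment types, which is the point of the slide.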

Page 17: NETTAB 2013

The Investigation/Study/Assay (ISA) infrastructure

generic format for experimental description and data exchange

open source software tools

community engagement

Page 18: NETTAB 2013

ISA hierarchy: investigation, study, assay(s), data

investigation: high-level concept to link related studies

study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.

assay: test performed either on material taken from the subject or on the whole initial subject, producing qualitative or quantitative measurements (data)

data: external files in native or other formats; pointers to data file names/location

communities:

• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• stem cell discovery
• system biology
• transcriptomics
• toxicogenomics

working to build a library of cellular signatures
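
The investigation/study/assay nesting described on this slide can be sketched as a small data structure. This is a minimal illustration, not the ISA tools' own object model: the class fields, identifiers and the measurement name are assumptions chosen for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """A test whose measurements are stored in external data files."""
    measurement: str
    data_files: List[str] = field(default_factory=list)  # pointers only


@dataclass
class Study:
    """The central unit: a subject under study plus its assays."""
    identifier: str
    subject: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High-level concept linking related studies."""
    identifier: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self):
        """Collect the data file pointers of every assay in every study."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


# Hypothetical identifiers; the file names follow the deck's .cel examples.
investigation = Investigation(
    "INV-1",
    [Study("STUDY-1", "Homo sapiens",
           [Assay("transcription profiling", ["h1-s1.cel", "h1-s2.cel"])])],
)

if __name__ == "__main__":
    print(investigation.all_data_files())
```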

Page 19: NETTAB 2013

Experimental workflow - graph representation

[Figure: experimental workflow drawn as a graph. Sources H1 and H2 (H. sapiens; ages 33 and 35 Years) yield samples H1.sample1, H1.sample2 and H2.sample1; Labeling steps yield H1.sample1.labeled and H2.sample1.labeled; Scanning steps yield the data files h1-s1.cel, h1-s2.cel and h2-s1.cel. Some nodes are elided as "...".]

Page 20: NETTAB 2013

Spreadsheets for end-users

vocabulary for the description of the experimental workflow

[Figure: the experimental workflow graph shown next to its spreadsheet form. Each of the three rows describes one sample: organism (H. sapiens), source (H1, H1, H2), age (35, 35, 33 Years), sample name (H1.sample1, H1.sample2, H2.sample1), Labeling, labeled sample, Scanning and data file (h1-s1.cel, h1-s2.cel, h2-s1.cel).]

Experimental workflow - graph representation

Page 21: NETTAB 2013

Spreadsheets for end-users

vocabulary for the description of the experimental workflow

syntactic interoperability across biological experiments of different types

[Figure: the experimental workflow graph shown next to its spreadsheet form. Each of the three rows describes one sample: organism (H. sapiens), source (H1, H1, H2), age (35, 35, 33 Years), sample name (H1.sample1, H1.sample2, H2.sample1), Labeling, labeled sample, Scanning and data file (h1-s1.cel, h1-s2.cel, h2-s1.cel).]

Experimental workflow - graph representation
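
The relationship between the spreadsheet rows and the workflow graph can be sketched as follows: each row is read as one path through the graph, and consecutive cells become edges. This is a simplified illustration, not the ISA-TAB parser. The row layout is an assumption, and the node name `H1.sample2.labeled` is assumed because the slide elides it as "...".

```python
from collections import defaultdict

# Each hypothetical row is one path: source -> sample -> labeled sample -> file.
ROWS = [
    ["H1", "H1.sample1", "H1.sample1.labeled", "h1-s1.cel"],
    ["H1", "H1.sample2", "H1.sample2.labeled", "h1-s2.cel"],  # assumed name
    ["H2", "H2.sample1", "H2.sample1.labeled", "h2-s1.cel"],
]


def rows_to_graph(rows):
    """Turn consecutive cells of each row into directed edges."""
    graph = defaultdict(set)
    for row in rows:
        for parent, child in zip(row, row[1:]):
            graph[parent].add(child)
    return graph


def reachable_files(graph, start, suffix=".cel"):
    """Depth-first walk from start; return reachable nodes ending in suffix."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return sorted(n for n in seen if n.endswith(suffix))


GRAPH = rows_to_graph(ROWS)

if __name__ == "__main__":
    print(reachable_files(GRAPH, "H1"))
```

Repeated cells such as `H1` collapse into a single node, which is why the graph form is more compact than the table.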

Page 22: NETTAB 2013

Machine-readable representation: Graph + Semantics

semantic interoperability across biological experiments of different types

[Figure: semantic graph for source H1. Recoverable statements:]

H1: obi:material entity, tax:homosapiens
H1.sample1, H1.sample2: obi:material sample
H1.sample1 bfo:derives_from H1
H1.sample2 bfo:derives_from H1
labeling1, labeling2: obi:material processing
H1.sample1 obi:is_specified_input_of labeling1
H1.sample1.labeled obi:is_specified_output_of labeling1
H1.sample1.labeled, H1.sample2.labeled: obi:processed material
scanning1, scanning2: obi:planned process
H1.sample1.labeled obi:is_specified_input_of scanning1
h1-s1.cel obi:is_specified_output_of scanning1
h1-s1.cel, h1-s2.cel: isa:raw data file
(the same input/output pattern links H1.sample2, labeling2, H1.sample2.labeled, scanning2 and h1-s2.cel)
labeling protocol: obi:protocol (linked to the labeling processes by isa:executes)
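
To show how such statements support queries, here is a minimal sketch that stores a few of the slide's statements as plain (subject, predicate, object) tuples and follows the input/output links from a sample to its data file. It uses no RDF library; the helper names `match`, `outputs_of` and `data_files_for` are hypothetical.

```python
# A subset of the slide's statements, written as (subject, predicate, object).
TRIPLES = [
    ("H1.sample1", "bfo:derives_from", "H1"),
    ("H1.sample1", "obi:is_specified_input_of", "labeling1"),
    ("H1.sample1.labeled", "obi:is_specified_output_of", "labeling1"),
    ("H1.sample1.labeled", "obi:is_specified_input_of", "scanning1"),
    ("h1-s1.cel", "obi:is_specified_output_of", "scanning1"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def outputs_of(triples, node):
    """Outputs of every process that takes node as a specified input."""
    processes = [o for _, _, o in
                 match(triples, s=node, p="obi:is_specified_input_of")]
    return [s for proc in processes
            for s, _, _ in match(triples, p="obi:is_specified_output_of", o=proc)]


def data_files_for(triples, sample):
    """Follow input/output links until nodes with no further outputs."""
    frontier, leaves = [sample], []
    while frontier:
        node = frontier.pop()
        produced = outputs_of(triples, node)
        if produced:
            frontier.extend(produced)
        elif node != sample:
            leaves.append(node)
    return leaves


if __name__ == "__main__":
    print(data_files_for(TRIPLES, "H1.sample1"))
```

In the real system this traversal would be a SPARQL query over RDF; the tuple list only mirrors the shape of the data.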

Page 23: NETTAB 2013

architecture

• ISA-TAB parser
• isa2owl mapping parser
• graph analysis
• configuration file
• Resource Description Framework (RDF)

mappings between the ISA-TAB syntax and ontologies

Page 24: NETTAB 2013

ISA-OBI mapping

Ontology for Biomedical Investigations (OBI)

Page 26: NETTAB 2013

Bio-GraphIIn Requirements

Bio-GraphIIn (pronounced “bio-graphene”) stands for Biological Graph Investigation Index

BioInvestigation Index (BII)

Page 27: NETTAB 2013

Bio-GraphIIn Requirements

• support prospective annotation of experiments

• support Create Read Update Delete (CRUD) operations

• manage heterogeneous biological and biomedical metadata
  • relying on ISA-TAB

• support data integration & semantic queries
  • relying on ISA2OWL

• links to analysis and visualisation platforms

• take advantage of experimental design information to improve metadata, for example by including study groups
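
The CRUD requirement, together with the earlier point about incremental deposition and metadata edits, can be illustrated with a minimal in-memory store. This is a sketch under stated assumptions: the `MetadataStore` class and its method names are hypothetical and do not describe the Bio-GraphIIn REST API.

```python
class MetadataStore:
    """Hypothetical in-memory store supporting Create/Read/Update/Delete."""

    def __init__(self):
        self._records = {}

    def create(self, key, metadata):
        """Add a new record; refuse to overwrite an existing one."""
        if key in self._records:
            raise ValueError(f"record {key!r} already exists")
        self._records[key] = dict(metadata)

    def read(self, key):
        """Return a copy so callers cannot mutate stored metadata."""
        return dict(self._records[key])

    def update(self, key, **changes):
        """Edit metadata in place: no delete-and-resubmit cycle needed."""
        self._records[key].update(changes)

    def delete(self, key):
        del self._records[key]


if __name__ == "__main__":
    store = MetadataStore()
    store.create("study-1", {"organism": "Homo sapiens"})
    store.update("study-1", tissue="liver")  # incremental annotation
    print(store.read("study-1"))
```

The `update` method is the operation the motivation slides identify as missing from existing repositories, where an edit often means deleting and re-submitting.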

Page 28: NETTAB 2013

Functionality provided by existing repositories & Bio-GraphIIn requirements

| Repository | Data Types | Format | Browsing/Searching | Programmatic submission | Programmatic access | CRUD operations | Community curation | RDF |
|---|---|---|---|---|---|---|---|---|
| BioSample DB | sample info | Sample-TAB | browse/search | X (email submission) | REST API | X | X | YES |
| ArrayExpress/GEO | sequencing | MAGE-TAB | browse/filter/search/advanced search | MAGE-TAB spreadsheet/MIAMExpress | REST API | X | X | X* |
| SRA/ENA | next generation sequencing | SRA-XML | browse/text/sequence/advanced search | Webin, REST | REST API | X | X | X |
| PRIDE | mass spectrometry | PRIDE-ML | PRIDE inspector/PRIDE Biomart | X (FTP upload) | Java API | X | X | X |
| BII | All | ISA-TAB | browse/text search/filtering | X | SOAP web services | X | X | X |
| Bio-GraphIIn | All | ISA-TAB | browse/filter/search/advanced search | YES (upload, REST) | REST API | YES | YES | YES |

*We are referring to the ArrayExpress repository, not to the Expression Atlas, which is available in RDF

Page 29: NETTAB 2013

Functionality provided by existing repositories & Bio-GraphIIn requirements

[Figure: the same comparison table as on the previous slide, with overlay labels on Bio-GraphIIn features: "browsing" (once) and "prototype" (five times)]

Page 32: NETTAB 2013

semantic representation of the graph, rich queries over a common semantic framework enabling integration with other repositories

Page 34: NETTAB 2013

independence from underlying graph technology

http://www.tinkerpop.com/

property graphs
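
One common way to stay independent of the underlying graph technology is to code against a small abstract interface and swap implementations behind it. The sketch below illustrates that idea only. It is not the TinkerPop API, and the names `GraphBackend`, `InMemoryBackend` and `derived_from` are hypothetical.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class GraphBackend(ABC):
    """Hypothetical minimal interface that repository code depends on."""

    @abstractmethod
    def add_edge(self, source, label, target):
        ...

    @abstractmethod
    def neighbours(self, source, label):
        ...


class InMemoryBackend(GraphBackend):
    """One interchangeable implementation, backed by a dict."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add_edge(self, source, label, target):
        self._edges[(source, label)].append(target)

    def neighbours(self, source, label):
        return list(self._edges.get((source, label), []))


def derived_from(backend, node):
    """Application code sees only the interface, not the storage engine."""
    return backend.neighbours(node, "derives_from")


if __name__ == "__main__":
    backend = InMemoryBackend()
    backend.add_edge("H1.sample1", "derives_from", "H1")
    print(derived_from(backend, "H1.sample1"))
```

A second class implementing `GraphBackend` over a graph database could replace `InMemoryBackend` without changing `derived_from`.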

Page 36: NETTAB 2013

R SPARQL package: http://www.r-bloggers.com/sparql-with-r-in-less-than-5-minutes/

Refinery: http://refinery-platform.org/

Django-based analysis and visualisation platform, relies on ISA-TAB metadata

Page 38: NETTAB 2013

SPARQL queries

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bfo: <http://purl.obolibrary.org/obo/BFO_>
PREFIX iao: <http://purl.obolibrary.org/obo/IAO_>
PREFIX obi: <http://purl.obolibrary.org/obo/OBI_>
PREFIX tax: <http://purl.obolibrary.org/obo/NCBITaxon_>
PREFIX isa: <http://purl.org/isa-tools/ISA_>
PREFIX ro: <http://purl.obolibrary.org/obo/RO_>

SELECT DISTINCT ?i_id ?s_id ?s_title ?organism
WHERE {
  ?study rdf:type obi:0000066 .              # obi:investigation
  ?study rdfs:label ?s_id .
  ?s_title_iri rdf:type obi:0001622 .        # obi:investigation_title
  ?s_title_iri iao:0000219 ?study .          # iao:denotes
  ?s_title_iri isa:00000089 ?s_title .
  ?source rdf:type bfo:0000040 .             # bfo:material_entity
  ?source obi:0000295 ?study .               # obi:is_specified_input_of
  OPTIONAL {
    ?study bfo:0000050 ?investigation .      # bfo:part_of
    ?investigation rdf:type obi:0000011 .    # obi:planned_process
    ?investigation rdfs:label ?i_id .
  }
  OPTIONAL {
    ?source rdf:type ?organism_iri .
    ?organism_iri rdf:type obi:0100026 .     # obi:organism
    ?organism_iri rdfs:label ?organism .
  }
  OPTIONAL {
    ?source bfo:0000053 ?characteristic .
    ?characteristic rdf:type bfo:0000005 .   # bfo:dependent continuant
    ?characteristic rdfs:comment ?comment .
    ?characteristic rdfs:label ?organism .
    FILTER regex(str(?comment), "organism")
  }
}

Considering theoretical results on SPARQL to improve query performance, such as AND-OPT well-designed graph patterns:

Pérez et al. Semantics and complexity of SPARQL. ACM Trans. Database Syst., 2009.

Letelier et al. Static analysis and optimization of semantic web queries. PODS 2012.
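
The OPTIONAL blocks in the query behave like a left outer join over solution mappings: a row from the mandatory part is kept even when the optional part has no compatible match. The sketch below shows that semantics in simplified form over plain dictionaries, ignoring FILTER conditions. The function names and sample bindings are assumptions for illustration.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on all shared variables."""
    return all(m1[k] == m2[k] for k in m1.keys() & m2.keys())


def join(left, right):
    """AND: merge every compatible pair of mappings."""
    return [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]


def left_join(left, right):
    """OPT: like join, but keep left mappings that have no compatible partner."""
    result = []
    for m1 in left:
        extended = [{**m1, **m2} for m2 in right if compatible(m1, m2)]
        result.extend(extended if extended else [m1])
    return result


# Hypothetical bindings: two studies, only one belongs to an investigation.
STUDIES = [{"study": "s1"}, {"study": "s2"}]
INVESTIGATIONS = [{"study": "s1", "investigation": "i1"}]

if __name__ == "__main__":
    print(join(STUDIES, INVESTIGATIONS))
    print(left_join(STUDIES, INVESTIGATIONS))
```

With `join`, study `s2` would disappear from the results; with `left_join` it is kept without an `investigation` binding, which is how the query above still returns studies that have no investigation.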

Page 41: NETTAB 2013

investigation studies assays

measurement technology

Page 56: NETTAB 2013

http://bii.oerc.ox.ac.uk

Page 59: NETTAB 2013

Summary and future work

• Bio-GraphIIn: the new integrative and semantically-enabled repository for the ISA infrastructure: motivation, requirements, design & architecture, prototype

• Support for data integration and uniform semantic queries across experiments, enabled by a common semantic framework (ISA2OWL)

• More work required on
  • Querying: performance analysis, support for ad hoc queries
  • Extension/improvement of prototype
  • Interfaces to services (e.g. BioPortal) and analysis/visualisation platforms (e.g. R/Bioconductor & Refinery)

Page 60: NETTAB 2013

funders

Page 61: NETTAB 2013

Questions?

You can email [email protected]

View our blog: http://isatools.wordpress.com

Follow us on Twitter: @isatools

View our website: http://www.isa-tools.org

View our Git repo & contribute: http://github.com/ISA-tools

Thanks for your attention!