Eamonn Maguire, Lead Software Engineer, Oxford University. ISCB-Asia, 17th December 2012. The Open Source ISA Metadata Tracking Framework: From Data Curation and Management at the Source, to the Linked Data Universe. [email protected]

Eamonn Maguire: The Open Source ISA Metadata Tracking Framework: From Data Curation and Management at the Source, to the Linked Data Universe

DESCRIPTION

Eamonn Maguire's talk on "The Open Source ISA Metadata Tracking Framework: From Data Curation and Management at the Source, to the Linked Data Universe" at ISCB-Asia, December 17th 2012

Page 1: Eamonn Maguire: The Open Source ISA Metadata Tracking Framework: From Data Curation and Management at the Source, to the Linked Data Universe

Eamonn Maguire, Lead Software Engineer, Oxford University

ISCB-Asia, 17th December 2012

The Open Source ISA Metadata Tracking Framework: From Data Curation and Management at the Source, to the Linked Data Universe

[email protected]

Page 2

What is ISA all about?

We want to enable better reporting of experiments...

We want to make it easier for submitters...

We want to provide tooling which biologists will want to use...

Page 3

What’s the problem?

Could be beans. Could be peas. Could be soup.

Analogy time. Each can is an experiment. We have no labels, so no indication of what is in the can.

In biology, things aren’t quite as bad as this: we have some labels, but they aren’t all in the same language.

1. There is fragmentation in formats: the formats used to describe experiments differ, e.g. MAGE-Tab, PRIDE-ML, SRA-XML.
2. Different formats often capture different information, often not enough to actually repeat an experiment correctly.
3. The terminologies used to describe an experiment differ, e.g. humans vs Homo sapiens or rat vs Rattus norvegicus, making search more difficult.

Tin can analogy borrowed from Norman Morrison and converted from ontologies to metadata transfer standards.

Page 4

可能是豌豆 ("might be peas") - a different representation... a non-Latin language

Page 5

Might be petit pois - a different terminology

Page 6

1. There is fragmentation in formats

Can you imagine having to translate everything you write into a different language in order to submit your data?

Page 7

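你能想象有翻译成不同的语言编写的一切,以提交您的数据吗?即使转换工具,像谷歌,翻译弄错了。 [Chinese machine translation of: "Can you imagine having to translate everything you write into a different language in order to submit your data? Even conversion tools, like Google Translate, get it wrong."]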

Page 8

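An féidir leat a shamhlú go bhfuil gach rud a scríobh tú a aistriú isteach i dteanga eile d'fhonn a chur isteach do chuid sonraí? Fiú uirlisí chomhshó, cosúil le google translate a fháilsé mícheart. [Irish machine translation of: "Can you imagine having to translate everything you write into a different language in order to submit your data? Even conversion tools, like Google Translate, get it wrong."]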

Page 9

1. There is fragmentation in formats: our solution

Repositories are making it difficult for biologists to submit data, and for others to use it, particularly for those performing multi-omic experiments. To submit, say, proteomic and transcriptomic data, one must provide slightly different information in two very different formats. Why?

Our solution is one general-purpose, flexible format, herein referred to as ISA-Tab.

It is a domain-agnostic format to capture experimental metadata in omic experiments (transcriptomic, genomic, proteomic, metabolomic) as well as traditional experiments such as clinical chemistry and histology.

It already works in lots of domains: nutrigenomics, toxicogenomics, public health, etc.

Page 10

1. There is fragmentation in formats: our solution

[Figure: the ISA hierarchy. An investigation sits at the top; assay(s) hold pointers to data file names/locations; the data are external files in native or other formats.]

investigation: high-level concept to link related studies.

study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.

assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data).

Biologists like tab. They don’t like XML.

Through basic inference... ISA-Tab is good :)
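The investigation, study and assay hierarchy described on this slide can be pictured as three nested record types. The following is only a minimal sketch in Python, not the ISA-Tab format or any ISA tool API: the class names mirror the slide, but every attribute name and every example value is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """A test performed on material from the subject; holds pointers to data files."""
    measurement: str                                      # illustrative attribute
    data_files: List[str] = field(default_factory=list)  # file names, not the data itself


@dataclass
class Study:
    """The central unit: the subject, its characteristics and treatments."""
    subject: str
    treatments: List[str] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High-level concept that links related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self) -> List[str]:
        """Walk investigation -> study -> assay and collect the file pointers."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


# Hypothetical example values, invented for this sketch.
inv = Investigation(
    title="Example investigation",
    studies=[
        Study(
            subject="Mus musculus",
            treatments=["knockout"],
            assays=[
                Assay("metabolite profiling", ["ko15.CDF"]),
                Assay("transcription profiling", ["array1.CEL"]),
            ],
        )
    ],
)
print(inv.all_data_files())
```

The point of the nesting is that one study can carry several assays of different kinds, which is what makes a single format usable for multi-omic experiments.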

Page 11

2. Different formats often capture different information... but there are lots of similarities

MIBBI: Minimal Information about a Biological or Biomedical Investigation.

The information captured by a format is determined by a ‘checklist’, ideally a list of fields that together provide the minimal amount of information required to reproduce an experiment.

We have 32 checklists at present because there are differences in what is deemed important depending on the experiment being performed.

MIBBI is trying to harmonise these checklists to reduce redundancy and make them interoperable.
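To make the idea of a checklist concrete, here is a small Python sketch that treats a checklist as a list of required field names and reports which ones a metadata record leaves empty. The field names are invented for illustration and are not taken from any actual MIBBI checklist.

```python
from typing import Dict, List

# Hypothetical checklist: these field names are illustrative only.
CHECKLIST: List[str] = ["organism", "tissue", "protocol", "instrument"]


def missing_fields(record: Dict[str, str], checklist: List[str]) -> List[str]:
    """Return the checklist fields that are absent or blank in the record."""
    return [name for name in checklist if not record.get(name, "").strip()]


# "protocol" is present but blank, "instrument" is absent altogether.
record = {"organism": "Rattus norvegicus", "tissue": "liver", "protocol": " "}
missing = missing_fields(record, CHECKLIST)
print(missing)
```

Two checklists that share most of their fields differ only in the contents of the list, which is one way to picture the overlap between checklists that harmonisation targets.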

Page 12

Now integrated in

Helping to demystify the unwieldy world of standards...

Find out what standards are out there...MI Checklists, ontologies and formats plus what domains they are suited to...

Find out about data sharing policies from NIH for example.

Databases, which standards they use etc.

Page 13

Now integrated in

In biology, things aren’t quite as bad as this, we have some labels, but they aren’t all in the same language. What do I mean by this? Well...

1. there is fragmentation:

2. different formats often capture different information

3. the terminologies used to describe an experiment are different: we promote the use of ontologies to harmonize the recording of experiments.
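As a rough illustration of why ontology terms help search, the Python sketch below maps several free-text labels onto one shared identifier. The lookup table is hand-written for this example and uses NCBI Taxonomy identifiers; the function name and the table are assumptions, not part of the ISA tools, which rely on ontology services rather than a hard-coded dictionary.

```python
from typing import Dict, Optional

# Hand-written synonym table for illustration; identifiers are NCBI Taxonomy IDs.
SYNONYMS: Dict[str, str] = {
    "human": "NCBITaxon:9606",
    "humans": "NCBITaxon:9606",
    "homo sapiens": "NCBITaxon:9606",
    "rat": "NCBITaxon:10116",
    "rattus norvegicus": "NCBITaxon:10116",
}


def to_term(label: str) -> Optional[str]:
    """Normalise a free-text organism label to a shared identifier, if known."""
    return SYNONYMS.get(label.strip().lower())


# Different labels for the same organism resolve to the same identifier.
print(to_term("Humans"), to_term("homo sapiens"))
print(to_term("rat"), to_term("Rattus norvegicus"))
```

Once two experiments carry the same identifier, a search for either label can find both, whatever words the submitter originally typed.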

Page 14

The ISA tools...

The ISA tools bring together a common representation, MI checklists and ontologies.

[Diagram labels: Common representation, MI Checklists, Ontologies]

Page 15

The ISA tools

Developed on top of the ISA-Tab format...modular, configurable, open source, Java based*

See them all at isa-tools.org

Page 16

The ISA tools... a tool for all your needs

Page 17

Configurable...

So, our infrastructure is built upon XML files. These are created by the ISAConfigurator.

A configuration XML file describes the fields (or checklist) required to describe a particular experiment and any ontologies to be used.

We need to support lots of different checklists, and it should be easy for people to change their requirements should they need to....
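A configuration file of this kind can be pictured as a list of fields, each flagged as required or optional and optionally tied to an ontology. The sketch below is hedged accordingly: the XML layout, element names and attribute names are all invented for illustration and are not the actual schema produced by the ISAConfigurator. It only shows how such a file could be read with Python's standard library.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Union

# Hypothetical configuration: element and attribute names are assumptions.
CONFIG_XML = """
<configuration name="hypothetical-metabolomics">
  <field header="Organism" required="true" ontology="NCBITaxon"/>
  <field header="Sample Name" required="true"/>
  <field header="Comment" required="false"/>
</configuration>
"""

FieldSpec = Dict[str, Union[Optional[str], bool]]


def load_fields(xml_text: str) -> List[FieldSpec]:
    """Parse the hypothetical configuration into a list of field descriptions."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "header": el.get("header"),
            "required": el.get("required") == "true",
            "ontology": el.get("ontology"),  # None when no ontology is attached
        }
        for el in root.findall("field")
    ]


fields = load_fields(CONFIG_XML)
required_headers = [f["header"] for f in fields if f["required"]]
print(required_headers)
```

Changing a checklist then means editing the configuration file rather than the software, which is the kind of flexibility the slide asks for.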

Page 18

Create configuration xml files

Page 19

ISAcreator: Create & Edit ISA-Tab

Page 20

The ISAcreator...

Developed to be a user-friendly way to enter standards-compliant metadata, it has lots of features...

Powered by NCBO Annotator

spreadsheet-like interface

automated ontology tagging

QR code generator

publication searcher

ontology search

visualization

file chooser

But these are just some of them... we also have a data entry wizard and an import utility...

Page 21

Ontology search and automated annotation in Google Docs

Page 22
Page 23

Make sure the ISA-Tab is correct

Page 24

validate from the dedicated tool...

or... within ISAcreator directly...

or... validate from the command line...
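To give a flavour of what a first validation pass might look at, here is a deliberately small Python sketch that checks whether a directory contains the kinds of files an ISA-Tab submission is made of. It is not the ISA validator and checks far less than it would. It assumes the common ISA-Tab naming convention of `i_`, `s_` and `a_` prefixes for investigation, study and assay files, and the function name and messages are invented for this example.

```python
import tempfile
from pathlib import Path
from typing import List


def check_isatab_dir(directory: Path) -> List[str]:
    """Return a list of structural problems found in a candidate ISA-Tab directory."""
    problems = []
    if len(list(directory.glob("i_*.txt"))) != 1:
        problems.append("expected exactly one investigation file (i_*.txt)")
    if not list(directory.glob("s_*.txt")):
        problems.append("no study file (s_*.txt) found")
    if not list(directory.glob("a_*.txt")):
        problems.append("no assay file (a_*.txt) found")
    return problems


# Build a throwaway directory that is missing its assay file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "i_investigation.txt").write_text("Investigation Title\tExample\n")
    (root / "s_study.txt").write_text("Source Name\tSample Name\n")
    result = check_isatab_dir(root)

print(result)
```

A real validator would go on to check the contents of each file against the configuration in use; this sketch stops at the file level.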

Page 25

Convert to or from differing formats

Page 26

The converters

Converts MAGE-Tab to ISA-Tab. This is still in beta; however, we are getting close to a fully working version. We’ve successfully created validated ISA-Tab for ~90% of the 21k experiments in ArrayExpress.

Available as a web service and a web interface; the source is also available for running conversions locally.

http://isatab.sourceforge.net/magetoisa/

Fully endorsed by ArrayExpress, PRIDE and the European Nucleotide Archive (ENA)...

Page 27

The converters...semantic web

[Figure: graph view of an ISA-Tab example. Nodes: Saghantelian_1, KO1, KO1_extract, ./cdf/KO/ko15.CDF. Node types: sample collection, extraction, material processing, mass spectrometry, processed material, material entity, information content entity. Edge labels: has specified input, has specified output, derives from, type.]

Page 28

The converters...semantic web

• Make the semantics of ISAtab explicit, including materials & data entities & processes
• Exploit the semantic annotations available in ISAtab datasets
• Augment ISA syntax with new elements (e.g. groups), facilitating the understanding & querying of experimental design
• Facilitate querying, data integration & knowledge discovery/reasoning

Page 29

The converters...semantic web

Notes in Lab books (information for humans)

Spreadsheets & Tables (ISAtab metadata)

Facts as RDF statements (information for machines)
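The step from table rows to machine-readable facts can be sketched in a few lines of Python. This is only an illustration of the idea, not the ISA converter: the triples are plain tuples rather than real RDF, the column headers and their roles are assumed, and the sample names are borrowed from the graph example earlier in the talk without knowing how the original table was laid out.

```python
import csv
import io
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# A one-row table; the column layout is an assumption made for this sketch.
TABLE = (
    "Source Name\tSample Name\tExtract Name\n"
    "Saghantelian_1\tKO1\tKO1_extract\n"
)


def row_to_triples(row: Dict[str, str]) -> List[Triple]:
    """Turn one table row into subject-predicate-object statements."""
    return [
        (row["Sample Name"], "derives from", row["Source Name"]),
        (row["Extract Name"], "derives from", row["Sample Name"]),
    ]


triples: List[Triple] = []
for row in csv.DictReader(io.StringIO(TABLE), delimiter="\t"):
    triples.extend(row_to_triples(row))

for t in triples:
    print(t)
```

Each row position that a human reads as "this extract came from that sample" becomes an explicit statement that a machine can query.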

Page 30

Get ISA-Tab into a database

Share it (or don’t) with the world

Page 31

Database & Web Application

Page 32

Web application

Page 33

Web application

Page 34

Web application

Page 35

Analysis

Last but not least...

Page 36

Package to read ISA-Tab into R, and especially BioConductor, so that you can run analysis scripts on your data...

It can automatically call microarray, mass spec and flow cytometry analysis packages on appropriate datasets...

There is also a script to create Galaxy libraries from ISA-Tab

Available from BioConductor...

Brad Chapman is working on this at HSPH

Dedicated ISAcreator mode. Allows for persistence and perusal of ISA experiments in GenomeSpace

Page 37

isacommons

Stem Cell Commons

Nanotechnology Informatics Working Group

A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:

Page 38

Page 39

ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Philippe Rocca-Serra; Marco Brandizi; Eamonn Maguire; Nataliya Sklyar; Chris Taylor; Kimberly Begley; Dawn Field; Stephen Harris; Winston Hide; Oliver Hofmann; Steffen Neumann; Peter Sterk; Weida Tong; Susanna-Assunta Sansone. Bioinformatics 2010 26: 2354-2356

Towards Interoperable Bioscience Data. Sansone SA, Rocca-Serra P, Field D, Maguire E et al. Nature Genetics 2012

Page 40

Questions??

You can email [email protected]

View our blog: http://isatools.wordpress.com

Follow us on Twitter: @isatools

View our website: http://www.isa-tools.org

View our Git repo & contribute: http://github.com/ISA-tools

Thanks for listening...