Reference Data Integration: A Strategy For The Future Barry Smith National Center for Ontological Research University at Buffalo presented at FIMA, March 21, 2012 1

Reference Data Integration: A Strategy for the Future

DESCRIPTION

2012 FIMA talk

Page 1: Reference Data Integration: A Strategy for the Future

Reference Data Integration: A Strategy For The Future

Barry Smith

National Center for Ontological Research

University at Buffalo

presented at FIMA, March 21, 2012

1

Page 2: Reference Data Integration: A Strategy for the Future

Who am I?

National Center for Biomedical Ontology

based in Stanford Medical School, the Mayo Clinic and Buffalo Department of Philosophy

• Cleveland Clinic Semantic Database
• Duke University Health System
• University of Pittsburgh Medical Center
• German Federal Ministry of Health
• European Union eHealth Directorate
• Plant Genome Research Resource
• Protein Information Resource

Page 3: Reference Data Integration: A Strategy for the Future

Who am I?

National Center for Ontological Research (http://ncor.us)

• Joint Warfighting Center, US Joint Forces Command
• Intelligence and Information Warfare Directorate (I2WD)
• US Department of the Army Net-Centric Data Strategy Center of Excellence
• NextGen (Next Generation Air Transportation System) Ontology Team
• National Nuclear Security Administration (NNSA), Department of Energy

Page 4: Reference Data Integration: A Strategy for the Future

Some questions

• How to find data?
• How to understand data when you find it?
• How to use data when you find it?
• How to compare and integrate with other data?
• How to avoid data silos?

Page 5: Reference Data Integration: A Strategy for the Future

The Web (net-centricity) as part of the solution

• You build a site
• Others discover the site and they link to it
• The more they link, the more well-known the page becomes (Google …)
• Your data becomes discoverable

Page 6: Reference Data Integration: A Strategy for the Future

The roots of Semantic Technology

1. Make your data available in a standard way on the Web

2. Use controlled vocabularies (‘ontologies’) to capture common meanings, in ways understandable to both humans and computers – Web Ontology Language (OWL)

3. Build links among the datasets to create a ‘web of data’
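The three steps can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the example.org IRIs, the two datasets, and their local codes are invented, and plain Python tuples stand in for RDF triples. It shows two independently maintained datasets being tagged with terms from one shared vocabulary, which is what makes their records comparable.

```python
# Toy illustration only: the IRIs, datasets, and local codes below are invented.
from collections import defaultdict

# Hypothetical controlled vocabulary: one stable identifier per common meaning.
VOCAB = {
    "http://example.org/vocab/Bond": "bond",
    "http://example.org/vocab/Equity": "equity",
}

# Two datasets that describe the same kinds of things with different local codes.
dataset_a = [{"id": "a1", "local_type": "BND"}, {"id": "a2", "local_type": "EQ"}]
dataset_b = [{"id": "b7", "kind": "fixed-income bond"}, {"id": "b9", "kind": "common stock"}]

# Each data owner maps its local codes to the shared vocabulary once.
MAP_A = {"BND": "http://example.org/vocab/Bond", "EQ": "http://example.org/vocab/Equity"}
MAP_B = {
    "fixed-income bond": "http://example.org/vocab/Bond",
    "common stock": "http://example.org/vocab/Equity",
}


def annotate(records, field, mapping, base):
    """Yield (subject, predicate, object) triples tagging each record with a shared term."""
    for rec in records:
        yield (base + rec["id"], "rdf:type", mapping[rec[field]])


triples = list(annotate(dataset_a, "local_type", MAP_A, "http://example.org/a/"))
triples += list(annotate(dataset_b, "kind", MAP_B, "http://example.org/b/"))

# Group record identifiers by shared term: this is the link between the datasets.
by_term = defaultdict(list)
for subject, _, term in triples:
    by_term[VOCAB[term]].append(subject)
```

After this runs, `by_term["bond"]` lists records from both datasets, even though neither dataset knows about the other's local codes.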

Page 7: Reference Data Integration: A Strategy for the Future

Controlled vocabularies for tagging (‘annotating’) data

• Hardware changes rapidly
• Organizations rapidly forming and disbanding
• Data is exploding
• Meanings of common words change slowly
• Use web architecture to annotate exploding data stores using ontologies to capture these common meanings in a stable way

Page 8: Reference Data Integration: A Strategy for the Future

Where we stand today

• increasing availability of semantically enhanced data and semantic software
• increasing use of XML, RDF, OWL in attempts to create useful integration of on-line data and information
• “Linked Open Data” the New Big Thing

Page 9: Reference Data Integration: A Strategy for the Future

Ontology success stories, and some reasons for failure

9

Page 10: Reference Data Integration: A Strategy for the Future

as of September 2010

Page 11: Reference Data Integration: A Strategy for the Future

The problem: the more Semantic Technology is successful, the more it fails

The original idea was to break down silos via common controlled vocabularies for the tagging of data

The very success of the approach leads to the creation of ever new controlled vocabularies – semantic silos – as ever more ontologies are created in ad hoc ways

The Semantic Web framework as currently conceived and governed by the W3C yields minimal standardization

Multiplying (Meta)data registries are creating data cemeteries

11

Page 12: Reference Data Integration: A Strategy for the Future

NCBO Bioportal (Ontology Registry)

12

Page 13: Reference Data Integration: A Strategy for the Future

13/24

Page 14: Reference Data Integration: A Strategy for the Future

14/24

Page 15: Reference Data Integration: A Strategy for the Future

Reasons for this effect

• Low incentives for reuse of existing ontologies
• Each organization wants its own ontology
• Poor licensing regime, poor standards, poor training
• People think: Information technology (hardware) is changing constantly, so it’s not worth the effort of getting things right
• People have egos: “We have done it this way for 30 years, we are not going to change now”

Page 16: Reference Data Integration: A Strategy for the Future

Why should you care?

• when there are many ad hoc systems, average quality will be low

• constant need for ad hoc repair through manual effort

• DoD alone spends $6 billion per annum on this problem

• regulatory agencies are recognizing the need for common controlled vocabularies

16/24

Page 17: Reference Data Integration: A Strategy for the Future

So now people are scrambling

• to learn how to create ontologies
• serious lag in creating trained expertise
• poor quality coding leads to poor quality ontologies
• poor quality ontology management

Page 18: Reference Data Integration: A Strategy for the Future

How to do it right?

• how to create an incremental, evolutionary process, where what is good survives?

• how to bring about ontology death?

A success story from biology

18

Page 19: Reference Data Integration: A Strategy for the Future

Old biology data

Page 20: Reference Data Integration: A Strategy for the Future

MKVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSFYEDEKSGLIKVVKFRTGAMDRKRSFEKVVISVMVGKNVKKFLTFVEDEPDFQGGPISKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSLFYLNRGYYNELSFRVLERCHEIASARPNDSSTMRTFTDFVSGAPIVRSLQKSTIRKYGYNLAPYMFLLLHVDELSIFSAYQASLPGEKKVDTERLKRDLCPRKPIEIKYFSQICNDMMNKKDRLGDILHIILRACALNFGAGPRGGAGDEEDRSITNEEPIIPSVDEHGLKVCKLRSPNTPRRLRKTLDAVKALLVSSCACTARDLDIFDDNNGVAMWKWIKILYHEVAQETTLKDSYRITLVPSSDGISLLAFAGPQRNVYVDDTTRRIQLYTDYNKNGSSEPRLKTLDGLTSDYVFYFVTVLRQMQICALGNSYDAFNHDPWMDVVGFEDPNQVTNRDISRIVLYSYMFLNTAKGCLVEYATFRQYMRELPKNAPQKLNFREMRQGLIALGRHCVGSRFETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDVAPQKLNFREMRQGLIALGRHCVGSRFETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDVAPQKLNFREMRQGLIALGRHCVGSRFETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDV

New biology data

Page 21: Reference Data Integration: A Strategy for the Future

Figure: Ontology in PubMed, by year, 2000 to 2010 (vertical axis from 0 to 1200)

Page 22: Reference Data Integration: A Strategy for the Future

By far the most successful: GO (Gene Ontology)

22

Page 23: Reference Data Integration: A Strategy for the Future

23

what cellular component?

what molecular function?

what biological process?

the Gene Ontology is not an ontology of genes

Page 24: Reference Data Integration: A Strategy for the Future

Microarray data shows changed expression of thousands of genes.

How will you spot the patterns?

Figure: gene tree, all genes (14010); axis labels: attacked, time, control; cluster annotations:

• Puparial adhesion, Molting cycle, hemocyanin
• Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes
• Immune response, Toll regulated genes
• Amino acid catabolism, Lipid metabolism
• Peptidase activity, Protein catabolism, Immune response

Page 25: Reference Data Integration: A Strategy for the Future

Why is GO successful?

• built by bench biologists
• multi-species, multi-disciplinary, open source
• compare use of kilograms, meters, seconds in formulating experimental results
• natural language and logical definitions for all terms
• initially low-tech to ensure aggressive use and testing

Page 26: Reference Data Integration: A Strategy for the Future

now used not just in biology but also in hospital research

26

Page 27: Reference Data Integration: A Strategy for the Future

• Lab / pathology data
• EHR data
• Clinical trial data
• Family history data
• Medical imaging
• Microarray data
• Model organism data
• Flow cytometry
• Mass spec
• Genotype / SNP data

How will you spot the patterns?

How will you find the data you need?

Page 28: Reference Data Integration: A Strategy for the Future

over 11 million annotations relating UniProt, Ensembl and other databases to terms in the GO

28

Page 29: Reference Data Integration: A Strategy for the Future

29

Hierarchical view representing relations between represented types
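A hierarchy of types is what lets annotations made at a specific level answer questions asked at a general level. The sketch below is a toy illustration under stated assumptions: the `T:` identifiers, the is_a links between them, and the gene names are placeholders, not real GO identifiers or real annotations. It shows a query for a general term also returning gene products annotated to its more specific descendants.

```python
# Toy hierarchy and annotations; the identifiers are placeholders, not real GO IDs.
IS_A = {
    "T:immune_response": ["T:defense_response"],
    "T:defense_response": ["T:response_to_stimulus"],
    "T:response_to_stimulus": [],
}

# Each pair tags a (hypothetical) gene product with one term.
ANNOTATIONS = [
    ("geneA", "T:immune_response"),
    ("geneB", "T:defense_response"),
    ("geneC", "T:response_to_stimulus"),
]


def ancestors(term):
    """Return the term itself plus every term reachable by following is_a links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, []))
    return seen


def annotated_to(term):
    """Gene products annotated to `term` directly or to any more specific term."""
    return sorted(gene for gene, tagged in ANNOTATIONS if term in ancestors(tagged))
```

Calling `annotated_to("T:response_to_stimulus")` returns all three genes, while `annotated_to("T:immune_response")` returns only the one annotated at that level.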

Page 30: Reference Data Integration: A Strategy for the Future

~ $200 mill. invested in the GO so far

A new kind of biomedical research

Over 11 million GO annotations to biomedical research literature freely available on the web

Powerful software tool support for navigating this data means that what used to take researchers months of data comparison effort can now be performed in milliseconds

Page 31: Reference Data Integration: A Strategy for the Future

If controlled vocabularies are to serve to remove silos

they have to be respected by many owners of data as resources that ensure accurate description of their data

– GO maintained not by computer scientists but by biologists

they have to be willingly used in annotations by many owners of data

they have to be maintained by persons who are trained in common principles of ontology maintenance

31

Page 32: Reference Data Integration: A Strategy for the Future

32

The new profession of biocurator

Page 33: Reference Data Integration: A Strategy for the Future

GO has been amazingly successful

Has created a community consensus

Has created a web of feedback loops where users of the GO can easily report errors and gaps

Has identified principles for successful ontology management

Indispensable to every drug company and every biology lab

33

Page 34: Reference Data Integration: A Strategy for the Future

But GO is limited in its scope

it covers only generic biological entities of three sorts:

– cellular components
– molecular functions
– biological processes

no diseases, symptoms, disease biomarkers, protein interactions, experimental processes …

34

Page 35: Reference Data Integration: A Strategy for the Future

Extending the GO methodology to other domains of biology and medicine

Page 36: Reference Data Integration: A Strategy for the Future

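OBO (Open Biomedical Ontology) Foundry proposal (Gene Ontology in yellow)

Columns, RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT) and OCCURRENT
Rows, GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE

ORGAN AND ORGANISM
– independent continuant: Organism (NCBI Taxonomy), Anatomical Entity (FMA, CARO)
– dependent continuant: Organ Function (FMP, CPRO), Phenotypic Quality (PaTO)
– occurrent: Biological Process (GO)

CELL AND CELLULAR COMPONENT
– independent continuant: Cell (CL), Cellular Component (FMA, GO)
– dependent continuant: Cellular Function (GO)

MOLECULE
– independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
– dependent continuant: Molecular Function (GO)
– occurrent: Molecular Process (GO)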

Page 37: Reference Data Integration: A Strategy for the Future

The strategy of orthogonal modules

(Same RELATION TO TIME by GRANULARITY matrix of ontologies as on the previous slide.)

Page 38: Reference Data Integration: A Strategy for the Future

Ontology: Scope; URL; Custodians

Cell Ontology (CL)
– scope: cell types from prokaryotes to mammals
– URL: obo.sourceforge.net/cgi-bin/detail.cgi?cell
– custodians: Jonathan Bard, Michael Ashburner, Oliver Hofman

Chemical Entities of Biological Interest (ChEBI)
– scope: molecular entities
– URL: ebi.ac.uk/chebi
– custodians: Paula Dematos, Rafael Alcantara

Common Anatomy Reference Ontology (CARO)
– scope: anatomical structures in human and model organisms
– URL: (under development)
– custodians: Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland

Foundational Model of Anatomy (FMA)
– scope: structure of the human body
– URL: fma.biostr.washington.edu
– custodians: JLV Mejino Jr., Cornelius Rosse

Functional Genomics Investigation Ontology (FuGO)
– scope: design, protocol, data instrumentation, and analysis
– URL: fugo.sf.net
– custodians: FuGO Working Group

Gene Ontology (GO)
– scope: cellular components, molecular functions, biological processes
– URL: www.geneontology.org
– custodians: Gene Ontology Consortium

Phenotypic Quality Ontology (PaTO)
– scope: qualities of anatomical structures
– URL: obo.sourceforge.net/cgi-bin/detail.cgi?attribute_and_value
– custodians: Michael Ashburner, Suzanna Lewis, Georgios Gkoutos

Protein Ontology (PrO)
– scope: protein types and modifications
– URL: (under development)
– custodians: Protein Ontology Consortium

Relation Ontology (RO)
– scope: relations
– URL: obo.sf.net/relationship
– custodians: Barry Smith, Chris Mungall

RNA Ontology (RnaO)
– scope: three-dimensional RNA structures
– URL: (under development)
– custodians: RNA Ontology Consortium

Sequence Ontology (SO)
– scope: properties and features of nucleic sequences
– URL: song.sf.net
– custodians: Karen Eilbeck

Page 39: Reference Data Integration: A Strategy for the Future

How to recreate the success of the GO in other areas

1. create a portal for sharing of information about existing controlled vocabularies, needs and institutions operating in a given area

2. create a library of ontologies in this area

3. create a consortium of developers of these ontologies who agree to pool their efforts to create a single set of non-overlapping ontology modules

– one ontology for each sub-area
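The requirement that modules be non-overlapping can be checked mechanically once each module lists the terms it defines. The following is a minimal sketch under invented assumptions: the module names and term lists are hypothetical, and a real check would compare term identifiers and definitions rather than bare labels. It reports every term that more than one module claims.

```python
from collections import defaultdict

# Hypothetical modules and the terms each one claims to define.
MODULES = {
    "instruments": {"bond", "equity", "option"},
    "parties": {"issuer", "counterparty"},
    "events": {"coupon payment", "default", "option"},
}


def overlapping_terms(modules):
    """Map each term defined by more than one module to the sorted module names."""
    owners = defaultdict(list)
    for name in sorted(modules):
        for term in modules[name]:
            owners[term].append(name)
    return {term: names for term, names in owners.items() if len(names) > 1}


# An empty result would mean the set of modules is non-overlapping.
conflicts = overlapping_terms(MODULES)
```

Here `conflicts` flags "option" as claimed by both the "events" and "instruments" modules, which is the kind of overlap a consortium would resolve by assigning the term to exactly one module.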

Page 40: Reference Data Integration: A Strategy for the Future

40

NextGen Ontology Portal

Diagram labels: Portal, Communities, Search, Ontology Library, NextGen Enterprise Ontology

Ontology Portal

• Two-Tiered Registry
– NextGen Ontology – consists of vetted ontologies
– Ontology Library – open to the wider community
• Ontology Metadata
– Ontology owner, domain, and location
• Ontology Search*
– Support ontology discovery
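A two-tiered registry with owner, domain, and location metadata and a search function could be sketched as follows. This is not the NextGen portal's design; the class names, fields, and example entries are all hypothetical, and the search is a plain substring match chosen for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyEntry:
    name: str
    owner: str
    domain: str
    location: str
    vetted: bool  # True: vetted tier; False: open library tier


class Registry:
    """Holds entries from both tiers and supports simple discovery by keyword."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        self._entries.append(entry)

    def search(self, text, vetted_only=False):
        """Return names of entries whose name or domain contains `text`."""
        needle = text.lower()
        return [
            e.name
            for e in self._entries
            if (needle in e.name.lower() or needle in e.domain.lower())
            and (e.vetted or not vetted_only)
        ]


# Invented example entries, one per tier.
registry = Registry()
registry.register(OntologyEntry(
    "Airspace Ontology", "Team A", "air traffic", "http://example.org/airspace", vetted=True))
registry.register(OntologyEntry(
    "Weather Draft", "Team B", "weather", "http://example.org/weather", vetted=False))
```

Keeping both tiers in one store with a `vetted` flag means a single search covers the wider library by default and can be narrowed to vetted ontologies with `vetted_only=True`.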

Page 41: Reference Data Integration: A Strategy for the Future

Developers commit in advance to collaborating with developers of ontologies in adjacent domains and

to working to ensure that, for each domain, there is community convergence on a single ontology

http://obofoundry.org

The OBO Foundry: a step-by-step, principles-based approach

41

Page 42: Reference Data Integration: A Strategy for the Future

OBO Foundry Principles

Common governance

Common training

Robust versioning

Common architecture

42

Page 43: Reference Data Integration: A Strategy for the Future

OBO Foundry Modular Organization

top level
– Basic Formal Ontology (BFO)

mid-level
– Information Artifact Ontology (IAO)
– Ontology for Biomedical Investigations (OBI)
– Ontology of General Medical Science (OGMS)

domain level
– Anatomy Ontology (FMA*, CARO)
– Environment Ontology (EnvO)
– Infectious Disease Ontology (IDO*)
– Biological Process Ontology (GO*)
– Cell Ontology (CL)
– Cellular Component Ontology (FMA*, GO*)
– Phenotypic Quality Ontology (PaTO)
– Subcellular Anatomy Ontology (SAO)
– Sequence Ontology (SO*)
– Molecular Function (GO*)
– Protein Ontology (PRO*)

Page 44: Reference Data Integration: A Strategy for the Future

UCore 2.0 / UCore SL

Extension Strategy

44

top level

mid-level

domain level

Military domain ontologies as extensions of the Universal Core Semantic Layer

Page 45: Reference Data Integration: A Strategy for the Future

Existing efforts to create modular ontology suites

NASA Sweet Ontologies

Military Intelligence Ontology Foundry

Planned OMG efforts:

• OMG (CIA) Financial Event Ontology
• Semantic Layer for ISO 20022 (Financial Industry Message Scheme)

Page 46: Reference Data Integration: A Strategy for the Future

46

Example: Financial Securities Ontology

Mike Bennett (2007)

Page 47: Reference Data Integration: A Strategy for the Future
Page 48: Reference Data Integration: A Strategy for the Future

Basic principles of ontology development

– for formulating definitions
– of modularity
– of user feedback for error correction and gap identification
– for ensuring compatibility between modules
– for using ontologies to annotate legacy data
– for using ontologies to create new data
– for developing user-specific views

Page 49: Reference Data Integration: A Strategy for the Future

Modularity designed to ensure

• non-redundancy
• annotations can be additive
• division of labor among SMEs
• lessons learned in one module can benefit work on other modules
• transferable training
• motivation of SME users

Page 50: Reference Data Integration: A Strategy for the Future

How should the FIMA Reference Data community solve this problem?

Major financial institutions
Major software vendors
Major data management companies
EDMC and government principals

– should pool information about the controlled vocabularies which already exist
– create a common library of these controlled vocabularies
– create a subset of thought leaders who agree to pool their efforts in the creation of a suite of ontology modules for common use
– create a strategy to disseminate and evolve the selected modules
– create a governance strategy to manage the modules over time
– allow bad ontologies to die

Page 51: Reference Data Integration: A Strategy for the Future

Urgent need for trained ontologists

Severe shortage of persons with the needed expertise

University at Buffalo Online Training and Certification Program for Ontologists

for details: [email protected]