Computing Identity Co-reference Across Drug Discovery Datasets

Christian Y A Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J G Gray, and Steve Pettifer

www.openphacts.org [email protected] @open_phacts @gray_alasdair


DESCRIPTION

This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, which raises the question of when one representation is truly the same as another. Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied in the context of their application. We highlight the challenges of automatically computing co-reference and the need to capture the context of each equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
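The idea of letting a consumer choose the notion of equivalence can be sketched as a filter over links that each carry a justification. This is only an illustration: the `Link` class, the dataset prefixes, and the justification labels below are hypothetical and are not taken from the Open PHACTS service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    justification: str  # hypothetical label recording why the link holds


# Illustrative links only; identifiers and justification labels are made up.
LINKS = [
    Link("dsA:1", "dsB:7", "same-structure"),
    Link("dsA:1", "dsC:3", "non-salt-form"),
    Link("dsB:7", "dsD:9", "same-name"),
]


def co_referents(identifier, links, accepted):
    """Return identifiers directly linked to `identifier` by an accepted justification."""
    found = set()
    for link in links:
        if link.justification not in accepted:
            continue  # the consumer's context excludes this kind of link
        if link.source == identifier:
            found.add(link.target)
        elif link.target == identifier:
            found.add(link.source)
    return found


# A strict consumer accepts one justification; a relaxed one accepts two.
strict = co_referents("dsA:1", LINKS, {"same-structure"})
relaxed = co_referents("dsA:1", LINKS, {"same-structure", "non-salt-form"})
```

The same link data serves both consumers; only the set of accepted justifications changes.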


Page 1: Computing Identity Co-Reference Across Drug Discovery Datasets

Computing Identity Co-reference Across Drug Discovery Datasets

Christian Y A Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J G Gray, and Steve Pettifer

www.openphacts.org [email protected] @open_phacts @gray_alasdair

Page 2: Computing Identity Co-Reference Across Drug Discovery Datasets

Multiple Identities

Andy Law's Third Law

“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”

http://bioinformatics.roslin.ac.uk/lawslaws/

10/12/2013 SWAT4LS 2013 2

P12047, X31045, P12047, GB:29384, RS_2353

Are these the same thing?

Page 3: Computing Identity Co-Reference Across Drug Discovery Datasets

Gleevec® = Imatinib Mesylate


DrugBank, ChemSpider, PubChem

Imatinib

Mesylate

Imatinib Mesylate

YLMAHDNUQAMNNX-UHFFFAOYSA-N

Page 4: Computing Identity Co-Reference Across Drug Discovery Datasets


Page 5: Computing Identity Co-Reference Across Drug Discovery Datasets


Page 6: Computing Identity Co-Reference Across Drug Discovery Datasets

Multiple Links: Different Reasons


Link: skos:closeMatch
Reason: non-salt form

Link: skos:exactMatch
Reason: drug name
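One way to picture several links with different reasons is to store the predicate and the reason on every link, so the same pair of records can be connected more than once. The sketch below assumes a simple tuple layout; the record names `ex:recordA`, `ex:recordB` and `ex:recordC` are hypothetical placeholders, and only the two predicate and reason pairs come from the slide.

```python
from typing import NamedTuple


class Link(NamedTuple):
    subject: str
    predicate: str
    obj: str
    reason: str


# Record names are placeholders; the predicate/reason pairs follow the slide.
LINKS = [
    Link("ex:recordA", "skos:closeMatch", "ex:recordB", "non-salt form"),
    Link("ex:recordA", "skos:exactMatch", "ex:recordB", "drug name"),
    Link("ex:recordA", "skos:exactMatch", "ex:recordC", "drug name"),
]


def links_between(first, second, links):
    """Return (predicate, reason) for every link joining the two records, in either direction."""
    return [
        (link.predicate, link.reason)
        for link in links
        if {link.subject, link.obj} == {first, second}
    ]


pairs = links_between("ex:recordA", "ex:recordB", LINKS)
```

Keeping the reason next to the predicate means a later step can decide which of the parallel links to trust for a given task.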

Page 7: Computing Identity Co-Reference Across Drug Discovery Datasets

Open PHACTS Discovery Platform


Drug Discovery Platform

Apps

Domain API

Interactive responses

Production quality integration platform

Method Calls

Page 8: Computing Identity Co-Reference Across Drug Discovery Datasets

OPS Discovery Platform

[Architecture diagram; recoverable labels follow]

Apps

Linked Data API (RDF/XML, TTL, JSON)

Domain Specific Services

Semantic Workflow Engine

Identity Resolution Service

Identifier Management Service

Chemistry Registration, Normalisation & Q/C

Indexing

Data Cache (Virtuoso Triple Store)

Core Platform

Example identifiers and labels: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”

Source formats: RDF, Nanopub, Db, VoID

Sources: Public Content, Commercial, Public Ontologies, User Annotations

Page 9: Computing Identity Co-Reference Across Drug Discovery Datasets

Platform Interaction


Page 10: Computing Identity Co-Reference Across Drug Discovery Datasets

Connectivity of Initial Linksets


Datasets 37

Linksets 104

Links 7,096,712

Justifications 7

Page 11: Computing Identity Co-Reference Across Drug Discovery Datasets

Compound Information


Page 12: Computing Identity Co-Reference Across Drug Discovery Datasets

Genes == Proteins?

BRCA1

Breast cancer type 1 susceptibility protein


http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png

http://en.wikipedia.org/wiki/File:BRCA1_en.png

Page 13: Computing Identity Co-Reference Across Drug Discovery Datasets

Proceed with Caution!


Page 14: Computing Identity Co-Reference Across Drug Discovery Datasets

Co-reference Computation

Rules ensure

• Unrestricted transitivity within conceptual type

• Restrict crossing conceptual types

Based on justifications

Provenance captured

[Diagram: conceptual types Target, Protein and Gene, joined by associations with multiplicities 0..* and 0..1]
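The two rules on this slide can be sketched as a graph search in which links inside one conceptual type are followed freely, while links that cross types are rationed. The rationing rule used here (at most `max_crossings` crossing links per chain) is an assumption made for illustration and is not the rule set from the paper; the identifiers and the `crosses_type` flag are also hypothetical. Each result keeps the chain of identifiers that produced it, as a stand-in for captured provenance.

```python
from collections import deque

# Hypothetical links: (source, target, crosses_type).
LINKS = [
    ("protA:1", "protB:1", False),
    ("protB:1", "protC:1", False),
    ("protC:1", "geneX:1", True),
    ("geneX:1", "geneY:1", False),
    ("geneY:1", "protD:1", True),
]


def build_adjacency(links):
    """Index links in both directions, since co-reference is treated as symmetric."""
    adjacency = {}
    for source, target, crosses in links:
        adjacency.setdefault(source, []).append((target, crosses))
        adjacency.setdefault(target, []).append((source, crosses))
    return adjacency


def co_reference_chains(start, links, max_crossings=1):
    """Map each co-referent identifier to one chain of identifiers that reaches it."""
    adjacency = build_adjacency(links)
    # Search state is (identifier, crossings used so far), explored breadth first.
    queue = deque([(start, 0, [start])])
    seen = {(start, 0)}
    chains = {}
    while queue:
        node, crossings, path = queue.popleft()
        for neighbour, crosses in adjacency.get(node, []):
            used = crossings + (1 if crosses else 0)
            if used > max_crossings or (neighbour, used) in seen:
                continue  # chain would cross types too often, or state already visited
            seen.add((neighbour, used))
            new_path = path + [neighbour]
            if neighbour != start:
                chains.setdefault(neighbour, new_path)  # keep the first chain found
            queue.append((neighbour, used, new_path))
    return chains


chains = co_reference_chains("protA:1", LINKS)
```

With the default limit, `protD:1` is not reported for `protA:1` because reaching it would need a second type-crossing link; raising `max_crossings` to 2 admits it.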

Page 15: Computing Identity Co-Reference Across Drug Discovery Datasets

Connectivity of Initial Linksets


Datasets 37

Linksets 104

Links 7,096,712

Justifications 7

Page 16: Computing Identity Co-Reference Across Drug Discovery Datasets

Connectivity of Computed Linksets


Datasets 37

Linksets 883

Links 17,383,846

Justifications 7
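The effect of the computation can be read off by dividing the computed counts by the initial ones. The short calculation below uses only the figures quoted on the two connectivity slides; the dictionary and function names are illustrative.

```python
# Counts quoted on the "Initial Linksets" and "Computed Linksets" slides.
initial = {"datasets": 37, "linksets": 104, "links": 7_096_712}
computed = {"datasets": 37, "linksets": 883, "links": 17_383_846}


def growth(before, after):
    """Return the ratio after/before for every count."""
    return {key: after[key] / before[key] for key in before}


factors = growth(initial, computed)
for key, factor in factors.items():
    print(f"{key}: x{factor:.2f}")
# prints:
# datasets: x1.00
# linksets: x8.49
# links: x2.45
```

So, over the same 37 datasets, the number of linksets grows roughly 8.5-fold and the number of links roughly 2.4-fold.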

Page 17: Computing Identity Co-Reference Across Drug Discovery Datasets

BridgeDb


Page 18: Computing Identity Co-Reference Across Drug Discovery Datasets

Conclusions

• Computing co-reference is advantageous

– Requires fewer raw linksets

– Larger coverage across datasets

• Rules ensure control

– Genes can equal proteins

– Compounds never equal proteins

• Provenance captured throughout


Page 19: Computing Identity Co-Reference Across Drug Discovery Datasets

Questions

[email protected]/~ajg33 @gray_alasdair

Open PHACTS Project

[email protected] @open_phacts