DESCRIPTION
This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied in the context of their application. We highlight the challenges of automatically computing co-reference and the need to capture the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
Computing Identity Co-reference Across Drug Discovery Datasets
Christian Y A Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J G Gray, and Steve Pettifer
www.openphacts.org [email protected] @open_phacts @gray_alasdair
Multiple Identities
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
10/12/2013 SWAT4LS 2013 2
Example identifiers: P12047, X31045, P12047, GB:29384, RS_2353
Are these the same thing?
Gleevec® = Imatinib Mesylate
Drugbank, ChemSpider, PubChem
Imatinib, Mesylate, Imatinib Mesylate, YLMAHDNUQAMNNX-UHFFFAOYSA-N
Multiple Links: Different Reasons
Link: skos:closeMatch
Reason: non-salt form
Link: skos:exactMatch
Reason: drug name
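The two links above differ in both predicate and reason, and the reason is what lets a consumer decide whether a link suits their task. A minimal sketch of that idea follows, assuming a hypothetical `Link` record and `filter_links` helper (neither is part of the Open PHACTS API) and placeholder identifiers; only the predicates and reasons are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str      # e.g. "skos:exactMatch"
    justification: str  # the reason the link was asserted


# Placeholder identifiers (assumption); predicates and reasons are from the slide.
LINKS = [
    Link("dataset_a:imatinib_mesylate", "dataset_b:imatinib",
         "skos:closeMatch", "non-salt form"),
    Link("dataset_a:imatinib_mesylate", "dataset_c:gleevec",
         "skos:exactMatch", "drug name"),
]


def filter_links(links, accepted_justifications):
    """Keep only the links whose justification the consumer accepts."""
    return [link for link in links if link.justification in accepted_justifications]


# A strict context accepts name matches only; a relaxed one also accepts
# links between a salt and its non-salt form.
strict = filter_links(LINKS, {"drug name"})
relaxed = filter_links(LINKS, {"drug name", "non-salt form"})
print(len(strict), len(relaxed))
```

Keeping the justification on the link itself, rather than folding it into the predicate, is what allows the same linkset to serve both contexts.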
Open PHACTS Discovery Platform
Architecture diagram (labels recovered from the slide):
Drug Discovery Platform: Apps; Domain API (method calls, interactive responses); production quality integration platform
OPS Discovery Platform: Linked Data API (RDF/XML, TTL, JSON); Semantic Workflow Engine; Data Cache (Virtuoso Triple Store)
Core Platform: Domain Specific Services; Identity Resolution Service; Chemistry Registration, Normalisation & Q/C; Identifier Management Service; Indexing
Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
Data sources (Db, RDF, Nanopub, each with VoID): Public Content, Commercial, Public Ontologies, User Annotations
Platform Interaction
Connectivity of Initial Linksets
Datasets 37
Linksets 104
Links 7,096,712
Justifications 7
Compound Information
Genes == Proteins?
BRCA1
Breast cancer type 1 susceptibility protein
http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
http://en.wikipedia.org/wiki/File:BRCA1_en.png
Proceed with Caution!
Co-reference Computation
Rules ensure
• Unrestricted transitivity within conceptual type
• Restrict crossing conceptual types
Based on justifications
Provenance captured
Diagram: conceptual types Target, Protein and Gene, with the links between them carrying multiplicities 0..* and 0..1.
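The two rules on this slide can be sketched as a graph search. The code below is an illustration only, not the Identity Management Service implementation: the identifiers, the `TYPE_OF` table, the `ALLOWED_CROSSINGS` set and the limit of one type crossing per chain are all assumptions made for the example. Chains are followed freely between identifiers of the same conceptual type, and a chain may cross types only for an allowed pair.

```python
from collections import deque

# Assumption: the conceptual type of each (placeholder) identifier.
TYPE_OF = {
    "gene:A": "gene", "gene:B": "gene",
    "protein:A": "protein", "protein:B": "protein",
    "compound:A": "compound",
}

# Assumption: undirected links taken from the raw linksets.
LINKS = [
    ("gene:A", "gene:B"),
    ("gene:B", "protein:A"),
    ("protein:A", "protein:B"),
    ("protein:B", "compound:A"),
]

# Assumption: the only pair of conceptual types a chain may cross.
ALLOWED_CROSSINGS = {frozenset({"gene", "protein"})}


def co_referents(start, links=LINKS, max_crossings=1):
    """Return identifiers reachable from `start` under the sketched rules."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    best = {start: 0}  # fewest type crossings needed to reach each identifier
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            pair = frozenset({TYPE_OF[node], TYPE_OF[nxt]})
            crossing = len(pair) == 2
            if crossing and pair not in ALLOWED_CROSSINGS:
                continue  # e.g. compound <-> protein is never followed
            cost = best[node] + (1 if crossing else 0)
            if cost > max_crossings:
                continue
            if nxt not in best or cost < best[nxt]:
                best[nxt] = cost
                queue.append(nxt)
    return set(best) - {start}


print(sorted(co_referents("gene:A")))
print(sorted(co_referents("gene:A", max_crossings=0)))
print(sorted(co_referents("compound:A")))
```

With these inputs, `gene:A` reaches the other gene and both proteins but not the compound, and `compound:A` reaches nothing. Recording the path taken to each identifier would be one way to capture provenance; the sketch omits it for brevity.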
Connectivity of Initial Linksets
Datasets 37
Linksets 104
Links 7,096,712
Justifications 7
Connectivity of Computed Linksets
Datasets 37
Linksets 883
Links 17,383,846
Justifications 7
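Comparing the initial and computed figures makes the gain concrete: the number of datasets (37) and justifications (7) is unchanged, while linksets and links both grow. A small calculation using only the counts from the two slides (the variable names are illustrative):

```python
# Counts copied from the "Initial" and "Computed" linkset slides.
initial = {"linksets": 104, "links": 7_096_712}
computed = {"linksets": 883, "links": 17_383_846}

# Growth factor of the computed linksets over the initial ones.
growth = {key: computed[key] / initial[key] for key in initial}

for key, factor in growth.items():
    print(f"{key}: x{factor:.2f}")
```

That is roughly 8.5 times as many linksets and about 2.4 times as many links from the same raw input.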
BridgeDb
Conclusions
• Computing co-reference is advantageous
– Requires fewer raw linksets
– Larger coverage across datasets
• Rules ensure control
– Genes can equal proteins
– Compounds never equal proteins
• Provenance captured throughout