Computing Identity Co-reference Across Drug Discovery Datasets
Christian Y A Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J G Gray, and Steve Pettifer
www.openphacts.org @open_phacts @gray_alasdair
Multiple Identities
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
10/12/2013 SWAT4LS 2013 2
P12047  X31045  P12047  GB:29384  RS_2353
Are these the same thing?
Gleevec® = Imatinib Mesylate
DrugBank, ChemSpider, PubChem
Imatinib
Mesylate
Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Multiple Links: Different Reasons
Link: skos:closeMatch, Reason: non-salt form
Link: skos:exactMatch, Reason: drug name
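The idea that each link carries both a predicate and a reason can be sketched as a small data structure. This is an illustrative sketch only: the `Link` class, the `links_for` helper and the `dsA:`/`dsB:`/`dsC:` identifiers are hypothetical and are not taken from the Open PHACTS code base. Only the two predicates and the two reasons come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str     # identifier in one dataset (hypothetical example value)
    target: str     # identifier in another dataset (hypothetical example value)
    predicate: str  # e.g. skos:closeMatch or skos:exactMatch
    reason: str     # why the two identifiers were linked


# Hypothetical links; which datasets they connect is an assumption.
LINKS = [
    Link("dsA:imatinib", "dsB:imatinib_mesylate", "skos:closeMatch", "non-salt form"),
    Link("dsB:imatinib_mesylate", "dsC:gleevec", "skos:exactMatch", "drug name"),
]


def links_for(links, accepted_reasons):
    """Keep only the links whose reason the caller is willing to accept."""
    return [link for link in links if link.reason in accepted_reasons]


strict = links_for(LINKS, {"drug name"})
relaxed = links_for(LINKS, {"drug name", "non-salt form"})
print(len(strict), len(relaxed))
```

Keeping the reason next to the predicate lets a consumer decide per use case whether a salt and its non-salt form should count as the same thing.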
Open PHACTS Discovery Platform
[Figure: architecture of the Open PHACTS (OPS) Discovery Platform, a production-quality integration platform for drug discovery]
Apps
Domain API: method calls, interactive responses
Core Platform: Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Semantic Workflow Engine; Data Cache (Virtuoso Triple Store); Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Indexing
Example inputs: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
Data sources (RDF, Nanopub, Db), each described with VoID: Public Content, Commercial, Public Ontologies, User Annotations
Platform Interaction
Connectivity of Initial Linksets
Datasets 37
Linksets 104
Links 7,096,712
Justifications 7
Compound Information
Genes == Proteins?
BRCA1
Breast cancer type 1 susceptibility protein
http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
http://en.wikipedia.org/wiki/File:BRCA1_en.png
Proceed with Caution!
Co-reference Computation
Rules ensure
• Unrestricted transitivity within conceptual type
• Restrict crossing conceptual types
Based on justifications
Provenance captured
[Figure: conceptual types Target, Protein and Gene, connected by relationships with multiplicities 0..* and 0..1]
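The rules on this slide can be sketched as a composition of linksets to a fixed point. This is a minimal sketch under stated assumptions, not the platform's implementation: the `Linkset` class, the `ALLOWED` table, the dataset names and the use of a set of conceptual types as the justification are all hypothetical. Two linksets A→B and B→C are composed into A→C only when the union of the conceptual types they span is an allowed combination, and each computed linkset records the pair it was derived from as provenance.

```python
from typing import NamedTuple

# Hypothetical rule table: combinations of conceptual types that one
# linkset may span. Gene/protein crossing is allowed; compound/protein is not.
ALLOWED = {
    frozenset({"compound"}),
    frozenset({"gene"}),
    frozenset({"protein"}),
    frozenset({"gene", "protein"}),
}


class Linkset(NamedTuple):
    source: str               # source dataset (hypothetical name)
    target: str               # target dataset (hypothetical name)
    types: frozenset          # conceptual types spanned (stands in for the justification)
    derived_from: tuple = ()  # provenance: the two linksets that were composed


def combine(a, b):
    """Return the types of the composed linkset, or None if the rules forbid it."""
    merged = a.types | b.types
    return merged if merged in ALLOWED else None


def compute_linksets(raw):
    """Compose linksets transitively until no new linkset can be derived."""
    known = {(ls.source, ls.target, ls.types): ls for ls in raw}
    changed = True
    while changed:
        changed = False
        current = list(known.values())
        for a in current:
            for b in current:
                # Only chain A->B with B->C, and skip round trips back to A.
                if a.target != b.source or a.source == b.target:
                    continue
                types = combine(a, b)
                if types is None:
                    continue
                key = (a.source, b.target, types)
                if key not in known:
                    known[key] = Linkset(a.source, b.target, types, (a, b))
                    changed = True
    return list(known.values())


RAW = [
    Linkset("cmpdA", "mixedB", frozenset({"compound"})),
    Linkset("mixedB", "cmpdC", frozenset({"compound"})),
    Linkset("mixedB", "protD", frozenset({"protein"})),
    Linkset("geneE", "protD", frozenset({"gene", "protein"})),
    Linkset("protD", "protF", frozenset({"protein"})),
]

computed = compute_linksets(RAW)
pairs = {(ls.source, ls.target) for ls in computed}
```

In this toy input `mixedB` stands for a dataset holding both compounds and proteins. The compound linkset into it is never chained with the protein linkset out of it, while the gene-to-protein linkset is extended through the protein linkset. Linksets are treated as directed here for brevity.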
Connectivity of Computed Linksets
Datasets 37
Linksets 883
Links 17,383,846
Justifications 7
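The effect of the computation can be read off by comparing these figures with those of the initial linksets. The short sketch below only does that arithmetic; the counts are taken from the two connectivity slides, while the variable names and the `growth` helper are illustrative.

```python
# Counts reported for the initial and the computed linksets.
initial = {"datasets": 37, "linksets": 104, "links": 7_096_712, "justifications": 7}
computed = {"datasets": 37, "linksets": 883, "links": 17_383_846, "justifications": 7}


def growth(before, after):
    """Ratio of each computed count to its initial count."""
    return {name: after[name] / before[name] for name in before}


factors = growth(initial, computed)
for name, factor in factors.items():
    print(f"{name}: x{factor:.2f}")
```

The same 37 datasets and 7 justifications yield roughly 8.5 times as many linksets and about 2.45 times as many links.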
BridgeDb
Conclusions
• Computing co-reference is advantageous
  – Requires fewer raw linksets
  – Larger coverage across datasets
• Rules ensure control
  – Genes can equal proteins
  – Compounds never equal proteins
• Provenance captured throughout