Investigating Term Reuse and Overlap in Biomedical Ontologies
International Conference on Biomedical Ontology, Lisbon, 27th-30th July 2015
Maulik R. Kamdar, Tania Tudorache and Mark A. Musen
Are we there yet?
C0011849 Diabetes Mellitus: Unified Medical Language System (UMLS)
Diabetes Mellitus: SNOMEDCT, ICD9CM
Open Biomedical Ontologies (OBO) Foundry
RNA Binding (GO:0003723): Gene Ontology (GO)
IRI: GO:0003723 in Gene Expression Ontology (GEXO)
xref: Binding to RNA (GRO#BindingToRNA) in Gene Regulation Ontology (GRO)
OBO Reuse vs Overlap in 2010
Ghazvinian, Amir, et al. "How orthogonal are the OBO Foundry ontologies?" J. Biomedical Semantics 2.S-2 (2011): S2.
[Figure: reuse categories Same IRI, Intent for Reuse, Xref mapping; snapshots September 2009 and September 2010]
Key Findings
• ~3% Term Reuse
• Only popular or upper-level ontologies reused
• 14.4% Term Overlap
• Semantically-similar terms reused together
• Similarity metric for a Recommender system
BioPortal Import Plugin
DOG4DAG
Ontofox Web tool
Neurological Disease Ontology (NDO)
Reuse of an Ontology: OBI
Reuse of Terms: OGMS
• BioPortal N-triples dump: 509 biomedical ontologies
• Remove ontology views: 377 ontologies, 5,718,276 class terms
• Extract Terms, Labels, xrefs, CUIs
• Xref Reuse, IRI Reuse, CUI Reuse
• Determine Source Ontology: Source-Target Ontology pairs (>35% reuse for ontology reuse)
• Clustering
• Term Overlap Analysis: Label normalization
14.4% Naïve Term Overlap!
• Normalized String Matching on Term Labels: 14.4% (823,621)
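A minimal sketch of how normalized string matching on term labels could flag overlapping terms. The normalization rules (lowercasing, punctuation to spaces), the function names and the toy records are assumptions for illustration, not the procedure used in the study.

```python
import re
from collections import defaultdict


def normalize_label(label):
    """Lowercase the label and replace every run of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def overlapping_terms(terms):
    """terms: list of (ontology, iri, label) tuples.

    Returns the IRIs whose normalized label occurs in more than one ontology.
    """
    by_label = defaultdict(set)
    for ontology, iri, label in terms:
        by_label[normalize_label(label)].add((ontology, iri))
    overlapping = set()
    for entries in by_label.values():
        if len({ontology for ontology, _ in entries}) > 1:
            overlapping.update(iri for _, iri in entries)
    return overlapping


# Hypothetical toy records (ontology names and IRIs are made up).
terms = [
    ("ONTO_A", "http://example.org/a#1", "Diabetes Mellitus"),
    ("ONTO_B", "http://example.org/b#9", "diabetes_mellitus"),
    ("ONTO_B", "http://example.org/b#10", "RNA binding"),
]
overlap = overlapping_terms(terms)
print(sorted(overlap))
```

Dividing the number of overlapping terms by the total number of class terms would give a naïve overlap percentage of the kind reported on this slide.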
IRI Reuse
• 156/377 ontologies reuse no terms from other ontologies!
• <5% of terms reused from other ontologies!
• 263/377 ontologies have no terms reused by other ontologies!
• Reuse from a small set of ontologies only!
Xref Reuse
• 315/377 ontologies xref link to no terms from other ontologies!
• <5% of terms reused from other ontologies!
• 286/377 ontologies have no terms xref linked by other ontologies!
• Reuse from a small set of ontologies only!
0-5% of total terms are reused explicitly (same IRI) or through xref, with >150 ontologies showing 0% reuse. Average term reuse is ~3%.
Reuse comes from a small set of ontologies only; terms from >250 ontologies are never reused.
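The two kinds of counts above (ontologies that reuse nothing, and ontologies that are never reused) can be sketched as follows. The namespace table, the `source_of` helper and the toy ontologies are hypothetical; the sketch assumes the source ontology of a term can be read off its IRI namespace.

```python
from collections import defaultdict

# Hypothetical namespace -> source ontology table.
NAMESPACES = {
    "http://example.org/A/": "A",
    "http://example.org/B/": "B",
    "http://example.org/C/": "C",
}


def source_of(iri):
    """Return the ontology that owns the IRI namespace, or None if unknown."""
    for namespace, ontology in NAMESPACES.items():
        if iri.startswith(namespace):
            return ontology
    return None


def reuse_summary(ontology_terms):
    """ontology_terms: dict mapping ontology -> set of term IRIs it contains."""
    reuses_from = defaultdict(set)  # target ontology -> source ontologies
    reused_by = defaultdict(set)    # source ontology -> target ontologies
    for target, iris in ontology_terms.items():
        for iri in iris:
            source = source_of(iri)
            if source is not None and source != target:
                reuses_from[target].add(source)
                reused_by[source].add(target)
    no_reuse = sorted(o for o in ontology_terms if not reuses_from.get(o))
    never_reused = sorted(o for o in ontology_terms if not reused_by.get(o))
    return no_reuse, never_reused


toy = {
    "A": {"http://example.org/A/1", "http://example.org/A/2"},
    "B": {"http://example.org/B/1", "http://example.org/A/1"},
    "C": {"http://example.org/C/1", "http://example.org/A/2", "http://example.org/B/1"},
}
no_reuse, never_reused = reuse_summary(toy)
print(no_reuse, never_reused)
```

In this toy case only A reuses nothing and only C is never reused; the same bookkeeping applies to xref links if each xref target is treated like a reused IRI.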
>100% term reuse from some ontologies! Why?
[Figure: Number of Ontologies Reusing Terms (#) per source ontology, IRI Reuse vs xref Reuse. Ontologies: BFO, GO, IAO, OBI, PATO, CHEBI, CL, NCBITAXON, UO, SO, UBERON, CARO, NCIT, FMA, MP, SNOMEDCT]
[Figure: the same 16 ontologies, with % of Terms reused (IRIs) and % of Terms reused (xref). Annotation: BFO:101/39]
… Reuse comes from a small set of popular or upper-level ontologies only; terms from >250 ontologies are never reused.
>100% of terms reused w.r.t. the current versions of the BFO, PATO, CARO, UO and SO ontologies! Needs rigorous analysis through term overlap …
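One way a reuse percentage can exceed 100% is sketched below: the numerator counts every distinct identifier attributed to a source ontology across all reusing ontologies, while the denominator counts only the terms in the source's current version. The identifiers are made up, and this explanation is an assumption consistent with the later slides on versions and notations rather than a confirmed account.

```python
def percent_reused(reused_ids, current_ids):
    """Distinct reused identifiers as a percentage of the current version's terms."""
    return 100.0 * len(set(reused_ids)) / len(set(current_ids))


# Hypothetical source ontology whose current version has two terms.
current_version = {"SRC_0001", "SRC_0002"}

# Identifiers attributed to that source in other ontologies, including an
# older-version identifier and a different notation (both hypothetical).
reused_elsewhere = {"SRC_0001", "SRC_0002", "SRC_v1_Entity", "src#Entity"}

pct = percent_reused(reused_elsewhere, current_version)
not_in_current = reused_elsewhere - current_version
print(pct, sorted(not_in_current))
```

The identifiers left in `not_in_current` are the ones that would need the term-overlap analysis the slide calls for.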
CUI Reuse
Procedural Terminologies do not share CUIs!
[Figure: Number of Terms (Log Scale) per UMLS terminology, grouped by how many terminologies share the CUI: 0, 1-5, 6-10, 11-15, 16-20 Terminologies. Terminologies: ICD10PCS, HCPCS, NCBITAXON, LOINC, MESH, HL7, ICD10CM, OMIM, RXNORM, CPT, PDQ, MEDDRA, ICD9CM, NDDF, ICPC, ICPC2P, MDDB, NDFRT, SNOMEDCT, VANDF, CRISP, RCD, MEDLINE..., SNMI, COSTART, WHO-ART]
Minimal sharing of CUIs, especially across the UMLS procedural terminologies: ICD10PCS, HCPCS and CPT.
Several unique terms are introduced in the migration from ICD9CM to ICD10CM, leading to a decrease in term reuse.
Should there actually be Term Reuse?
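A minimal sketch of how terms could be binned by the number of other terminologies sharing their CUI. The bin edges follow the ranges named on the CUI Reuse slides; the ">20" catch-all bin, the function names and the CPT record are assumptions for illustration.

```python
from collections import Counter, defaultdict

BINS = [(0, 0), (1, 5), (6, 10), (11, 15), (16, 20)]


def bin_label(n):
    """Map a count of sharing terminologies to its bin label."""
    for low, high in BINS:
        if low <= n <= high:
            return str(low) if low == high else f"{low}-{high}"
    return ">20"  # assumed catch-all bin


def cui_sharing_histogram(term_cuis):
    """term_cuis: list of (terminology, cui) pairs.

    Returns terminology -> Counter mapping bin label to number of terms.
    """
    terminologies_by_cui = defaultdict(set)
    for terminology, cui in term_cuis:
        terminologies_by_cui[cui].add(terminology)
    histogram = defaultdict(Counter)
    for terminology, cui in term_cuis:
        others = len(terminologies_by_cui[cui]) - 1
        histogram[terminology][bin_label(others)] += 1
    return histogram


records = [
    ("SNOMEDCT", "C0011849"),  # Diabetes Mellitus, as on the opening slide
    ("ICD9CM", "C0011849"),
    ("CPT", "C9999999"),       # hypothetical CUI used by no other terminology
]
histogram = cui_sharing_histogram(records)
print({name: dict(counts) for name, counts in histogram.items()})
```

A terminology whose terms all fall in the "0" bin, like the hypothetical CPT record here, shares no CUIs with the others.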
Overlap decreases using correct representations!
• Normalized String Matching on Term Labels: 14.4% (823,621)
• Removing Explicitly Reused Terms: 13.2% (752,176)
• Removing Terms Mapped to the same UMLS CUI: 10.8% (617,509)
• Removing almost-similar terms (same identifier and source ontology but different representation): 1.6% (93,650)
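The percentages above can be checked against the term counts, assuming the denominator is the 5,718,276 class terms reported earlier in the deck (that choice of denominator is an inference, though the arithmetic agrees with it).

```python
TOTAL_CLASS_TERMS = 5_718_276  # class terms across the 377 ontologies

steps = [
    ("Normalized string matching on term labels", 823_621),
    ("Removing explicitly reused terms", 752_176),
    ("Removing terms mapped to the same UMLS CUI", 617_509),
    ("Removing almost-similar terms", 93_650),
]

# Percentage of all class terms still counted as overlapping after each step.
percentages = [round(100 * count / TOTAL_CLASS_TERMS, 1) for _, count in steps]

for (name, count), pct in zip(steps, percentages):
    print(f"{name}: {count:,} ({pct}%)")
```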
Average 3% Term reuse across ontologies using any method, yet a 14.4% naïve Term overlap!
Term overlap decreases substantially on removing almost similar terms …
Examples for almost similar terms?
Ontology Engineers show an intent for reuse!
Intent:
• Different Versions (BFO, NCIT)
  – Version 1.0 / Version 1.1: Subcellular Anatomy Ontology (SAO), Suggested Ontology for Pharmacogenomics (SOPHARM)
  – NCIT:C53037 / NCIT:Cerebral_Vein: Cigarette Smoke Exposure (CSEO), Sage Bionetworks Synapse (SYN)
• Different Notations (FMA)
  – OBO:FMA_31396
  – OBO:owlapi/fma#FMA_31396
  – OBO:owl/FMA#FMA_31396
  – OBO:fma#Cartilage_of_inferior_surface …
• Different Namespaces (MESH, SNOMEDCT)
  – http://purl.bioontology.org/ontology/MESH
  – http://phenomebrowser.net/ontologies/mesh/mesh.owl
  – http://ihtsdo.org/snomedct/
  – http://purl.bioontology.org/ontology/SNOMEDCT
Different versions, notations, namespaces:
• >100% Reuse of few source ontologies
• Increase in Term Overlap
Incorrect representations without mappings do not provide advantages of Term Reuse!
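A minimal sketch of how such almost-similar representations could be collapsed to one identifier before measuring overlap. The notation strings and namespaces are the ones quoted on the slides; the `FMA:<number>` output form, the choice of canonical namespace and the function names are assumptions.

```python
import re

# Alias namespace -> namespace treated as canonical (the direction is an assumption).
NAMESPACE_ALIASES = {
    "http://phenomebrowser.net/ontologies/mesh/mesh.owl": "http://purl.bioontology.org/ontology/MESH",
    "http://ihtsdo.org/snomedct/": "http://purl.bioontology.org/ontology/SNOMEDCT",
}

FMA_NUMERIC_ID = re.compile(r"FMA_(\d+)$")


def canonical_fma(identifier):
    """Return 'FMA:<number>' if the identifier ends in a numeric FMA ID, else None."""
    match = FMA_NUMERIC_ID.search(identifier)
    return f"FMA:{match.group(1)}" if match else None


def canonical_namespace(iri):
    """Return the canonical namespace for an IRI in a known namespace, else None."""
    for alias, canonical in NAMESPACE_ALIASES.items():
        if iri.startswith(alias) or iri.startswith(canonical):
            return canonical
    return None


notations = ["OBO:FMA_31396", "OBO:owlapi/fma#FMA_31396", "OBO:owl/FMA#FMA_31396"]
canonical_ids = {canonical_fma(n) for n in notations}
print(canonical_ids)
```

Label-based notations such as `OBO:fma#Cartilage_of_inferior_surface …` carry no numeric ID, so `canonical_fma` returns `None` for them; resolving those would need an explicit label-to-identifier mapping.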
Understanding how Term Reuse Occurs
Term-Ontology Matrix
        Onto 1  Onto 2  Onto 3  Onto 4  Onto 5  Onto 6  Onto 7
Term 1  1       1       1       0       0       0       0
Term 2  0       0       0       1       1       0       0
Term 3  0       0       0       0       0       1       1
Term 4  1       1       0       0       1       0       0
Term 5  1       1       1       0       0       0       1
Term 6  0       0       0       1       1       1       0
Term 7  0       0       1       0       1       0       0
Term-Ontology Matrix → K-modes Clustering
Term-Term Affinity Matrix → Spectral Clustering
• Weighted Similarity Score between Term pairs
  – Shared Ontologies
  – Jaccard Semantic Similarity Score
  – CUI Hierarchy from UMLS Metathesaurus
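To illustrate the shared-ontologies component, the sketch below builds a term-term affinity matrix from the example term-ontology matrix using a Jaccard score over the sets of ontologies containing each term. Applying Jaccard to ontology sets is an assumption made here for illustration; the weighting and the CUI-hierarchy component of the actual score are not reproduced.

```python
# Rows are terms, columns are Onto 1..Onto 7, as in the example matrix.
MATRIX = {
    "Term 1": [1, 1, 1, 0, 0, 0, 0],
    "Term 2": [0, 0, 0, 1, 1, 0, 0],
    "Term 3": [0, 0, 0, 0, 0, 1, 1],
    "Term 4": [1, 1, 0, 0, 1, 0, 0],
    "Term 5": [1, 1, 1, 0, 0, 0, 1],
    "Term 6": [0, 0, 0, 1, 1, 1, 0],
    "Term 7": [0, 0, 1, 0, 1, 0, 0],
}


def jaccard(row_a, row_b):
    """Ontologies containing both terms divided by ontologies containing either."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return both / either if either else 0.0


terms = list(MATRIX)
affinity = {(s, t): jaccard(MATRIX[s], MATRIX[t]) for s in terms for t in terms}

print(affinity[("Term 1", "Term 5")])  # 3 shared of 4 ontologies -> 0.75
print(affinity[("Term 1", "Term 2")])  # no shared ontologies -> 0.0
```

A symmetric matrix of this kind is the usual input to a spectral clustering step.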
Semantically-similar terms are reused together!
[Figure: clusters by Cluster Size, Semantic Similarity < 0.9 vs Semantic Similarity > 0.9]
Semantically-similar terms (parent-child or siblings) are reused together …
The similarity metric and BioPortal can be used to provide recommendations to ontology developers through a WebProtégé plugin!
Challenges to Term Reuse
• Substantial term overlap but less than 5% reuse.
• Lexically-similar terms may represent different concepts (e.g., anatomical concepts in ZFA and XAO).
• Lexically-different terms may represent the same concept (e.g., myocardium and cardiac muscle).
• The same terms use different IRI representations, without explicit CUI or xref mappings.
• Lack of guidelines and semi-automated tools.
Future Work: WebProtégé Plugin
Term reuse recommendations using an item-based Collaborative Filtering method.
Two-fold (A Posteriori and User-Centered) Evaluation
GO:0033036
GO:0008104
GO:1902432
GO:1903260
GO:0061472
GO:0090174
GO:0071850
GO:0044770
GO:0044839
GO:0045786
GO:0007050
GO:0044843
GO:1902969
GO:0036226
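A minimal sketch of item-based collaborative filtering for term recommendation, treating terms as items and ontologies as users. The cosine similarity, the scoring rule and the toy matrix are assumptions for illustration and do not describe the planned plugin.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length 0/1 vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(matrix, ontology_index, top_n=3):
    """matrix: dict mapping term -> list of 0/1 flags, one per ontology.

    Scores each term missing from the target ontology by its summed similarity
    to the terms the ontology already contains, then returns the best-scoring
    terms (ties broken alphabetically).
    """
    present = [term for term, row in matrix.items() if row[ontology_index]]
    scores = {}
    for candidate, row in matrix.items():
        if row[ontology_index]:
            continue
        scores[candidate] = sum(cosine(row, matrix[term]) for term in present)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, score in ranked[:top_n] if score > 0]


# Hypothetical terms across four hypothetical ontologies.
toy = {
    "T_A": [1, 1, 1, 0],
    "T_B": [1, 1, 0, 0],
    "T_C": [0, 0, 1, 1],
    "T_D": [0, 0, 0, 1],
}
suggestions = recommend(toy, ontology_index=2)
print(suggestions)
```

Here the third ontology already contains T_A and T_C, so T_B (often co-reused with T_A) ranks ahead of T_D (co-reused with T_C only once).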
The Road Ahead …
- Still far from achieving ideal term reuse beyond upper-level and popular ontologies
- Newer ontologies added in BioPortal
- Without strict guidelines and semi-automated tools, we will deviate further away …
Acknowledgments
Musen Lab, Stanford
BMI PhD Program, Stanford
US NIH Grants GM086587, GM103316
http://stanford.edu/~maulikrk/data/OntologyReuse