
Ref. code: 25595822040902FWM

CONCEPT NAME SIMILARITY MEASURE ON

SNOMED CT

BY

HTET HTET HTUN

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE

REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

(ENGINEERING AND TECHNOLOGY)

SIRINDHORN INTERNATIONAL INSTITUTE OF TECHNOLOGY

THAMMASAT UNIVERSITY

ACADEMIC YEAR 2016


Abstract

CONCEPT NAME SIMILARITY MEASURE ON SNOMED CT

by

HTET HTET HTUN

B.C.Sc.: Bachelor of Computer Science; M.C.Sc.: Master of Computer Science, School of Information, Computer and Communication Technology (ICT), University of Computer Studies (UCSM), Mandalay, Myanmar, 2016

Measuring the semantic similarity between concepts by exploiting medical ontologies is an essential task for extracting medical information and for knowledge discovery. One important application is a health decision support system that recommends similar or alternative treatments for disease concepts from medical ontologies according to their similarity degrees. Existing similarity measures estimate similarity from the taxonomical path length between the two evaluated concepts and their distance within the ontology hierarchy. However, taxonomic-based similarity measures are not suitable for all concepts in an ontology, because an ontology includes “primitive concepts” that carry a limited amount of information and whose definitions do not sufficiently distinguish them from other concepts. Therefore, measuring similarity based on taxonomical paths cannot give the desired similarity degrees for all ontology concepts. For this reason, we propose a new concept name similarity measure based on the semantic and syntactic similarities of the concept label. The proposed measure is mainly intended for primitive concept similarity. To examine its accuracy, we calculate correlation and error measurements against human expert results. Moreover, we compare the results of our proposed method with those of existing taxonomic-based similarity measures that achieved the highest correlation values among most existing measures in the literature. The experiments show that our method achieves the highest correlation with human expert results and outperforms previous similarity measures. Additionally, the experimental results show that our proposed method is suitable for all types of ontology concepts: defined concepts and primitive concepts.

Keywords: Concept Name Similarity Measure, Text Similarity, Natural Language

Processing, SNOMED CT, Semantic Similarity



Acknowledgements

I would like to express my sincere gratitude to my advisor, Dr. Virach Sornlertlamvanich, for his valuable advice, support, encouragement, kindness and patience throughout my study.

My thanks also go to the committee members, Dr. Marut Buranarach and Dr. Nguyen Duy Hung, for their valuable comments, support and guidance.

I also want to thank all faculty members, seniors and friends for their encouragement, discussions and assistance during my studies.

I would like to thank my parents for their kindness, love, valuable support, understanding and strength throughout my life.

Finally, I gratefully acknowledge Sirindhorn International Institute of Technology (SIIT), Thammasat University (TU), for giving me the chance to obtain my second Master's degree.



Table of Contents

Chapter Title

Signature Page
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
    1.1 Text Similarity
        1.1.1 WordNet
    1.2 Biomedical Knowledge Sources and Ontologies
        1.2.1 UMLS (Unified Medical Language System)
        1.2.2 SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms)
        1.2.3 MeSH (Medical Subject Headings)

2 Literature Review
    2.1 Text Similarity Measures
        2.1.1 Unordered-based Text Similarity Measures
            (i) Jaccard Similarity Coefficient
            (ii) Cosine Similarity Coefficient
            (iii) Szymkiewicz-Simpson Coefficient
            (iv) Tversky Coefficient
            (v) Difflib Similarity
        2.1.2 Ordered-based Text Similarity Measures
            (i) Levenshtein Distance
    2.2 Ontology-based Semantic Similarity Measures
        2.2.1 Taxonomic-based Similarity Measures
            (i) Leacock and Chodorow
            (ii) Wu and Palmer
            (iii) Choi and Kim
            (iv) Al-Mubaid and Nguyen
            (v) A New Path-based Similarity Measure
        2.2.2 Description Logic ELH Semantic Similarity Measure (ELSIM)

3 Concept Name Similarity Measure on SNOMED CT
    3.1 Concept Name Similarity Measure on SNOMED CT
        3.1.1 Semantic Similarity (Linguistic Headword Structure)
        3.1.2 Syntactic Similarity (Context-free Grammar)
        3.1.3 Proposed Similarity Measure

4 Experimental Results and Discussion
    4.1 Preliminary Experiment
    4.2 Main Experiment on SNOMED CT
        4.2.1 Experiments between Primitive Concepts
        4.2.2 Experiments between Defined and Primitive Concepts
        4.2.3 Experiments between Defined Concepts
    4.3 Discussion
        4.3.1 Limitations

5 Conclusions and Recommendations

References

Appendices


List of Tables

Tables

3.1 Incorrect similarity degree between primitive concepts using existing two similarity measures

3.2 Different weights of concept P1

3.3 Different weights of concept P2

4.1 Results of similarity degrees for all categories of SNOMED CT based on text similarity measures

4.2 Results of 30 pairs of concepts between primitive concepts estimated by path-based, ELSIM, our proposed method and human expert

4.3 Results of 30 pairs of concepts between primitive and defined concepts estimated by path-based, ELSIM, our proposed method and human expert

4.4 Results of 30 pairs of concepts between defined concepts estimated by path-based, ELSIM, our proposed method and human expert

4.5 Correlation values between similarity measures and human expert for each case

4.6 Error values between similarity measures and human expert for each case

4.7 Different similarity results between concepts using our proposed measure with human expert results

4.8 Similarity degrees between concepts using our proposed measure with human expert results


List of Figures

Figures

2.1 Text Similarity Measures

3.1 Overview system of concept name similarity measure on SNOMED CT

3.2 Notion of proposed similarity measure


Chapter 1 Introduction

1.1 Text Similarity

Measuring the similarity between word pairs has been extensively studied in many areas such as natural language processing [1], machine translation, text classification and summarization, query reformulation, knowledge acquisition and information retrieval [2]. However, an increasing number of tasks require computing the similarity between two strings. Generally, the notion of similarity refers to lexical similarity, based on the total overlap between vocabularies and common words. Additionally, similarity can be divided into “semantic similarity”, based on meaning or semantic content, and “syntactic similarity”, based on string format or syntactical representation. Consequently, various approaches have recently been proposed for measuring concept similarity by using various knowledge sources (ontologies, domain corpora, thesauri, etc.) [3], because knowledge sources provide a structured, unambiguous representation and a formal conceptualization of knowledge. A general-purpose thesaurus of the English language has also been successfully applied to assessing word similarity.

1.1.1 WordNet

WordNet is a semantic lexical database for the English language developed at Princeton University. In WordNet [4], the nouns, adjectives, verbs and adverbs of English are arranged into synonym sets (synsets) with short definitions (glosses). The synsets, or concepts, are connected to other synsets in the taxonomy by various types of relationship. The usual relationships are Hyponym/Hypernym (the is-a relation) and Meronym/Holonym (the part-of relation).

However, similarity measures based on WordNet perform poorly on biomedical terms because of its limited coverage of specialized domain terms. There are many biomedical ontologies that provide the concept ids, terms, synonyms and definitions used in clinical documentation and reporting in the biomedical field.


1.2 Biomedical Knowledge Sources and Ontologies

1.2.1 UMLS (Unified Medical Language System)

The UMLS was established by the United States National Library of Medicine and includes a set of files that describe health and biomedical vocabularies. UMLS includes three knowledge sources (the Metathesaurus, the Semantic Network and the SPECIALIST Lexicon) and software tools for accessing them [5]. SNOMED CT is one of the vocabularies in the Metathesaurus, and the Semantic Network provides a set of broad categories for a consistent categorization of all concepts. The SPECIALIST Lexicon includes terms with linguistic information that identify the biomedical and healthcare domain. UMLS is freely accessible for research purposes, but a license is required.

1.2.2 SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms)

SNOMED CT is a standard biomedical terminology [6] supported by the International Health Terminology Standards Development Organization (IHTSDO), which validates the contents every 6 months. SNOMED CT covers all areas of clinical information, which are organized into 18 top-level categories: body structure, context-dependent, environment, event, finding, observable entity, organism, physical force, physical object, procedure, product, qualifier value, social concept, special concept, specimen, staging scale, substance and disease.

In SNOMED CT, concepts are organized in a hierarchy with various levels of specificity. There are 65 different relationship types, and concepts are connected by two main relations: the “is-a” relation and the “part of” relation [7]. Each concept is uniquely identified by a concept ID (e.g., id=19036004), annotated with a short textual description (e.g., “rheumatic heart valve stenosis”) and equipped with a definition. The SNOMED CT ontology has two kinds of concepts: defined concepts and primitive concepts. The DL version released in January 2005 contains 364,461 concept names.


1.2.3 MeSH (Medical Subject Headings)

MeSH is a hierarchical structure of biomedical concepts created by the United States National Library of Medicine (NLM). It was introduced in 1960, with the NLM’s own index catalogue [8]. MeSH terms are arranged in an “is-a” hierarchy, with more general terms (e.g., “chemicals and drugs”) higher in the hierarchy than more specific terms (e.g., “aspirin”). It includes 15 taxonomies with more than 22,000 terms (version 2004), and each concept can occur in more than one hierarchy. Each term is described by several features; the main ones are the MeSH Heading (MH), the Scope Note and the Entry Terms. Each concept is identified by its MeSH code name, which shows the precise location of the term in a MeSH hierarchy.


Chapter 2 Literature Review

Over the years, many text similarity approaches have been proposed for various applications. Lexical similarity, or surface-matching similarity, measures are the basic building blocks of text similarity. More recently, several similarity approaches have used different knowledge sources as their background ontologies. Some of them have been adapted to medical research by drawing clinical information from biomedical ontologies such as SNOMED CT or MeSH. In this chapter, we review previous work on both basic text similarity measures and semantic similarity measures based on medical ontologies.

2.1 Text Similarity Measures

The most basic text similarity approaches can be categorized into two types: unordered-based methods and ordered-based methods.

Figure 2.1 Text Similarity Approaches (unordered-based: Jaccard, Cosine, Simpson, Tversky, Difflib; ordered-based: Levenshtein)


2.1.1 Unordered-based Text Similarity Measures

These similarity approaches do not consider the order of the words when comparing two strings. They compute the similarity from the total overlap of words between the strings. In the equations below, tset(.) denotes the set of words of a string.

(i) Jaccard Similarity Coefficient

The Jaccard similarity [9] is defined as the ratio between the intersection and the union of the two sets, as shown in Equation 2.1.

\[ \mathrm{tsim}_{\mathrm{Jaccard}}(A,B) = \frac{|\mathrm{tset}(A) \cap \mathrm{tset}(B)|}{|\mathrm{tset}(A) \cup \mathrm{tset}(B)|} \tag{2.1} \]

(ii) Cosine Similarity Coefficient

This similarity measure [10] computes the cosine of the angle between two vectors of an inner product space. It is defined by the dot product of the word vectors and their magnitudes ||.||, as in Equation 2.2.

\[ \mathrm{tsim}_{\mathrm{Cosine}}(A,B) = \frac{\mathrm{tset}(A) \cdot \mathrm{tset}(B)}{\|\mathrm{tset}(A)\| \, \|\mathrm{tset}(B)\|} \tag{2.2} \]

(iii) Szymkiewicz-Simpson Coefficient

This method finds the overlap between two strings as the ratio of the cardinality of the intersection to the smaller of the cardinalities of the two sets [11]. If one set is a subset of the other, the overlap coefficient is equal to one.

\[ \mathrm{tsim}_{\mathrm{Simpson}}(A,B) = \frac{|\mathrm{tset}(A) \cap \mathrm{tset}(B)|}{\min(|\mathrm{tset}(A)|, |\mathrm{tset}(B)|)} \tag{2.3} \]

(iv) Tversky Coefficient

It is an asymmetric similarity measure on sets [12]. The numerator represents the commonality between the two sets and the denominator represents the referent for comparison, as in Equation 2.4. In the Tversky index, \(\alpha = \beta = 1\) gives the Jaccard similarity and \(\alpha = \beta = 0.5\) gives the Dice similarity.

\[ \mathrm{tsim}_{\mathrm{Tversky}}(A,B) = \frac{|\mathrm{tset}(A) \cap \mathrm{tset}(B)|}{|\mathrm{tset}(A) \cap \mathrm{tset}(B)| + \alpha\,|\mathrm{tset}(A) - \mathrm{tset}(B)| + \beta\,|\mathrm{tset}(B) - \mathrm{tset}(A)|} \tag{2.4} \]

(v) Difflib Similarity

Difflib similarity is defined as twice the number of matching words (M) divided by the total number of words (T) in both strings [13]. Difflib works on multisets, denoted by tmset(.). M is the cardinality of the intersection of the multisets, and T is the sum of their cardinalities, as follows.

\[ M = |\mathrm{tmset}(A) \cap \mathrm{tmset}(B)| \]

\[ T = |\mathrm{tmset}(A)| + |\mathrm{tmset}(B)| \]

\[ \mathrm{tsim}_{\mathrm{Difflib}}(A,B) = 2 \times M / T \tag{2.5} \]

2.1.2 Ordered-based Text Similarity Measures

These measures take into account not only the common words but also the order in which the words occur in the strings. For lexical similarity they therefore give lower similarity values between two strings than the unordered-based measures.

(i) Levenshtein Distance

It is the edit distance: the smallest number of operations (insertions, deletions and substitutions) required to convert the source string (s) into the target string (t) [14]. It is computed with a matrix that measures the difference between the two sequences, as in the following algorithm. The distance is the value in the lower right-hand corner of the matrix.


Algorithm LevenshteinDistance (s,t)

Input: Two lists of words s,t

Output: distance d

Initialization:

    len1 ← length(s)
    len2 ← length(t)
    D[ ][ ] ← array of size (len1 + 1) × (len2 + 1)
    D[i, 0] ← i for i ← 0 to len1
    D[0, j] ← j for j ← 0 to len2

Processing:

    for i ← 1 to len1 do
        for j ← 1 to len2 do
            if s[i] equals t[j] then cost ← 0
            else cost ← 1
            D[i, j] ← minimum of:
                i) D[i − 1, j] + 1
                ii) D[i, j − 1] + 1
                iii) D[i − 1, j − 1] + cost

Termination:

    D[len1, len2] is the distance d

To obtain a similarity from the Levenshtein distance, the distance d is first normalized to a value \(d_{norm}\) in the range from 0 to 1, as in Equation 2.6, where lendiff is the difference between the lengths of the two lists.

\[ d_{norm}(s,t) = \frac{d - \mathit{lendiff}}{\min(|s|, |t|)} \tag{2.6} \]

After obtaining \(d_{norm}\), there are two ways to calculate the similarity:

\[ \mathrm{tsim}_{\mathrm{Leven1}}(A,B) = 1 - d_{norm}(\mathrm{tlist}(A), \mathrm{tlist}(B)) \tag{2.7} \]

\[ \mathrm{tsim}_{\mathrm{Leven2}}(A,B) = \left( \frac{1}{1 + d_{norm}(\mathrm{tlist}(A), \mathrm{tlist}(B))} \right) \times 2 - 1 \tag{2.8} \]
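The measures of Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the thesis. It assumes that a string is tokenized by lowercasing and splitting on whitespace (the helper `tokens` is hypothetical) and that the cosine coefficient is taken over binary word vectors built from tset(.), so the dot product reduces to the size of the intersection. The Levenshtein distance is computed over words, following the algorithm above. Empty inputs are not handled.

```python
import math
from collections import Counter


def tokens(name):
    """Hypothetical tokenizer: lowercase words, i.e. tlist(.) in the text."""
    return name.lower().split()


def jaccard(a, b):
    """Equation 2.1 on word sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Equation 2.2, assuming binary word vectors."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))


def simpson(a, b):
    """Equation 2.3: overlap divided by the smaller set size."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / min(len(sa), len(sb))


def tversky(a, b, alpha=1.0, beta=1.0):
    """Equation 2.4; alpha = beta = 1 is Jaccard, alpha = beta = 0.5 is Dice."""
    sa, sb = set(tokens(a)), set(tokens(b))
    common = len(sa & sb)
    return common / (common + alpha * len(sa - sb) + beta * len(sb - sa))


def difflib_sim(a, b):
    """Equation 2.5: 2M / T on word multisets."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    m = sum((ca & cb).values())           # matching words M
    t = sum(ca.values()) + sum(cb.values())  # total words T
    return 2 * m / t


def levenshtein(s, t):
    """Word-level edit distance between two lists of words."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(s)][len(t)]


def leven_sims(a, b):
    """Equations 2.6 to 2.8: returns (Leven1, Leven2)."""
    s, t = tokens(a), tokens(b)
    d_norm = (levenshtein(s, t) - abs(len(s) - len(t))) / min(len(s), len(t))
    return 1 - d_norm, 2 / (1 + d_norm) - 1


a = "right main coronary artery thrombosis"
b = "superior mesenteric vein thrombosis"
print(jaccard(a, b), simpson(a, b), leven_sims(a, b))
```

For these two concept names, which share only the word “thrombosis”, the sketch gives a Jaccard value of 0.125, a Simpson value of 0.25 and a Leven1 value of 0.25.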


2.2 Ontology-based Semantic Similarity Measures

Recently, knowledge sources and ontologies have been widely used in similarity research because they provide a structured and unambiguous representation of concepts interconnected by semantic pointers. The basic idea for computing concept similarity relies on the taxonomical structure, for example the minimum path length between the evaluated concepts. In this section, we discuss several taxonomic-based similarity measures. Moreover, because some ontologies are written in the Description Logic ELH, we also review an ELH semantic similarity measure for the SNOMED CT ontology.

2.2.1 Taxonomic-based Similarity Measures

Ontologies are directed graphs in which concepts are connected mainly by taxonomic (is-a) links and by other semantic links. Therefore, the basic way to find concept similarity is the taxonomic-based measure. In a taxonomy, the common way to determine the distance between two concepts c1 and c2 is to calculate the length of the shortest path connecting them [15].

\[ \mathrm{dis}_{PL}(c_1,c_2) = \text{minimum number of taxonomical edges connecting } c_1 \text{ and } c_2 \tag{2.9} \]

(i) Leacock and Chodorow

This measure takes the minimum path length Np between the two concepts c1 and c2, counting the concepts themselves, and the maximum depth D of the ontology [16], [17].

\[ \mathrm{sim}_{L\&C}(c_1,c_2) = -\log(N_p / 2D) \tag{2.10} \]

(ii) Wu and Palmer

It is a path-based measure that uses the depth of the two terms in the taxonomy, where N1 and N2 are the numbers of “is-a” relations from concepts c1 and c2 to their least common subsumer (LCS), and N3 is the depth from the LCS to the root of the ontology [16], [18].

\[ \mathrm{sim}_{W\&P}(c_1,c_2) = \frac{2 \times N_3}{N_1 + N_2 + 2 \times N_3} \tag{2.11} \]

(iii) Choi and Kim

This approach is also a taxonomic-based measure [19]. It is based on the difference between the depth levels of the two concepts c1 and c2 and on the length of the minimum path between them, as shown in Equation 2.12.

\[ \mathrm{sim}_{CK}(c_1,c_2) = \frac{\mathrm{MAX\_PATH} - \mathrm{path}(c_1,c_2)}{\mathrm{MAX\_PATH}} \times \frac{\mathrm{MAX\_LEVEL} - \mathrm{diff\_level}(c_1,c_2)}{\mathrm{MAX\_LEVEL}} \tag{2.12} \]

(iv) Al-Mubaid and Nguyen

This approach takes into account the depth of the concept nodes and the path length between them [20]. It uses the level of their least common subsumer (lcs) and the length l(c1, c2) of the minimum path between them.

\[ \mathrm{sim}(c_1,c_2) = \log_2\big( [\, l(c_1,c_2) - 1 \,] \times [\, D - \mathrm{depth}(\mathrm{lcs}(c_1,c_2)) \,] + 2 \big) \tag{2.13} \]

(v) A New Path-based Similarity Measure

This measure calculates the similarity from the taxonomic paths connecting the two concepts, considering all ancestors on all taxonomic paths between them [21]. It is based on the idea that a pair of concepts located at an upper level of the hierarchy shares few ancestors, so its similarity degree should be lower than that of a pair of concepts at a lower level, which shares more ancestors. It calculates the similarity between concepts c1 and c2 as the ratio of the amount of non-shared knowledge to the total of shared and non-shared knowledge, and takes the inverted logarithm of this ratio, as shown in Equation 2.14.

\[ \mathrm{sim}(c_1,c_2) = -\log_2 \frac{|T(c_1) \cup T(c_2)| - |T(c_1) \cap T(c_2)|}{|T(c_1) \cup T(c_2)|} \tag{2.14} \]

Given the full taxonomy \(H^C\) of the concepts (C) of an ontology, \(T(c_i) = \{c_j \in C \mid c_j \text{ is a superconcept of } c_i\} \cup \{c_i\}\) is the union of the ancestors of the concept \(c_i\) and \(c_i\) itself.

2.2.2 Description Logic ELH Semantic Similarity Measure (ELSIM)

In Description Logics (DLs), concept descriptions are built from a set of constructors, a set of concept names CN and a set of role names RN. The set of concept descriptions for the specific DL ELH is denoted by Con(ELH) [22] and can be defined as follows:

\[ C, D \rightarrow A \mid \top \mid C \sqcap D \mid \exists r.C \]

in which \(\top\) denotes the top concept, \(C, D \in \mathrm{Con}(ELH)\), A is a concept name (CN) and r is a role name (RN). In DL, concept names appearing on the left-hand side of a definition are called “defined concept names” (\(CN^{def}\)). All other concept names are called “primitive concept names” (\(CN^{pri}\)). Therefore, \(CN = CN^{pri} \cup CN^{def}\).

The ELSIM measure determines the similarity from the structural characterization of two concepts by constructing their description trees. It first constructs a description tree for each concept, from Top to the evaluated concept, using Algorithm 1 as follows.


Algorithm 1 ELH description tree

Input: P_C and E_C
Output: The description tree T

Function build-tree(P_C, E_C)
    1. Create a new tree T
    2. Create a new vertex v ∈ V
    3. L(v) ← P_C
    4. for each ∃r.C' ∈ E_C do
    5.     build-child-node(v, r, P_C', E_C')
    6. return T

Function build-child-node(v, r, P_C, E_C)
    1. Create a new vertex w ∈ V
    2. L(w) ← P_C
    3. Add a new edge (v, w) to E
    4. ρ(v, w) ← {r}
    5. for each ∃s.C' ∈ E_C do
    6.     build-child-node(w, s, P_C', E_C')

After constructing the description trees, the similarity is computed with the tree homomorphism degree function, defined as follows.

Definition (Homomorphism degree)

Let \(\mathcal{T}_C\) and \(\mathcal{T}_D\) be the ELH description trees that correspond to two ELH concept names C and D, respectively. The homomorphism degree function is inductively defined as follows:

\[ \mathrm{hd}(\mathcal{T}_D, \mathcal{T}_C) = \mu \cdot \text{p-hd}(\mathcal{P}_D, \mathcal{P}_C) + (1 - \mu) \cdot \text{e-set-hd}(\mathcal{E}_D, \mathcal{E}_C) \]

where \(0 \le \mu \le 1\);

\[ \text{p-hd}(\mathcal{P}_C, \mathcal{P}_D) := \begin{cases} 1 & \text{if } \mathcal{P}_C = \emptyset \\ \dfrac{|\mathcal{P}_C \cap \mathcal{P}_D|}{|\mathcal{P}_C|} & \text{otherwise,} \end{cases} \]

where |.| represents the set cardinality;

\[ \text{e-set-hd}(\mathcal{E}_C, \mathcal{E}_D) := \begin{cases} 1 & \text{if } \mathcal{E}_C = \emptyset \\ 0 & \text{if } \mathcal{E}_C \ne \emptyset \text{ and } \mathcal{E}_D = \emptyset \\ \dfrac{\sum_{\epsilon_i \in \mathcal{E}_C} \max\{\text{e-hd}(\epsilon_i, \epsilon_j) : \epsilon_j \in \mathcal{E}_D\}}{|\mathcal{E}_C|} & \text{otherwise,} \end{cases} \]

where \(\epsilon_i, \epsilon_j\) are existential restrictions and

\[ \text{e-hd}(\exists r.X, \exists s.Y) := \gamma \, (\nu + (1 - \nu) \cdot \mathrm{hd}(\mathcal{T}_X, \mathcal{T}_Y)) \]

where \(\gamma = \dfrac{|\mathcal{R}_r \cap \mathcal{R}_s|}{|\mathcal{R}_r|}\) and \(0 \le \nu \le 1\).

The ELH similarity degree between C and D is determined as follows:

\[ \mathrm{sim}(C, D) = \frac{\mathrm{hd}(\mathcal{T}_C, \mathcal{T}_D) + \mathrm{hd}(\mathcal{T}_D, \mathcal{T}_C)}{2} \]

An implementation of this measure is available from http://ict.siit.tu.ac.th. The measure is built on a specific language, the Description Logic ELH, so it applies only to ontologies written in that language.
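As a rough illustration of Section 2.2, the sketch below implements the Wu and Palmer measure (Equation 2.11), the path-based measure of Equation 2.14 and the ELSIM homomorphism degree. It is a sketch under stated assumptions, not the implementation referenced above. The ancestor sets T(c) are passed in directly rather than read from an ontology. The class `DescriptionTree` and the parameter `super_roles` are hypothetical names: a tree node holds its primitive concept names and a list of (role, subtree) edges, and `super_roles` maps a role r to the set taken here as R_r (assumed to be r together with its super-roles, defaulting to {r}). The default values 0.5 for µ and ν are arbitrary.

```python
import math
from dataclasses import dataclass, field


def wu_palmer(n1, n2, n3):
    """Equation 2.11: n1, n2 are is-a steps to the LCS, n3 the depth of the LCS."""
    return 2 * n3 / (n1 + n2 + 2 * n3)


def path_based_sim(t1, t2):
    """Equation 2.14; t1 and t2 are the sets T(c1) and T(c2)."""
    union = t1 | t2
    non_shared = len(union) - len(t1 & t2)
    if non_shared == 0:
        return math.inf  # identical ancestor sets: the logarithm diverges
    return -math.log2(non_shared / len(union))


@dataclass
class DescriptionTree:
    """Hypothetical ELH description tree node."""
    prims: frozenset = frozenset()              # primitive concept names
    edges: list = field(default_factory=list)   # (role, DescriptionTree) pairs


def p_hd(p1, p2):
    return 1.0 if not p1 else len(p1 & p2) / len(p1)


def gamma(r, s, super_roles):
    rr = super_roles.get(r, {r})
    rs = super_roles.get(s, {s})
    return len(rr & rs) / len(rr)


def e_hd(e1, e2, mu, nu, super_roles):
    (r, x), (s, y) = e1, e2
    return gamma(r, s, super_roles) * (nu + (1 - nu) * hd(x, y, mu, nu, super_roles))


def e_set_hd(es1, es2, mu, nu, super_roles):
    if not es1:
        return 1.0
    if not es2:
        return 0.0
    best = (max(e_hd(e1, e2, mu, nu, super_roles) for e2 in es2) for e1 in es1)
    return sum(best) / len(es1)


def hd(t1, t2, mu, nu, super_roles):
    """Homomorphism degree from the first tree to the second."""
    return (mu * p_hd(t1.prims, t2.prims)
            + (1 - mu) * e_set_hd(t1.edges, t2.edges, mu, nu, super_roles))


def elsim(tc, td, mu=0.5, nu=0.5, super_roles=None):
    """Average of the homomorphism degrees in both directions."""
    sr = super_roles or {}
    return (hd(tc, td, mu, nu, sr) + hd(td, tc, mu, nu, sr)) / 2


# Toy trees with made-up concept and role names.
tree_c = DescriptionTree(frozenset({"A", "B"}), [("r", DescriptionTree(frozenset({"X"})))])
tree_d = DescriptionTree(frozenset({"A"}), [("r", DescriptionTree(frozenset({"X", "Y"})))])
print(elsim(tree_c, tree_d))
print(path_based_sim({"root", "a", "c1"}, {"root", "a", "c2"}))
```

With these toy inputs and µ = ν = 0.5, the ELSIM sketch returns 0.84375, and the path-based measure returns 1.0 for two sibling concepts that share two of their four distinct ancestors.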


Chapter 3 Concept Name Similarity Measure on SNOMED CT

The existing similarity measures reviewed in the previous chapter all compute similarity from the structural characterization of the ontology. However, an ontology contains different types of concepts: defined concepts and primitive concepts.

Defined Concepts

Defined concepts are fully defined in the ontology: they have at least one relationship to another concept, and their definitions are sufficient to distinguish them from other concepts.

For example,

“Hypoxia of brain”
Is a = hypoxia
Finding site = brain structure
Sufficiently Defined

“Hypoxia of brain” has an “is-a” relation with “hypoxia” and an “attribute-value” relationship of type “finding site” with another concept, “brain structure”. Therefore, this concept carries specific and complete information that sufficiently distinguishes it from other concepts.

Primitive Concepts

Primitive concepts are only partially defined in the ontology: their definitions do not sufficiently distinguish them from other concepts, and additional information is needed to define them fully.

For example,

“Tumor of dermis”
Is a = navigational concept
Primitive


“Tumor of dermis” has an “is-a” relation with “navigational concept”, but it does not carry complete information about itself. Ontology builders therefore call such concepts “primitive concepts” and redefine them with more complete and specific information from actual medical treatment records. This raises the question of whether existing taxonomic-based similarity measures give correct similarity degrees for all types of ontology concepts. To answer it, we test some pairs of primitive concepts from the SNOMED CT ontology using two existing measures: (1) the path-based measure in Section 2.2.1 (v), which achieved the highest correlation with human expert results among most of the existing taxonomic-based similarity measures, and (2) ELSIM in Section 2.2.2, the Description Logic ELH semantic similarity measure. We then compare the results with those from human experts, as shown in Table 3.1.

Table 3.1 Incorrect similarity degree between primitive concepts using existing two similarity measures

Primitive Concept P1                 | Primitive Concept P2        | Path-based | ELSIM | Human result
Infiltrative lung tuberculosis       | Nodular lung tuberculosis   | 0.2        | 0.0   | 0.7
maternal autoimmune hemolytic anemia | autoimmune hemolytic anemia | 0.2        | 0.0   | 0.8
phakic corneal edema                 | Corneal epithelial edema    | 0.2        | 0.0   | 0.5

According to Table 3.1, the existing measures cannot give the desired similarity degrees for primitive concepts. Consequently, we propose a concept name similarity measure intended mainly for primitive concept similarity on SNOMED CT. Figure 3.1 shows an overview of our system for finding similarity degrees with the proposed measure on the SNOMED CT ontology.


3.1 Concept Name Similarity Measure on SNOMED CT

In SNOMED CT, each ontology concept is uniquely identified by a concept ID (e.g. id=10365005), annotated with a short textual description (e.g. “right main coronary artery thrombosis”) and equipped with a definition in description logic. Moreover, ontology concept names are taken from actual patient health records, so they are very informative and can convey the complete meaning of the concept.

Figure 3.1 Overview system of concept name similarity measure on SNOMED CT (components: SNOMED CT; Proposed Similarity Measure; experiments between three cases: primitive concepts, primitive concepts and defined concepts, defined concepts; Similarity Results)

Figure 3.2 Notion of proposed similarity measure (concept name: semantic similarity based on headword; syntactic similarity based on context-free grammar)


3.1.1 Semantic Similarity (Linguistic Headword Structure)

All concept names are expressed as noun phrases, in which the “headword” holds the core meaning of the phrase and cannot be omitted. Therefore, we give the highest weight to the headword when comparing the similarity of two concept names. In English, the structure of noun phrases can be described by the following cases.

1. Determiner + Pre-modifier + noun (headword)

2. noun (headword) + Post-modifier/complement

3. noun + noun

All of the SNOMED CT concept names follow the first case. Therefore, the rightmost noun is the headword of the concept name. We ran experiments giving different weights to each component of the concept name, following the analysis of noun phrase structure, and concluded that a suitable weight for the headword is 0.6, with 0.4 for the remaining components. To illustrate the calculation, consider the following two concepts:

P1 = “right main coronary artery thrombosis” and

P2 = “superior mesenteric vein thrombosis”.

For concept P1,

• Weight for headword “thrombosis” is 0.6

• Weight for remaining components is 0.4 (0.1 for each remaining component)

First, we give equal weights to the remaining components. Because components nearer to the headword have a higher semantic influence on it [23], nearer components should get higher weights than the others. We therefore take the position of each component into account: the equal weight of each component is divided by its distance from the headword. For the component nearest to the headword, we subtract the sum of the weights of all other remaining components from 0.4, so that the weights of the whole concept name sum to 1. The resulting weights are shown in Tables 3.2 and 3.3.

Table 3.2 Different weights of concept P1

Component             | right         | main          | coronary     | artery                               | thrombosis
Equal weight          | 0.1           | 0.1           | 0.1          | 0.1                                  | 0.6
Distance-based weight | 0.1/4 = 0.025 | 0.1/3 = 0.033 | 0.1/2 = 0.05 | 0.4 − (0.025 + 0.033 + 0.05) = 0.292 | 0.6

Table 3.3 Different weights of concept P2

Component             | superior        | mesenteric      | vein                          | thrombosis
Equal weight          | 0.133           | 0.133           | 0.133                         | 0.6
Distance-based weight | 0.133/3 = 0.044 | 0.133/2 = 0.067 | 0.4 − (0.044 + 0.067) = 0.289 | 0.6

We apply the Jaccard similarity to these weights to obtain the headword similarity, denoted by \(\mathrm{sim}_{Headword}\). In the example, the only shared word is the headword “thrombosis” (weight 0.6), and the denominator sums the weights of all distinct words of P1 and P2:

\[ \mathrm{sim}_{Headword}(P_1,P_2) = \frac{|\mathrm{tset}(P_1) \cap \mathrm{tset}(P_2)|}{|\mathrm{tset}(P_1) \cup \mathrm{tset}(P_2)|} = \frac{0.6}{0.025 + 0.033 + 0.05 + 0.292 + 0.6 + 0.044 + 0.067 + 0.289} = 0.43 \]

There are two points that we need to consider for this semantic similarity.

1. Some words are lexically the same but have different meanings, for example “kidney parenchyma” and “kidney beans”.

• “kidney parenchyma” is human kidney tissue, and “kidney beans” is a kind of bean.

• This case cannot occur in our setting, because we compute similarity within the same category (in the disease category, all concepts are about health, such as illness or sickness).


2. Some words are lexically different but they have the same meaning.

For two examples, illness and sickness.

• To complete this requirement, we used WordNet ontology to calculate the

synsets similarity ! because two concepts are similar if their synsets are

lexically similar [24] as the Equation 3.1.

! (3.1)

A is the synset of concept ! and B is the synset of concept ! .

• For this reason, we apply the synset similarity calculation to only the two

important headwords. If the degree of similarity of two snysets is greater than

0, then the two words are considered to be the same. Otherwise, they are

different.

!

3.1.2 Syntactic Similarity (Context-free Grammar)

According to english noun phrase construction, we can also decide the

similarity from the syntactic structure. In order to know the syntactic structure of

noun phrases, we apply the context-free grammar (CFG) [25]. The grammar G = (T,

N, S, R).

• T is set of terminals • N is set of non-terminals (NP in this case) • S is the starting symbol • R is rules or productions of the form

We create noun phrase rules that cover all types of concept names in

SNOMED CT as listed in the following.

1. NP ! N

2. NP ! N NP

3. NP ! Adj NP

4. NP ! Det NP

Ssynset

Ssynset (P1,P2 ) =| A∩ B || A∪ B |

P1 P2

Sim(P1,P2 ) =1, if Ssynset (P1,P2 ) > 0

0 if Ssynset (P1,P2 ) = 0

⎧⎨⎪

⎩⎪

→

→

→

→

�18Ref. code: 25595822040902FWM


5. NP ! Adv NP

After applying the CFG rules, the parsing orders of $P_1$ and $P_2$ from the previous section are as follows.

• Parsing order of $P_1$: 3-3-3-2-1

• Parsing order of $P_2$: 3-3-2-1

Syntactic similarity is estimated from these parsing orders. The numerator is the number of rules that the two parsing orders have in common, and the denominator is the number of rules in the longer parsing order.

$$sim_{CFG}(P_1, P_2) = \frac{4}{5} = 0.8$$

3.1.3 Proposed Similarity Measure

After obtaining similarity values from the two dimensions, semantic (headword) structure and syntactic structure, we compute the final similarity value by assigning them different weights. If two concepts have exactly the same syntactic structure but different headwords, they have different meanings. The headword structure has a stronger influence on the similarity degree because of the position of the headword, so it determines the similarity more effectively than the syntactic structure. Therefore, we set the weights to 0.7 for the headword structure and 0.3 for the syntactic structure.

$$W_{sim}(P_1, P_2) = a \times sim_{Headword}(P_1, P_2) + b \times sim_{CFG}(P_1, P_2) = 0.7 \times 0.43 + 0.3 \times 0.8 = 0.54$$
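The calculations in Sections 3.1.1 to 3.1.3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the thesis implementation: the function names are invented here, the synsets are small hand-written sets rather than WordNet lookups, and "intersection of rules" is read as a multiset intersection because that reading reproduces the worked value 4/5. The headword similarity 0.43 is taken directly from the worked example.

```python
from collections import Counter


def synset_similarity(a, b):
    """Equation 3.1: |A ∩ B| / |A ∪ B| for two synsets given as sets of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_headword(a, b):
    """Sim(P1, P2): 1 if the synsets overlap at all, otherwise 0."""
    return 1 if synset_similarity(a, b) > 0 else 0


def sim_cfg(order1, order2):
    """Shared rules (multiset intersection) over the length of the longer order."""
    shared = sum((Counter(order1) & Counter(order2)).values())
    longest = max(len(order1), len(order2))
    return shared / longest if longest else 0.0


def weighted_similarity(sim_headword, sim_syntax, a=0.7, b=0.3):
    """W_sim = a * sim_Headword + b * sim_CFG with the weights from the text."""
    return a * sim_headword + b * sim_syntax


# Hand-written toy synsets (assumption: not taken from WordNet).
illness = {"illness", "sickness", "malady"}
sickness = {"sickness", "illness"}
print(same_headword(illness, sickness))           # 1

cfg = sim_cfg([3, 3, 3, 2, 1], [3, 3, 2, 1])      # 4 shared rules out of 5
print(cfg)                                        # 0.8
print(round(weighted_similarity(0.43, cfg), 2))   # 0.54
```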



Chapter 4 Experimental Results and Discussion

The experiment consists of two parts:

(1) a preliminary experiment with the text similarity approaches (Section 2.1), to obtain an overview of the similarity degrees in each category of SNOMED CT, and

(2) the main experiment using the proposed method.

4.1 Preliminary Experiment

For this experiment, we use the DL version of SNOMED CT released in January 2005, which contains 364,461 concept names [26]. There are 18 top-level categories in the ontology. We pick 50 concepts from each category, generate 20,825 concept pairs by considering only distinct pairs (i.e., excluding P1 = P2), and calculate the similarity degrees using six different text similarity measures. For the Levenshtein distance, we apply two different kinds of similarity, Leven1 and Leven2. For the 20,825 pairs, we report the average and maximum value for each category in Table 4.1.
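To make the preliminary setup concrete, the sketch below computes four of the word-level measures on one pair of concept names and enumerates the distinct pairs within a sample of 50 concepts. The formulas are the common textbook definitions and are assumptions here; they are not necessarily the exact variants of Section 2.1. Tversky and the two Levenshtein variants are left out because their parameters are not given in this chapter. All function names are illustrative.

```python
import difflib
import math
from itertools import combinations


def tokens(name):
    """Lower-case a concept name and split it into words."""
    return name.lower().split()


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def simpson(a, b):
    """Overlap coefficient: shared words over the size of the smaller set."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def cosine(a, b):
    """Cosine of binary word-presence vectors."""
    a, b = set(a), set(b)
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def difflib_ratio(a, b):
    """Order-sensitive ratio from Python's difflib, applied to word lists."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def distinct_pairs(items):
    """All unordered pairs of different items (P1 = P2 is never produced)."""
    return list(combinations(items, 2))


p1 = tokens("Acute apical abscess")
p2 = tokens("Chronic apical abscess")
print(jaccard(p1, p2))                    # 0.5
print(round(simpson(p1, p2), 2))          # 0.67
print(round(cosine(p1, p2), 2))           # 0.67
print(round(difflib_ratio(p1, p2), 2))    # 0.67
print(len(distinct_pairs(range(50))))     # 1225 pairs from 50 concepts
```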

According to Table 4.1, about 73% of the 20,825 pairs are totally dissimilar (i.e., have a similarity of zero) under the five unordered measures, based on the average over concepts. The ordered measure, Levenshtein distance, gives a zero value for 443 more pairs than the unordered measures because it also considers word order when comparing. In terms of performance, each method requires 1.54 seconds on average. Computing all distinct pairs among the 364,461 concepts in SNOMED CT would take about 38 days. From the preliminary experiment, we notice that all concept names are noun phrases and that each contains a most important noun, called the "headword", that holds the core meaning of the noun phrase [23]. Therefore, we conduct the main experiment using our proposed measure, as described in the next section.

Table 4.1 Results of similarity degrees for all categories of SNOMED CT based on text similarity measures (each cell is average/maximum)

| Category | Jaccard | Cosine | Tversky | Simpson | Difflib | Leven 1 | Leven 2 | Average |
|---|---|---|---|---|---|---|---|---|
| Body Structure | 0.10/0.71 | 0.16/0.83 | 0.16/0.83 | 0.19/0.83 | 0.16/0.83 | 0.15/0.83 | 0.1/0.71 | 0.15/1.0 |
| Context-dependent | 0.03/0.67 | 0.05/0.82 | 0.05/0.86 | 0.06/1.0 | 0.05/0.8 | 0.06/1.0 | 0.04/1.0 | 0.05/1.0 |
| Environment | 0.05/0.83 | 0.07/0.91 | 0.07/0.91 | 0.08/1.0 | 0.07/0.91 | 0.08/1.0 | 0.06/1.0 | 0.07/1.0 |
| Event | 0.15/0.93 | 0.22/0.97 | 0.21/0.97 | 0.3/1.0 | 0.21/0.94 | 0.3/1.0 | 0.23/1.0 | 0.23/1.0 |
| Finding | 0.03/0.75 | 0.04/0.87 | 0.04/0.92 | 0.05/1.0 | 0.04/0.86 | 0.04/1.0 | 0.03/1.0 | 0.04/1.0 |
| Observable Entity | 0.06/0.83 | 0.1/0.91 | 0.09/0.91 | 0.12/1.0 | 0.09/0.91 | 0.11/1.0 | 0.07/1.0 | 0.09/1.0 |
| Organism | 0.01/0.5 | 0.01/0.71 | 0.01/0.67 | 0.01/1.0 | 0.01/0.67 | 0.01/1.0 | 0.01/1.0 | 0.01/1.0 |
| Physical Force | 0.12/0.8 | 0.18/0.89 | 0.18/0.89 | 0.2/1.0 | 0.18/0.89 | 0.19/1.0 | 0.14/1.0 | 0.17/1.0 |
| Physical Object | 0.25/0.8 | 0.39/0.89 | 0.38/0.89 | 0.46/1.0 | 0.38/0.89 | 0.46/1.0 | 0.31/1.0 | 0.37/1.0 |
| Procedure | 0.13/0.75 | 0.19/0.87 | 0.18/0.86 | 0.2/1.0 | 0.18/0.86 | 0.2/1.0 | 0.14/1.0 | 0.17/1.0 |
| Product | 0.01/0.67 | 0.02/0.82 | 0.02/0.8 | 0.02/1.0 | 0.02/0.8 | 0.02/1.0 | 0.02/1.0 | 0.02/1.0 |
| Qualifier Value | 0.04/0.75 | 0.05/0.87 | 0.05/0.86 | 0.05/1.0 | 0.05/0.86 | 0.05/1.0 | 0.04/1.0 | 0.04/1.0 |
| Social Concept | 0.03/0.8 | 0.04/0.89 | 0.04/0.89 | 0.05/1.0 | 0.04/0.89 | 0.05/1.0 | 0.04/1.0 | 0.04/1.0 |
| Special Concept | 0.03/0.8 | 0.04/0.89 | 0.04/0.89 | 0.04/1.0 | 0.04/0.89 | 0.04/1.0 | 0.03/1.0 | 0.04/1.0 |
| Specimen | 0.29/0.8 | 0.42/0.89 | 0.41/0.89 | 0.47/1.0 | 0.41/0.89 | 0.44/1.0 | 0.32/1.0 | 0.39/1.0 |
| Staging Scale | 0.16/0.8 | 0.24/0.89 | 0.23/0.89 | 0.27/1.0 | 0.23/0.8 | 0.21/1.0 | 0.16/1.0 | 0.21/1.0 |
| Substance | 0.002/0.6 | 0.003/0.8 | 0.003/0.8 | 0.003/0.8 | 0.003/0.8 | 0.003/0.8 | 0.002/0.6 | 0.003/1.0 |
| Concept Average | 0.09/0.93 | 0.13/0.97 | 0.13/0.97 | 0.16/1.0 | 0.13/0.94 | 0.15/1.0 | 0.11/1.0 | 0.13/1.0 |
| Roles | 0.02/0.5 | 0.03/0.71 | 0.03/0.67 | 0.04/1.0 | 0.03/0.67 | 0.03/1.0 | 0.02/1.0 | 0.03/1.0 |


4.2 Main Experiment on SNOMED CT

In SNOMED CT, there are two kinds of concepts: defined concepts and primitive concepts. Therefore, we conduct three types of experiment: between (1) primitive concepts, (2) primitive and defined concepts, and (3) defined concepts. From the SNOMED CT disorder category, we pick 30 pairs of concepts for each type, for a total of 90 pairs.

A usual way to demonstrate the performance of a proposed method is to compare it with existing measures. We chose the path-based measure (Section 2.2.1.5), which obtained the highest correlation among most of the existing similarity measures, and the description logic ELH semantic similarity measure (Section 2.2.2), because SNOMED CT is written in description logic. To compare the proposed method with these two measures, we implemented both of them.

To examine the validity of all measures, we asked five medical doctors to rate the similarity of the 90 pairs of concepts. The doctors reached a consensus on the degree of similarity for each pair, and we calculate the correlation between the results of each measure and the doctors' ratings.

4.2.1 Experiments between Primitive Concepts

The first experiment is between primitive concepts, for which our proposed method is mainly intended. As primitive concepts do not have full relationships or definitions in the ontology hierarchy, we expect our concept name similarity measure, which takes a natural language processing view, to suit them better than the existing taxonomic-based measures. The results for primitive concepts estimated by the path-based measure, ELSIM, our proposed measure and human experts are shown in Table 4.2.


Table 4.2 Results of 30 pairs of concepts between primitive concepts estimated by path-based, ELSIM, our proposed method and human expert

| Primitive Concept P1 | Primitive Concept P2 | Path-based | ELSIM | Proposed measure | Human expert |
|---|---|---|---|---|---|
| Hormonal tumor | Malignant mast cell tumor | 0.2 | 0.0 | 0.5 | 0.6 |
| Maternal autoimmune hemolytic anemia | Autoimmune hemolytic anemia | 0.2 | 0.0 | 0.8 | 0.8 |
| Hypertensive leg ulcer | Solitary anal ulcer | 0.3 | 0.7 | 0.5 | 0.4 |
| Bovine viral diarrhea | Bovine coronoviral diarrhea | 0.6 | 0.6 | 0.7 | 0.7 |
| Acute uterine inflammatory disease | Mycoplasmal pelvic inflammatory disease | 0.4 | 0.2 | 0.9 | 0.9 |
| Primary cutaneous blastomycosis | Primary pulmonary blastomycosis | 0.7 | 0.9 | 0.7 | 0.6 |
| Iodine-deficiency-related multinodular endemic goiter | Non-toxic multi nodular goiter | 0.8 | 0.7 | 0.8 | 0.8 |
| Congenital pharyngeal polyp | Uterine cornual polyp | 0.4 | 0.6 | 0.5 | 0.5 |
| Phakic corneal edema | Corneal epithelial edema | 0.2 | 0.0 | 0.5 | 0.5 |
| Knee pyogenic arthritis | Gonococcal arthritis dermatitis syndrome | 0.9 | 0.8 | 0.4 | 0.4 |
| Hereditary canine spinal muscular atrophy | Spinal cord concussion | 0.5 | 0.7 | 0.3 | 0.5 |
| Mite-borne hemorrhagic fever | Meningococcal cerebrospinal fever | 0.4 | 0.5 | 0.6 | 0.5 |
| Congenital cleft larynx | Congenital spastic foot | 0.6 | 0.8 | 0.3 | 0.3 |
| Congenital acetabular dysplasia | Short rib dysplasia | 0.5 | 0.9 | 0.5 | 0.5 |
| Intestinal polyposis syndrome | Ovarian vein syndrome | 0.6 | 0.8 | 0.6 | 0.5 |
| Extrapulmonary subpleural pulmonary sequestration | Pulmonary alveolar proteinosis | 0.7 | 0.6 | 0.4 | 0.4 |
| Atypical chest pain | Psychogenic back pain | 0.3 | 0.1 | 0.5 | 0.5 |
| Puerperal pelvic cellulitis | Chronic female pelvic cellulitis | 0.9 | 0.7 | 0.8 | 0.7 |
| Spinal cord hypoplasia | Spinal cord rupture | 0.5 | 0.7 | 0.6 | 0.6 |
| Infiltrative lung tuberculosis | Nodular lung tuberculosis | 0.2 | 0.0 | 0.9 | 0.7 |
| Early gastric cancer | Primary vulval cancer | 0.4 | 0.8 | 0.4 | 0.4 |
| Congenital mesocolic hernia | Gangrenous epigastric hernia | 0.2 | 0.0 | 0.5 | 0.4 |
| Congenital nonspherocytic hemolytic anemia | Congenital macular corneal dystrophy | 0.2 | 0.0 | 0.3 | 0.2 |
| Congenital cerebellar cortical atrophy | Congenital renal atrophy | 0.6 | 0.9 | 0.7 | 0.2 |
| Puerperal pyrexia | Heat pyrexia | 0.3 | 0.0 | 0.5 | 0.6 |
| Methylmalonyl-CoA mutase deficiency | Muscle phosphoglycerate mutase deficiency | 0.2 | 0.0 | 0.7 | 0.5 |
| Recurrent mouth ulcers | Multiple gastric ulcers | 0.5 | 0.8 | 0.4 | 0.4 |
| Infantile breast hypertrophy | Sebaceous gland hypertrophy | 0.6 | 0.7 | 0.6 | 0.4 |
| Congenital pyloric hypertrophy | Synovial hypertrophy | 0.3 | 0.0 | 0.5 | 0.3 |
| Inflammatory testicular mass | Inflammatory epidermal nevus | 0.5 | 0.7 | 0.3 | 0.3 |

4.2.2 Experiments between Defined and Primitive Concepts

As the primitive concepts are not fully defined in the ontology, the similarity between primitive and defined concepts is also an interesting point in SNOMED CT. Existing taxonomic-based approaches cannot give the similarity degrees estimated by human experts, because primitive concepts have only partially defined information whereas defined concepts have fully defined information. The similarity degree between two such concepts can therefore be misleadingly low or high because of the incomplete information for the primitive concept. For this reason, taxonomic-based approaches are also not acceptable for the similarity between defined and primitive concepts.

Table 4.3 Results of 30 pairs of concepts between primitive and defined concepts estimated by path-based, ELSIM, our proposed method and human expert

| Primitive Concept P1 | Defined Concept P2 | Path-based | ELSIM | Proposed method | Human expert |
|---|---|---|---|---|---|
| Mosquito-borne hemorrhagic fever | Glandular fever pharyngitis | 0.4 | 0.7 | 0.5 | 0.5 |
| Right main coronary artery thrombosis | Coronary artery rupture | 0.9 | 0.9 | 0.5 | 0.4 |
| Right main coronary artery thrombosis | Superior mesenteric vein thrombosis | 0.7 | 0.9 | 0.5 | 0.5 |
| Infectious mononucleosis hepatitis | Chronic alcoholic hepatitis | 0.2 | 0.0 | 0.5 | 0.5 |
| Cerebral venous sinus thrombosis | Phlebitis cavernous sinus | 1.0 | 0.9 | 0.6 | 0.6 |
| Third degree perineal laceration | Complex periorbital laceration | 0.3 | 0.7 | 0.5 | 0.5 |
| Congenital subaortic stenosis | Rheumatic aortic stenosis | 0.9 | 0.7 | 0.6 | 0.7 |
| Congenital acetabular dysplasia | Aortic valve dysplasia | 0.5 | 0.6 | 0.5 | 0.3 |
| | Fetal cytomegalovirus syndrome | 0.4 | 0.4 | 0.6 | 0.3 |
| Anterior choroidal artery syndrome | Juvenile polyposis syndrome | 0.4 | 0.7 | 0.5 | 0.3 |
| Puerperal pelvic cellulitis | Streptococcal cellulitis | 0.3 | 0.5 | 0.5 | 0.3 |
| Benign hypertensive renal disease | Pulmonary hypertensive venous disease | 0.7 | 0.8 | 0.6 | 0.4 |
| Corneal epithelial edema | Idiopathic corneal edema | 0.1 | 0.0 | 0.8 | 0.6 |
| Chronic sarcoid myopathy | Hereditary hollow viscus myopathy | 0.3 | 0.6 | 0.5 | 0.5 |
| Primary cutaneous blastomycosis | Chronic pulmonary blastomycosis | 0.7 | 0.9 | 0.6 | 0.6 |
| Gingival pregnancy tumor | Granular cell tumor | 0.4 | 0.5 | 0.6 | 0.4 |
| Borderline epithelial tumor | Melanotic malignant nerve sheath tumor | 0.4 | 0.6 | 0.4 | 0.4 |
| Congenital sternomastoid tumor | Malignant mast cell tumor | 0.4 | 0.5 | 0.4 | 0.4 |
| Congenital pharyngeal polyp | Rhinosporidial mucosal polyp | 0.4 | 0.4 | 0.6 | 0.5 |
| Mercurial diuretic poisoning | Lobelia species poisoning | 0.4 | 0.4 | 0.4 | 0.5 |
| Branch macular artery occlusion | Acute mesenteric arterial occlusion | 0.5 | 0.9 | 0.5 | 0.6 |
| Intrarenal hematoma | Stomach hematoma | 0.5 | 0.9 | 0.5 | 0.6 |
| Spinal cord hypoplasia | Spinal cord dysplasia | 0.9 | 0.9 | 0.6 | 0.6 |
| Coronary artery thrombosis | Vertebral artery thrombosis | 0.2 | 0.0 | 0.9 | 0.6 |
| Duodenal papillary stenosis | Congenital bronchial stenosis | 0.5 | 0.6 | 0.6 | 0.4 |
| Arteriovenous fistula stenosis | Subclavian vein stenosis | 0.4 | 0.2 | 0.6 | 0.5 |
| Mechanical hemolytic anemia | Hereditary sideroblastic anemia | 0.7 | 0.7 | 0.6 | 0.5 |
| Malignant catarrhal fever | Malignant lipomatous tumor | 0.7 | 0.2 | 0.3 | 0.3 |
| Bolivian hemorrhagic fever | Dengue hemorrhagic fever | 0.6 | 0.9 | 0.8 | 0.6 |
| Benign brain tumor | Benign neuroendocrine tumor | 0.4 | 0.0 | 0.5 | 0.5 |


4.2.3 Experiments between Defined Concepts

Defined concepts are completely defined in the ontology, but there is no guarantee that the information of every defined concept fully agrees with actual medical treatment records. Therefore, this type of experiment is also of interest, to check whether existing taxonomic-based approaches give the desired similarity degrees for defined concepts. Table 4.4 shows the results between defined concepts computed by the path-based measure, ELSIM, our proposed method and human experts.

Table 4.4 Results of 30 pairs of concepts between defined concepts estimated by path-based, ELSIM, our proposed method and human expert

| Defined Concept P1 | Defined Concept P2 | Path-based | ELSIM | Proposed method | Human expert |
|---|---|---|---|---|---|
| Rheumatic heart valve stenosis | Coronary artery stenosis | 0.6 | 0.8 | 0.5 | 0.6 |
| Nasal septal hematoma | Vocal cord hematoma | 0.3 | 0.9 | 0.5 | 0.5 |
| Simple periorbital laceration | Brain stem laceration | 0.5 | 0.9 | 0.4 | 0.5 |
| Peritonsillar cellulitis | Dentoalveolar cellulitis | 0.5 | 0.9 | 0.6 | 0.5 |
| Parainfluenza virus laryngotracheitis | Acute viral laryngotracheitis | 1.0 | 0.9 | 0.4 | 0.6 |
| Bone marrow hyperplasia | Retromolar gingival hyperplasia | 0.8 | 0.8 | 0.5 | 0.4 |
| Chronic proctocolitis | Chronic viral hepatitis | 0.5 | 0.8 | 0.4 | 0.3 |
| Obstructive biliary cirrhosis | Syphilitic portal cirrhosis | 0.6 | 0.8 | 0.5 | 0.6 |
| Peripheral T-cell lymphoma | Primary cerebral lymphoma | 0.5 | 0.8 | 0.5 | 0.6 |
| Mast cell leukemia | Prolymphocytic leukemia | 0.9 | 0.9 | 0.4 | 0.5 |
| Tricuspid valve regurgitation | Rheumatic mitral regurgitation | 0.9 | 0.9 | 0.5 | 0.6 |
| Gangrenous paraesophageal hernia | Congenital bladder hernia | 0.5 | 0.8 | 0.6 | 0.5 |
| Congenital mandibular hyperplasia | Atypical endometrial hyperplasia | 0.3 | 0.6 | 0.5 | 0.5 |
| Tuberculous adenitis | Acute mesenteric adenitis | 0.7 | 0.6 | 0.5 | 0.4 |
| Congenital skeletal dysplasia | Aortic valve dysplasia | 0.4 | 0.7 | 0.5 | 0.4 |
| Histiocytic sarcoma | Alveolar soft part sarcoma | 1.0 | 0.9 | 0.5 | 0.6 |
| Drug-induced ulceration | Amebic perianal ulceration | 0.9 | 0.6 | 0.5 | 0.5 |
| Cervical radiculitis | Cervical lymphadenitis | 0.4 | 0.8 | 0.7 | 0.6 |
| Basilar artery embolism | Obstetric pulmonary embolism | 0.6 | 0.7 | 0.5 | 0.6 |
| Acute apical abscess | Chronic apical abscess | 0.5 | 1.0 | 0.9 | 0.7 |
| Acute glossitis | Chronic glossitis | 0.4 | 0.4 | 0.5 | 0.7 |
| Acute bronchitis | Acute purulent meningitis | 0.3 | 0.6 | 0.4 | 0.4 |
| Acute lower gastrointestinal hemorrhage | Stromal corneal hemorrhage | 0.4 | 0.8 | 0.4 | 0.4 |
| Epidural hemorrhage | Tracheostomy hemorrhage | 0.6 | 0.7 | 0.5 | 0.4 |
| Thallium sulfate toxicity | Ammonium sulfamate toxicity | 1.0 | 0.6 | 0.6 | 0.5 |
| Simple periorbital laceration | Complex periorbital laceration | 0.8 | 1.0 | 0.8 | 0.6 |
| Biceps femoris tendinitis | Profunda femoris artery thrombosis | 0.5 | 0.8 | 0.2 | 0.4 |
| Hyperplastic thrush | Hyperplastic gingivitis | 0.3 | 0.5 | 0.7 | 0.5 |
| Acer rubrum poisoning | Penicillium rubrum toxicosis | 0.5 | 0.6 | 0.6 | 0.6 |
| Acute vesicular dermatitis | Herpesviral vesicular dermatitis | 0.4 | 0.5 | 0.9 | 0.6 |


After obtaining the similarity results of all approaches for the three cases, we calculate the correlation and error values against the human expert results, as shown in Tables 4.5 and 4.6.

Table 4.5 Correlation values between similarity measures and human expert for each case

| Method | Method Type | Primitive Concepts | Primitive and Defined Concepts | Defined Concepts |
|---|---|---|---|---|
| Path-based | ontology-based | 0.04 | 0.2 | 0.2 |
| ELSIM | ontology-based | -0.19 | 0.18 | 0.03 |
| Proposed measure | concept name | 0.74 | 0.5 | 0.51 |

Table 4.6 Error values between similarity measures and human expert for each case

| Method | Method Type | Primitive Concepts | Primitive and Defined Concepts | Defined Concepts |
|---|---|---|---|---|
| Path-based | ontology-based | 0.1 | 0.1 | 0.1 |
| ELSIM | ontology-based | 0.2 | 0.1 | 0.1 |
| Proposed measure | concept name | 0.02 | 0.02 | 0.02 |

4.5 Discussion

According to the correlation values in Table 4.5, existing taxonomic-based similarity approaches cannot give the desired similarity results for the ontology concepts. For primitive concept similarity, the first existing approach (path-based) obtains a very low correlation (0.04), so its results have almost no linear relationship with the human expert results. The second existing approach (ELSIM) obtains a negative correlation (-0.19), which means that its results tend to move in the opposite direction to the human results: when its result is high, the human result tends to be low, and vice versa. Therefore, similarity measures based on taxonomical paths are not acceptable for primitive concept similarity. Our proposed method obtains the highest correlation (0.74) and the lowest error (0.02), so it outperforms the existing approaches for primitive concept similarity. Moreover, our proposed method obtains the highest correlations in the other two cases (between primitive and defined concepts, and between defined concepts). As a consequence, we learned two important points: first, even defined concepts in the ontology need more complete information from actual medical treatment records; second, although our proposed measure is mainly intended for primitive concepts, it gives the similarity degrees closest to the human expert results in all three cases.
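The primitive-concept figures quoted in the discussion can be checked against the 30 rows of Table 4.2. This section does not name the correlation coefficient or the error measure, so the sketch below assumes Pearson correlation and mean squared error; both choices and the function names are assumptions. Under these assumptions the proposed measure gives a correlation of about 0.74 and an error of about 0.02.

```python
import math

# Proposed-measure and human-expert columns of Table 4.2, in table order.
PROPOSED = [0.5, 0.8, 0.5, 0.7, 0.9, 0.7, 0.8, 0.5, 0.5, 0.4,
            0.3, 0.6, 0.3, 0.5, 0.6, 0.4, 0.5, 0.8, 0.6, 0.9,
            0.4, 0.5, 0.3, 0.7, 0.5, 0.7, 0.4, 0.6, 0.5, 0.3]
HUMAN = [0.6, 0.8, 0.4, 0.7, 0.9, 0.6, 0.8, 0.5, 0.5, 0.4,
         0.5, 0.5, 0.3, 0.5, 0.5, 0.4, 0.5, 0.7, 0.6, 0.7,
         0.4, 0.4, 0.2, 0.2, 0.6, 0.5, 0.4, 0.4, 0.3, 0.3]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_squared_error(xs, ys):
    """Average squared difference (assumed error measure, not confirmed by the text)."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


print(round(pearson(PROPOSED, HUMAN), 2))             # 0.74
print(round(mean_squared_error(PROPOSED, HUMAN), 2))  # 0.02
```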

4.5.1 Limitations

According to the analysis of the results, our proposed measure has some limitations. Although the headword holds the core meaning of each concept, the headword in some concept names has a general meaning (e.g., syndrome, atrophy). This is the only case in which relying on the importance of the headword cannot effectively distinguish the similarity degree, as shown in Table 4.7.

Table 4.7 Different similarity results between concepts using our proposed measure with human expert results

| Concept P1 | Concept P2 | Proposed method | Human expert |
|---|---|---|---|
| | Fetal cytomegalovirus syndrome | 0.6 | 0.3 |
| Congenital cerebellar cortical atrophy | Congenital renal atrophy | 0.7 | 0.2 |

Even when the headword has a common meaning (e.g., disease, fever) in some concepts, our proposed method can still obtain similarity degrees close to the human expert results, because our approach assigns the second highest weight to the component nearest to the headword. Therefore, the similarity results of our proposed method do not differ greatly from the human results even if the headword has a general meaning, as shown in Table 4.8.

Table 4.8 Similarity degrees between concepts using our proposed measure with human expert results

| Concept P1 | Concept P2 | Proposed method | Human expert |
|---|---|---|---|
| Acute uterine inflammatory disease | Mycoplasmal pelvic inflammatory disease | 0.9 | 0.9 |
| Bolivian hemorrhagic fever | Dengue hemorrhagic fever | 0.8 | 0.6 |
| Acute apical abscess | Chronic apical abscess | 0.9 | 0.7 |


Chapter 5 Conclusions and Recommendations

Measuring semantic similarity between ontology concepts is an important research area, with applications such as the structuring of textual resources. In the biomedical domain, determining the similarity degrees between ontology disease concepts in order to recommend similar or alternative treatments is a very important research area for health decision support systems. The basic way to determine the similarity of ontology concepts depends on the taxonomical paths, but there are different types of ontology concepts, and most of the concepts need to be redefined with their complete information. Therefore, existing taxonomic-based similarity measures could not give similarity degrees that agree with the human experts.

In this thesis, we proposed a new concept name similarity measure based on ontology concept labels that effectively captures syntactic and semantic information for similarity measurement. Moreover, we conducted three different experiments to find the similarity between (1) primitive concepts, (2) primitive concepts and defined concepts, and (3) defined concepts, and then calculated the correlation values as well as the error values to demonstrate the utility of our proposed measure. Furthermore, we reviewed the existing ontology-based similarity measures on the SNOMED CT medical ontology and pointed out the limitations and weaknesses of these measures based on the three experiments.

In conclusion, the experiments show that our proposed measure surpasses existing taxonomic-based measures for all types of ontology concepts. In the future, we will apply our proposed measure to other medical ontologies, such as the MeSH ontology, to assess its advantages.


References

1. Resnik, P. (1999). Semantic Similarity in a Taxonomy: An Information-based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11, 95-130.
2. Hliaoutakis, A., Varelas, G., Voutsakis, E., Petrakis, E. G. M., & Milios, E. (2006). Information Retrieval by Semantic Similarity. International Journal of Semantic Web Informatics Systems, 55-73.
3. Abdelrahman, A. M. B., & Kayed, A. (2015). A Survey on Semantic Similarity Measures between Concepts in Health Domain. American Journal of Computational Mathematics, 5, 204-214.
4. WordNet database, http://wordnet.princeton.edu.
5. UMLS Terminology Services, https://uts.nlm.nih.gov/home.html.
6. SNOMED CT: Systematized Nomenclature of Medicine - Clinical Terminology. http://www.snomed.org/snomedct/index.html.
7. Zhang, M., Patrick, J., Truran, D., & Innes, K. Deriving a SNOMED CT Data Model.
8. Medical Subject Headings (MeSH), National Library of Medicine, http://www.nlm.nih.gov/mesh.
9. Niwattanakul, S., Singthongchai, J., Naenudorn, E., & Wanapu, S. (2013). Using of Jaccard Coefficient for Keywords Similarity. In Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS), 1, Hong Kong.
10. Sree, K. P. N. V. S., & Murthy, J. V. R. (2012). Clustering Based on Cosine Similarity Measure. International Journal of Engineering Science and Advanced Technology (IJESAT), 2(3), 508-512.
11. Choi, J., Oh, T., & Kweon, I. S. (2016). Human Attention Estimation for Natural Images: An Automatic Gaze Refinement Approach. Korea Advanced Institute of Science and Technology (KAIST), Jan.
12. Jimenez, S., Becerra, C., & Gelbukh, A. (2013). Softcardinality-core: Improving Text Overlap with Distributional Measures for Semantic Textual Similarity. Second Joint Conference on Lexical and Computational Semantics, 1, 194-201, Atlanta, Georgia.
13. Wolk, K., & Marasek, K. (2014). A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation. In Proceedings of New Perspectives in Information Systems and Technologies, 1, 229-237, Springer, Switzerland.
14. McCallum, A. (2006). String Edit Distance (and intro to dynamic programming). Computational Linguistics, Spring.
15. Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1), 17-30.
16. Zare, M., Pahl, C., Nilashi, M., Salim, N., & Ibrahim, O. (2015). A Review of Semantic Similarity Measures in Biomedical Domain Using SNOMED CT. Soft Computing and Decision Support Systems, 2(6), 1-13.
17. Pedersen, T., Pakhomov, S. V. S., Patwardhan, S., & Chute, C. G. (2006). Measures of Semantic Similarity and Relatedness in the Biomedical Domain. Journal of Biomedical Informatics, 40, 288-299.
18. Garla, V. N., & Brandt, C. (2012). Semantic Similarity in the Biomedical Domain: An Evaluation across Knowledge Sources. Journal of BMC Bioinformatics, October.
19. Choi, I., & Kim, M. (2003). Topic Distillation using Hierarchy Concept Tree. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 371-371, Toronto, Canada.
20. Mubaid, H. A., & Nguyen, H. (2006). A Cluster-based Approach for Semantic Similarity in the Biomedical Domain. Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA.
21. Batet, M., Sanchez, D., & Valls, A. (2011). An Ontology-based Measure to Compute Semantic Similarity in Biomedicine. Journal of Biomedical Informatics, 44, 118-125.
22. Tongphu, S., & Suntisrivaraporn, B. (2015). Algorithms for Measuring Similarity Between ELH Concept Descriptions: A Case Study on SNOMED CT. Journal of Computing and Informatics, 20.
23. Lieberman, M., & Sproat, R. (1992). The Stress and Structure of Modified Noun Phrases in English. Stanford University.
24. Petrakis, E. G. M., Varelas, G., Hliaoutakis, A., & Raftopoulou, P. (2006). X-Similarity: Computing Semantic Similarity between Concepts from Different Ontologies. Journal of Digital Information Management, 4(4).
25. Ko, S., Han, Y., & Salomaa, K. (2016). Approximate Matching between a Context-free Grammar and a Finite-state Automaton. Information and Computation, 278-289.
26. IHTSDO. SNOMED Licensing, International Health Terminology Standards Development Organization, http://www.ihtsdo.org/licensing.


Appendix
List of Publications

A.1 International Conference

1. Htun, H. H., Sornlertlamvanich, V., & Suntisrivaraporn, B. (2016). Towards Automatic Generation of “Preference Profile” for Primitive Concept Similarity Measures on SNOMED CT. In the Eleventh International Conference on Knowledge, Information and Creativity Support Systems, Yogyakarta, Indonesia, 194-199.
2. Htun, H. H., & Sornlertlamvanich, V. (2017). Text Similarity Approach for SNOMED CT Primitive Concept Similarity Measure. In the Eighth International Conference on Information and Communication Technology for Embedded Systems (ICICTES), Thailand.
3. Htun, H. H., & Sornlertlamvanich, V. (2017). SNOMED CT Primitive Concept Similarity Measure by Concept Name Text Similarity Approach. In the 27th International Conference on Information Modeling and Knowledge Bases (EJC), Krabi, Thailand.
