View
4
Download
0
Category
Preview:
Citation preview
SETIT 2009
5th International Conference: Sciences of Electronic,
Technologies of Information and Telecommunications March 22-26, 2009 – TUNISIA
- 1 -
Semantic Search of Medical Resources on the Base of
Meta-information and Semantic Annotation
Olfa DRIDI and Mohamed BEN AHMED
RIADI Laboratory, National School of Computer Sciences, Tunis, Tunisia
Olfa.dridi@riadi.rnu.tn
Mohamed.beanahmed@riadi.rnu.tn
Abstract: Semantic search has been one of the motivations of the Semantic Web since it was envisioned. In my thesis, I
research the development of a new retrieval model for the exploitation of knowledge represented in ontologies to
improve search over large document repositories. In my proposed approach I consider an adaptation of the classic
model of Information Retrieval (IR), including a meta-information process, and a semantic annotation of document. In
this approach, semantic search is based on meta-information and semantic annotation in order to index document
semantically. The method has been tested on medical corpora, showing promising results respect to keyword-based
search, and providing ground for further analysis and research.
Key words: annotation, medical corpus, meta-information, ontology, semantic search.
INTRODUCTION
Semantic search has been one of the major
envisioned benefits of the Semantic Web since its
emergence in the late 90’s. One way to view a
semantic search engine is as a tool that gets formal
ontology-based queries (e.g. in RDQL, RQL,
SPARQL, etc.) from a client, executes them against a
knowledge base, and returns tuples of ontology values
that satisfy the query [SMM 83]. These techniques
typically use boolean search models, based on an ideal
view of the information space as consisting of non-
ambiguous, non-redundant, formal pieces of
ontological knowledge. A knowledge item is either a
correct or an incorrect answer to a given information
request, thus search results are assumed to be always
100% precise, and there is no notion of approximate
answer to an information need. While this conception
of semantic search brings key advantages already, our
work aims at taking a step beyond. In the view
proposed in this thesis for Information Retrieval in the
Semantic Web, a search engine returns documents,
rather than (or in addition to) exact values, in response
to user queries. Furthermore, as a fundamental
requirement for scaling up to massive information
sources, the engine should rank the documents,
according to concept-based relevance criteria.
A purely boolean ontology-based retrieval model
makes sense when the whole information corpus can
be fully represented as an ontology-driven knowledge
base. But there are well-known limits to the extent to
which knowledge can be formalized this way. First,
because converting the huge amount of information
currently available into formal ontological knowledge
at an affordable cost is currently an unsolved problem
in general. Second, documents hold a value of their
own, and are not equivalent to the sum of their pieces.
Third, wherever ontology values carry free text,
boolean semantic search systems do a full-text search
within the string values. If the values hold long pieces
of text, a form of keyword-based search is taking
place in practice beneath the ontology-based query
model, whereby the “perfect match” assumption starts
to become arguable. If no clear ranking criteria are
supplied, the search system may become useless if the
search space is too big.
The goal of my thesis is the development of an
ontology-based information retrieval model meant for
the exploitation of full-fledged domain ontologies and
knowledge bases, to support semantic search in
document repositories [VFC 05]. In contrast to
boolean semantic search systems, in the proposed
perspective full documents, rather than specific
ontology values from a KB, are returned in response
to user information needs. To cope with large-scale
information sources, an adaptation of the classic
vector-space model [SMM 83] is proposed, suitable
for an ontology-based representation, upon which a
ranking algorithm is defined.
SETIT2009
- 2 -
1. Ontology and information retrieval
Every domain has phenomena that people allocate
as conceptual or physical objects, connections and
situations. With the help of various language
mechanisms such phenomena contacts to the certain
descriptors (for example, names, and noun phrases).
For the successful solution of an informational
retrieval task it is necessary to present user knowledge
about domain of her/his interests in some form
suitable for computer processing. The specifications of
high-level domain are formed by integration of the
domain structures of low-level domains. It is
important to achieve an interoperability of domain
knowledge representation. Ontological approach is an
appropriate tool for solution of this task. Ontology is
an agreement about common use of concepts that
contains means of representation of subject knowledge
and agreements on methods of reasons. It can be
considered as the certain description of the views on
the world in some specific sphere of interests.
Ontology consists of: 1) a set of the terms; 2) a set of
rules of their use that limit their meanings in the
context of concrete domain [CBA 04].
The ontology is knowledge base of a special kind
with the semantic information about some domain. It
is a set of definitions in some formal language of
declarative knowledge fragment focused on joint
repeated use by the various users in the applications.
Ontological commitments are the agreements
aimed at coordination and consistent use of the
common dictionary. The agents (human beings or
software agents) that jointly use the dictionary do not
feel necessity of common) knowledge base: one agent
can know something that don't know the other ones,
and the agent that handles the ontology is not required
the answers to all questions that can be formulated
with the help of the common dictionary.
Every domain with the certain subject of research
has its own terminology, original dictionary used for
discussion of typical objects and processes of this
domain. The library, for example, involves the
dictionary relating to the books, references,
bibliographies, magazines etc. Thus, pattern of domain
is discovered by its dictionary - the set of words that
are used in this domain. Clearly, however, that the
specificity of domain is shown not only in the
appropriate dictionary. Besides, it is necessary: (i) to
provide strict definitions of grammar managing of
combining the dictionary terms into the statements,
and (ii) to clear logic connections between such
statements. Only when this additional information is
accessible, it is possible to understand both nature of
domain objects and important relations established
between them. Ontology - structured representation of
this information [GMM 03].
The formal model of domain ontology O is an
ordered triple O = < X, R, F >, where Х - finite set of
subject domain concepts that represents ontology O; R
- finite set of the relations between concepts of the
given subject domain; F - finite set of interpretation
functions of given on concepts and relations of
ontology O [FVC 06].
2. Related Work
The view of the semantic retrieval problem in this
thesis is very close to the proposals in KIM [KPT 04]
[PKO 04]. While KIM focuses on automatic population
and annotation of documents, my work focuses on the
ranking algorithms for semantic search. Along with
TAP, KIM is one of the most complete proposals
reported to date, to my knowledge, for building high-
quality KBs, and automatically annotating document
collections at a large scale. The proposed work
complements KIM and TAP with a ranking algorithm
specifically designed for an ontology-based retrieval
model, using a semantic indexing scheme based on
annotation weighting techniques.
Semantic Portals [MSS 03] typically provide
simple search functionalities that may be better
characterized as semantic data retrieval, rather than
semantic information retrieval. Searches return
ontology instances rather than documents, and no
ranking method is provided. In some systems, links to
documents that reference the instances are added in
the user interface, next to each returned instance in the
query answer [CVM 05], but neither the instances, nor
the documents, are ranked.
The ranking problem has been taken up again in
[SSS 03], and more recently [RSD 04]. Whereas both
of these works are concerned with ranking query
answers (i.e. ontology instances), my work is
concerned with ranking the documents annotated with
these answers. Since my respective techniques are
applied in consecutive phases of the retrieval process,
it would be interesting to experiment the integration of
the query result relevance function proposed by
Stojanovic et al into the document relevance measures
under definition in the thesis.
Finally, the thesis shares with Mayfield and Finin
[MFT 03] the idea that semantic search should be a
complement of keyword-based search as long as not
enough ontologies and metadata are available. Also, I
believe that inferencing is a useful tool to fill
knowledge gaps and missing information (e.g.
transitivity of the located in relationship over
geographical locations).
3. Proposed Approach
In the proposed view of semantic information
retrieval, I assume a knowledge base has been built
and associated to the information sources (the
document base), by using one or several domain
ontologies that describe concepts appearing in the
document text. The implemented system can work
with any arbitrary domain ontology with essentially no
restrictions, except for some minimal requirements,
which basically consist of conforming to a set of root
ontology classes: Concept should be the root of all
domain classes that can be used (directly or after
subclassing) to create instances that describe specific
SETIT2009
- 3 -
entities referred to in the documents.
Document is used to create instances that act as
proxies of documents from the information source to
be searched upon. Taxonomy is the root for class
hierarchies that are merely used as classification
schemes, and are never instantiated. The concepts and
instances in the Knowledge Base (KB) are linked to
the documents by means of explicit, non-embedded
annotations to the documents.
While I do not address here the problem of
knowledge extraction from text [CVM 05] [KPT 04]
[PKO 04], I use a tool to aid in the semi-automatic
annotation of documents.
The annotations are used by the information
retrieval system, to index semantically resources.
In the classic vector-space model, keywords
appearing in a document are assigned weights
reflecting that some words are better at discriminating
between documents than others. Similarly, in this
system, annotations are assigned a weight that reflects
how important the instance is considered to be for the
document meaning. Weights are computed
automatically by an adaptation of the TF-IDF
algorithm [SMM 83], based on the frequency of
concepts of the instances in each document. So, we
propose CF-IDF to compute weights of each concept.
Figure 1 : general schema of our proposed
approach
The approach proposes ontology-based
information retrieval. It can be seen as an evolution of
classic keyword-based retrieval techniques, where the
keyword-based index is replaced by a semantic
knowledge. The overall retrieval process is illustrated
in Figure. 1. The system takes as input a Natural
language query. This query can be represented by a set
of concepts.
The query is executed against the semantic
annotated corpus, which returns a list of pertinent
documents that satisfy the query and user’s profile.
Finally, the documents that are annotated with these
concepts are retrieved, ranked.
3.1. Corpus construction
We started by constructing my resource collection
from the web. We download documents from these
web sites:
3.2. Meta-information generation
We define meta-information as information about
information.
Meta-information generation is the act for creating
or producing information. Generating good quality
meta-information in an efficient manner is essential
for organizing and making accessible the growing
number of rich resources available on the web and in
corpora.
3.3. Semantic annotation
'Annotation', in contemporary English, according
to WordNet, has two meanings:
� note, annotation, notation: a comment
(usually added to a text);
� annotation, annotating -- the act of adding
notes.
In linguistics (and particularly in computational
linguistics) an annotation is considered a formal note
added to a specific part of the text. There are number
of alternative approaches regarding the organization,
structuring, and preservation of annotations. For
instance, all the markup languages (HTML, SGML,
XML, etc.) can be considered schemata for embedded
or in-line annotation.
Semantic annotation is information about what
entities (or, more generally, semantic features) appear
in a text and where they do. Formally, semantic
annotations represented a specific sort of metadata,
which provides references to concepts in ontologies.
4. Development
4.1. Our semantic space
To realize our approach, we propose two kinds of
ontologies. The first ontology, is called meta-
information ontology, and describes information
related to our domain. The second ontology, called
breast cancer ontology, describes concepts and their
relationship from medical domain.
We have used protégé-2000, as tool to construct
our ontologies.
4.1.1. Breast cancer ontology Our aim in developing the Breast Cancer ontology
is not to provide a perfect, generic ontology which
encompasses all of breast cancer. Instead, I have based
SETIT2009
- 4 -
it on relevant literature (article, course …).
We can’t represent this ontology in a small figure,
because it contains a lot of concepts.
4.1.2. Meta-information ontology In this ontology, we present information which we
consider pertinent in order to describe profile of users.
These informations are showed in Figure 2: title,
author, laboratory, school, description of resource,
key-words, speciality, public concerned, format of
resource, size of resource, subject, date, language …
Figure 2: meta-information ontology
4.2. GATE
We have used GATE, which is a large-scale
infrastructure for natural language processing
applications. Linguistic data associated with language
resources such as documents and corpora is encoded
in the form of annotations. GATE supports a variety of
formats including XML, RTF, HTML, SGML, email
and plain text. In all cases, when a document is
created/opened in GATE, the format is analysed and
converted into a single unified model.
Provided with GATE is a set of reusable
processing resources for common NLP tasks. (None of
them are definitive, and the user can replace and/or
extend them as necessary.) These are packaged
together to form ANNIE, A Nearly- New IE system,
but can also be used individually or coupled together
with new modules in order to create new applications.
For example, many other NLP tasks might require a
sentence splitter and POS tagger, but would not
necessarily require resources more specific to IE tasks
such as a named entity transducer. The system is in
use for a variety of IE and other tasks, sometimes in
combination with other sets of application-specific
modules.
ANNIE consists of the following main processing
resources: tokeniser, sentence splitter, POS tagger,
gazetteer, finite state transducer (based on GATE’s
built-in regular expressions over annotations
language), orthomatcher and coreference resolver .
The resources communicate via GATE’s
annotation API, which is a directed graph of arcs
bearing arbitrary feature/value data, and nodes rooting
this data into document content.
We used these processing resources in order to
extract document information which related to
concepts in ontologies.
These processing resources are represented in
Figure 3:
The tokeniser splits text into simple tokens, such
as numbers, punctuation, symbols, and words of
different types (e.g. with an initial capital, all upper
case, etc.). The aim is to limit the work of the
tokeniser to maximise efficiency, and enable greater
flexibility by placing the burden of analysis on the
grammars. This means that the tokeniser does not need
to be modified for different applications or text types.
The sentence splitter is a cascade of finite state
transducers which segments the text into sentences.
This module is required for the tagger. Both the
splitter and tagger are domain and application-
independent.
The tagger is a modified version of the Brill
tagger, which produces a part-of-speech tag as an
annotation on each word or symbol. Neither the
splitter nor the tagger are a mandatory part of the NE
system, but the annotations they produce can be used
by the grammar, in order to increase its power and
coverage.
personne
auteur
lecteur
laboratoire université
affecté à
appartient payssitué en
écrit avec
document
date de publication
possède
sujet
conserne
travaille sur
article rapport livre
littérature
publié dans
revueconférence
ouvrage
niveau de formation
type de formation
type
format possède
concept-clé
langage
possède
titre
formation initiale
possède
formation contenue
formation spécialisée
possède
étudiant
public cible
distiné pour
date de formation
document texte
document multimédia
rédigé par
durée
étudiant chercheur
médecin spécialiste
homme du monde
médecin généraliste
cours
SETIT2009
- 5 -
Figure 3 : different steps of ANNIE
The gazetteer consists of lists such as cities,
organisations, days of the week, etc. It not only
consists of entities, but also of names of useful
indicators, such as typical company designators (e.g.
‘Ltd.’), titles, etc. The gazetteer lists are compiled into
finite state machines, which can match text tokens.
The semantic tagger consists of handcrafted rules
written in the JAPE (Java Annotations Pattern Engine)
language, which describe patterns to match and
annotations to be created as a result. JAPE is a version
of CPSL (Common Pattern Specification Language),
which provides finite state transduction over
annotations based on regular expressions. A JAPE
grammar consists of a set of phases, each of which
consists of a set of pattern/action rules, and which run
sequentially. Patterns can be specified by describing a
specific text string, or annotations previously created
by modules such as the tokeniser, gazetteer, or
document format analysis.
The orthomatcher is another optional module for
the IE system. Its primary objective is to perform co-
reference, or entity tracking, by recognising relations
between entities. It also has a secondary role in
improving named entity recognition by assigning
annotations to previously unclassified names, based on
relations with existing entities.
The coreferencer finds identity relations between
entities in the text.
4.3. Semantic annotation.
We have annotated document by the use of GATE.
So, we can open our breast cancer ontology in GATE,
and we can annotate documents with concepts from
ontologies.
In a nutshell, Semantic Annotation is about
assigning to the entities in the text links to their
semantic descriptions (ontologies). This sort of
metadata provides both class and instance information
about the entities.
In Figure 4, we show step of annotating document
in GATE by using our Breast cancer ontology.
4.4. Semantic indexing and search
Meta-information generated and semantic
annotations provide semantic indexing and search.
This type of indexing enables new (semantically
enhanced) access methods. Thus the user could
specify queries, which consist of constraints,
regarding the types of entities, relations between the
entities, and entity attributes. E.g. one could specify
the NEs that are to be referred to in the documents of
interest, with name restrictions (e.g. a Person which
name ends with ‘Alabama’). An example of a query
consisting of pattern restrictions over entities could be:
“give me all documents referring to a breast cancer”.
To answer the query, our system applies the semantic
restrictions over the entities in the instance base. The
resulting set of document contains, for example, the
synonyms of “breast cancer” like mastectomy.
Figure 4 : semantic annotation with GATE
5. Conclusion and Future Work
This paper presented the notion of semantic search,
a new model allowing ontology-based annotation,
indexing, and retrieval.
The evaluation work that has been done until now
does not provide enough empirical justification about
the feasibility of the approach, technology, and
resources being used.
The challenges towards the general approach can
be summarized as follows:
� Develop (or adapt) an evaluation metric,
which properly measures the performance
of a semantic annotation system;
� Evaluation of the semantic IR against a
traditional IR engine, so as to formally
measure the positive effect of semantic
indexing (cutting out the irrelevant
results, because of the semantic
restrictions; and retrieving even more
correct results – e.g. when an entity is
mentioned with another alias, but is still
indexed by its unique identifier).
SETIT2009
- 6 -
REFERENCES
[CBA 04] Contreras, J., Benjamins, V. R., et al: A
Semantic Portal for the International Affairs
Sector. 14th International Conference on
Knowledge Engineering and Knowledge
Management (EKAW 2004). LNCS Vol. 3257
(2004) 203-215
[CVM 05] Castells, P., Fernández, M., Vallet, D.,
Mylonas, P., Avrithis, Y.: Self-Tuning
Personalized Information Retrieval in an
Ontology-Based Framework. 1st IFIP
International Workshop on Web Semantics
(SWWS 2005). LNCS Vol. 3532 (2005) 455-
470
[FVC 06] M. Fernández, D. Vallet, P. Castells.
Probabilistic Score Normalization for Rank
Aggregation. 28th European Conference on
Information Retrieval (ECIR 2006). London,
April 2006. Springer Verlag Lecture Notes in
Computer Science, Vol. 3936, pp. 553-556.
[GMM 03] Guha, R. V., McCool, R., and Miller, E.:
Semantic search. 12th International World
Wide Web Conference (WWW 2003).
Budapest, Hungary (2003) 700-709
[KPT 04] Kiryakov, A., Popov, B., Terziev, I., Manov,
Ognyanoff, D.: Semantic Annotation,
Indexing, and Retrieval. Journal of Web
Sematics 2:1 (2004) 49-79
[MFT 03] Mayfield, J., Finin, T.: Information retrieval
on the Semantic Web: Integrating inference
and retrieval. Workshop on the Semantic Web
at the 26th International ACM SIGIR
Conference on Research and Development in
Information Retrieval (SIGIR 2003). Toronto,
Canada (2003)
[MSS 03] Maedche, A., Staab, S., Stojanovic, N., Studer,
R., Sure, Y.: SEmantic portAL: The SEAL
Approach. In: Fensel, D., Hendler, J. A.,
Lieberman, H., Wahlster, W. (eds.): Spinning
the Semantic Web. MIT Press, Cambridge
London (2003) 317-359
[PKO 04] Popov, B., Kiryakov, A., Ognyanoff, D.,
Manov, D., Kirilov, A.: KIM – A Semantic
Platform for Information Extaction and
Retrieval. Journal of Natural Language
Engineering 10:3-4 (2004) 375- 392
[RSD 04] Rocha, C., Schwabe, D., de Aragão, M. P.: A
Hybrid Approach for Searching in the
Semantic Web. International World Wide Web
Conference (WWW 2004), New York (2004)
374-383
[SMM 83] Salton, G., McGill, M. Introduction to Modern
Information Retrieval. McGraw-Hill, New
York (1983)
[SSS 03] Stojanovic, N., Studer, R., Stojanovic, L.: An
Approach for the Ranking of Query Results in
the Semantic Web. 2nd International Semantic
Web Conference (ISWC 2003). LNCS Vol.
2870 (2003) 500-516
[VFC 05] Vallet, D., Fernández, M., Castells, P.: An
Ontology-Based Information Retrieval Model.
2nd European Semantic Web Conference
(ESWC 2005). LNCS Vol. 3532 (2005) 455-
470
Recommended