
Towards a context sensitive approach to searching information based on domain specific knowledge sources


Web Semantics: Science, Services and Agents on the World Wide Web 12–13 (2012) 41–52

Contents lists available at SciVerse ScienceDirect

Web Semantics: Science, Services and Agents on the World Wide Web

journal homepage: http://www.elsevier.com/locate/websem

Towards a context sensitive approach to searching information based on domain specific knowledge sources

Duy Dinh*, Lynda Tamine
IRIT Laboratory, University of Toulouse, 31062 Toulouse, France

Article info

Article history: Available online 10 December 2011

Keywords: Context sensitive information retrieval; Document expansion; Query expansion; Biomedical information retrieval

1570-8268/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.websem.2011.11.009

* Corresponding author. Tel.: +33 561556300. E-mail addresses: [email protected] (D. Dinh), [email protected] (L. Tamine).

Abstract

In the context of document retrieval in the biomedical domain, this paper introduces a novel approach to searching for biomedical information using contextual semantic information. More specifically, we propose to combine the contextual semantic information in documents and user queries in an attempt to improve the performance of biomedical information retrieval (IR) systems. Contextual information provides knowledge about a domain in a global context or statistical properties of a sub-collection of documents related to a given query in a local context. In our context sensitive IR approach, terms denoting concepts are extracted from each document using several biomedical terminologies. Preferred terms denoting concepts are used to enrich the semantics of the document content via document expansion. The user query is expanded using terms extracted from the top-ranked expanded documents via a blind feedback query expansion approach. In addition, we aim to evaluate the utility of incorporating several terminologies within the proposed context sensitive approach. The experiments carried out on the TREC Genomics 2004 and 2005 test sets show that our context-sensitive IR approach significantly outperforms state-of-the-art baseline approaches.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Biologists search for literature on a daily basis using commercial literature search engines such as PubMed or Google Scholar. For example, their information needs may involve published articles describing the specifics of how genes contribute to disease in organisms. However, none of those search tools provide explicit support for genomic-focused queries.

Information retrieval (IR) is a scientific research field concerned with the design of models and techniques for selecting relevant information in response to user queries within a collection (corpus) of documents. Two main steps characterize an IR process: document indexing and document–query matching. The objective of the indexing stage is to assign to each document in the collection the set of words, terms or concepts expressing the topic(s) or subject matter(s) addressed in the document. The matching stage aims at identifying the documents that best fit the query. Several different issues arise from both indexing and matching in IR [4]. In this paper, we are particularly interested in biomedical IR, where collections entail medical knowledge and queries cover the information needs of physicians, researchers in the biomedical domain or, more generally, users of biomedical search tools.


Within the context of the biomedical sciences, there is currently a real need to develop efficient and effective IR systems for helping scientists find desired information in the biomedical literature. Indeed, in recent years, genomics IR has attracted considerable attention from IR researchers. Several IR approaches have been proposed to improve biomedical IR effectiveness, following either a classical or a semantic IR approach.

Classical or traditional IR approaches rely on word-based representations of the query and documents in the collection. The document–query matching between keywords from the user's query and documents is realized under the basic term independence assumption [29]. The specification of the user information need is based entirely on words figuring in the original query in order to retrieve documents containing those words. Such approaches are limited by the absence of relevant keywords as well as by term variation in documents and user queries (e.g., acronyms, homonyms, synonyms, etc.). These issues have been addressed in semantic IR approaches, which take into account the meaning of terms and the semantic relatedness between senses in termino-ontological resources to enhance document/query representations.

Semantic IR approaches are an attempt to go beyond simple term matching by relaxing the strong assumption of term independence and also to cope with term variation in documents/queries [14,22,5,37,9,34]. The centerpiece of semantic IR models is how to identify terms denoting domain concepts in documents/queries in a given context (lexical chains, ontologies, semantic networks, or an entire collection or sub-collection) in order to highlight the semantics of the document/query. However, research along these lines has reported mixed results. The reason could be the nature of the user tasks (e.g., writing a report, making a decision, etc.), the knowledge about the domain (e.g., the user's background knowledge or expertise), the knowledge about the problem structure (e.g., the cognitive structure of expert information seekers, computational artifacts), or the choice of the conceptual representation, i.e., the context where concepts are extracted for a particular document/query and how the conceptual model is used to represent the document/query. For this last reason, conceptual IR systems involve several parameters: the vocabulary or terminology employed, the concept extraction method, and the strategy by which extracted concepts are used to represent the semantics of the document/query.

In this paper, we present a novel context sensitive IR approach to searching for information based on domain knowledge sources and statistical properties of the sub-collection. More specifically, we propose to combine the document's global context (domain knowledge sources) and the query's local context (top-ranked documents) in an attempt to increase the term overlap between the user's query and documents in the collection through document expansion (DE) and query expansion (QE).

• For DE, documents are expanded with concepts extracted from several terminologies. We propose to combine concepts extracted from multiple terminologies using data fusion techniques to extract the most important concepts for each document. The terminologies, where terms denoting concepts are extracted, are referred to as the global context of the document.
• For QE, queries are expanded with terms extracted from the top-ranked expanded documents obtained in the first retrieval stage, using statistical measures to compute the most related terms of each query. The top-ranked expanded documents, where related terms of the query are examined, are referred to as the local context.
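As a concrete illustration of how these two expansion steps chain together, the following sketch uses a toy in-memory corpus; the term-overlap scoring, the frequency-based term selection and all data are illustrative assumptions, not the actual retrieval model described later in the paper:

```python
from collections import Counter

def expand_document(doc_tokens, concept_terms):
    """DE: enrich a document with preferred terms denoting its concepts."""
    return doc_tokens + concept_terms

def retrieve(query_tokens, docs):
    """First-stage retrieval; plain term overlap stands in for a real ranking model."""
    scored = [(sum(d.count(t) for t in query_tokens), i) for i, d in enumerate(docs)]
    return [i for score, i in sorted(scored, reverse=True) if score > 0]

def expand_query(query_tokens, docs, ranking, h=2, k=3):
    """QE: blind feedback adding the k most frequent new terms
    from the h top-ranked expanded documents."""
    pool = Counter()
    for i in ranking[:h]:
        pool.update(t for t in docs[i] if t not in query_tokens)
    return query_tokens + [t for t, _ in pool.most_common(k)]

# Toy collection: each document already expanded at indexing time (DE).
docs = [
    expand_document(["brca1", "mutation", "risk"], ["breast", "neoplasms"]),
    expand_document(["tp53", "pathway"], ["tumor", "suppressor"]),
]
ranking = retrieve(["brca1"], docs)                  # first retrieval stage
new_query = expand_query(["brca1"], docs, ranking)   # query for the second stage
```

The expanded query would then be submitted to a second retrieval pass over the expanded collection.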

The remainder of the paper is structured as follows: Section 2 provides an overview of related work in biomedical IR dealing with terminological resources. Section 3 presents our concept-based IR approach, which takes into account the semantic context of the document/query to improve biomedical IR effectiveness. Section 4 describes our experimental methodology and results. We then discuss several aspects of our IR approach in Section 5, before concluding the paper and outlining directions for future work.

2. Background and related work

This section describes relevant background knowledge about terminologies used in the biomedical IR domain and work related to our interests. We first present some terminologies currently used in the literature, viewed as a global context where medical concepts as well as the relationships between them are defined. Afterwards, we present some data smoothing techniques (e.g., document expansion, query expansion), which can be categorized into two main approaches: local context analysis vs. global context analysis. By global context, we mean that concepts or related terms are extracted using a knowledge source or a whole collection, independently of the input text (document or query). By local context, related terms or concepts are extracted for a given text (document or query) using statistical properties of the sub-collection (top-ranked documents, k nearest concepts, etc.) related to the corresponding text. Finally, we summarize related work dealing with search context for enhancing document/query representations using either a local context (e.g., a sub-collection, top-ranked documents) or a global context (e.g., a whole collection, a single terminology or several terminologies).

2.1. Biomedical terminologies as a global context

Several biomedical terminologies have been used by different research groups in IR, especially in the context of TREC Genomics. The motivation of TREC Genomics was to support research and development in biomedical IR to drive new experimental research in the area of drug discovery for diseases. Since the commencement of TREC Genomics in 2003, several participants have tried to improve the performance of classical IR approaches by incorporating domain knowledge sources into a conceptual IR model. Generally speaking, a conceptual IR model can be viewed as a context-sensitive model because conceptual information is extracted within a particular context, e.g., a thesaurus, an ontology, or related documents. We review in what follows the termino-ontological resources that have been most widely used for indexing biomedical documents: MeSH, ICD-10, SNOMED and GO.

2.1.1. Medical Subject Headings

The Medical Subject Headings (MeSH) thesaurus is the standardized vocabulary developed by the National Library of Medicine for indexing, cataloging, and searching biomedical literature. Currently, it contains more than 25,000 terms (called descriptors or main headings) that describe biomedical concepts used in biomedical citations in a bibliographic database, e.g. MEDLINE. MeSH descriptors are organized into 16 categories, each of which is divided into more specific subcategories. Within each category, descriptors are organized in a hierarchical structure of up to eleven levels. In addition, MeSH uses "Entry Term" and "See also" references to indicate semantic relations such as synonyms, near-synonyms, and related concepts of a particular term.

Although MeSH is comprehensive and well maintained, it has several drawbacks. First, the synonymous relationship is not clearly listed and not differentiated from the related-term relation in MeSH. Second, many descriptors do not have corresponding "Entry" vocabularies listed, which means that synonyms cannot be found for many terms in MeSH. Third, the design of MeSH does not follow the ANSI thesaurus standard, which results in problems of interoperability and reusability.

2.1.2. International Statistical Classification of Diseases

The International Statistical Classification of Diseases, 10th revision (ICD-10) is a medical classification list for the coding of diseases, signs and symptoms, abnormal findings, complaints, social circumstances and external causes of injury or diseases, as maintained by the World Health Organization (WHO). The ICD is the international standard diagnostic classification for general epidemiological purposes, many health management purposes and clinical use. These include the analysis of the general health situation of population groups and the monitoring of the incidence and prevalence of diseases and other health problems in relation to other variables such as the characteristics and circumstances of the individuals affected, reimbursement, resource allocation, quality and guidelines.

It is used to classify diseases and other health problems recorded on many types of health and vital records, including death certificates and health records. In addition to enabling the storage and retrieval of diagnostic information for clinical, epidemiological and quality purposes, these records also provide the basis for the compilation of national mortality and morbidity statistics by WHO Member States.


2.1.3. Systematized Nomenclature of Medicine

The Systematized Nomenclature of Medicine (SNOMED) is a multi-axial coded nomenclature developed and supported by the College of American Pathologists. Different from a classification system, SNOMED is a multi-axial coded medical nomenclature which allows the recording of all disease entities regardless of prevalence, as well as all observations related to any particular case. SNOMED covers all fields of medicine as well as human dentistry and veterinary medicine. More details about the description of each axis can be found in [7].

2.1.4. Gene Ontology

The Gene Ontology (GO) project provides an ontology of defined terms representing gene product properties. The ontology covers three domains: cellular component, the parts of a cell or its extracellular environment; molecular function, the elemental activities of a gene product at the molecular level, such as binding or catalysis; and biological process, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.

2.2. Data smoothing techniques in context: query expansion vs. document expansion

In order to close the semantic gap between the user's query and documents in the collection, several research works have focused on applying data smoothing techniques such as document expansion and query expansion to the original document/query. Theoretically, such techniques enhance the semantics of the document/query by bringing the query closer to the relevant documents in the collection. As stated earlier, semantic information can be detected in a global context (usually from a domain knowledge source or an entire collection) or a local context (usually from a sub-collection of related top-ranked documents).

The principal goal of QE is to increase search performance by increasing the likelihood of term overlap between a given query and documents that are likely to be relevant to the user query. Current approaches to QE can be subdivided into two main categories: global analysis [30,35,20,15,32,21] and local analysis [27,31,2,1]. Global techniques aim to discover word relationships in a large collection (global context) such as Web documents [30] or external knowledge sources like WordNet [35], MeSH [15,21], UMLS [37,20] or multiple terminological resources [32]. Local techniques emphasize the analysis of the top-ranked documents (local context) retrieved for a given query in a previous retrieval stage [31,1].

Similar to QE approaches, DE can help enhance the semantics of the document by expanding the document with its most informative terms. This technique has been used recently in the context of textual document IR [6,33] as well as in the context of biomedical IR [20,15]. There are two principal ways of performing document expansion: in a local context vs. a global context. In a local context (e.g., the k nearest related documents) of a given document, similar terms are extracted to highlight related subject matters in the document. In a global context, terms are extracted from a whole collection (usually a set of concepts, or a whole set of documents). The difference between DE and QE is basically the timing of the expansion step: in DE, terms are expanded during the indexing phase for each individual document, while in QE only query terms are expanded, at retrieval time.

2.3. Related work

Biomedical IR has attracted considerable interest over the last two decades and is receiving more and more attention from the IR community. In the area of genomics, user queries tend to focus on genes and their corresponding proteins. More specifically, geneticists are interested in the role of genes and proteins in biological processes in the body through their interactions with other genes and their products. The main goal of the Genomics task is to provide support for knowledge discovery of new methods for disease prevention and treatment. One of the main challenges encountered by any IR system is dealing with term variation in natural language. Many research works have focused on the use of knowledge sources or terminologies for indexing and retrieving biomedical documents [3,28,1,15,18,10]. The idea is to bring the document representation to a conceptual representation level by means of concepts, which can be extracted manually or automatically from documents/queries. Manual concept extraction is undertaken by human experts with many years of experience. Automatic concept extraction is less expensive in terms of cost and time and thus could be an alternative for assisting the manual task. Several methods for extracting concepts from documents/queries have been studied extensively in the literature [19,3,28,38,13,18].

2.3.1. Mono-terminology approaches to semantic IR in the biomedical domain

In the biomedical domain, several works have been undertaken in an attempt to enhance the semantics of the document and/or the user's query using semantic data smoothing techniques, including QE [36,1,21,32] and/or DE [20,15]. The work in [1] adapted the local analysis QE approach for evaluating IR performance when searching MEDLINE documents. Their approach relies on blind feedback, selecting the best terms from the top-ranked documents in the local context of the query. Candidate terms for QE are weighted using a linear combination of the within-query term frequency and the inverse document frequency, according to whether the term appears in the query and/or the document. Furthermore, they compared the performance of MeSH-based manual DE to the classical IR approach and reported that MeSH-based DE outperforms the baseline. Using a global approach, the work in [21] investigated QE using MeSH to expand terms that are automatically mapped to the user query via PubMed's Automatic Term Mapping (ATM) service, which basically maps untagged terms from the user query to lists of pre-indexed terms in PubMed's translation tables (MeSH, journal and author). [15] combined both QE and DE using the MeSH thesaurus to retrieve medical records in the ImageCLEF 2008 collection. More concretely, they combined an IR-based approach of QE and DE for a conceptual indexing and retrieval purpose. For each MeSH concept, its synonyms and description are indexed as a single document in an index structure. A piece of text, the query to the retrieval system, is classified with the best ranked MeSH concepts. Finally, identified terms denoting MeSH concepts are used to expand both the document and the query.
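One plausible reading of such a candidate-term weighting is sketched below; the parameter names alpha and beta and the exact form of the combination are our illustrative assumptions, not necessarily the formula used in [1]:

```python
import math

def qe_term_score(term, query_tf, feedback_tf, N, df, alpha=0.5, beta=0.5):
    """Score a QE candidate as a linear combination of its within-query
    frequency and an idf-weighted frequency in the feedback documents,
    so that terms occurring in the query and/or the documents score higher."""
    idf = math.log(N / df[term]) if df.get(term, 0) > 0 else 0.0
    return alpha * query_tf.get(term, 0) + beta * feedback_tf.get(term, 0) * idf
```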

2.3.2. Multi-terminology approaches to semantic IR in the biomedical domain

In order to enhance the semantics of documents indexed in a health portal, Pereira et al. [25] proposed a multi-terminology concept extraction approach based on the bag-of-words representation of concepts in ontologies and documents in the collection. In their approach, each sentence in the document is represented as multiple bags of words, independently of the word order correlation between words in the sentence and those in concepts. According to their evaluation using a set of five different terminologies (MeSH, ICD10, SNOMED, CCAM and TUV) on a small collection of 18,814 documents indexed manually by four professional experts, they concluded that the multi-terminology approach outperforms concept extraction relying on a single terminology in terms of recall. Similarly, Darmoni et al. [8] presented a multi-terminology approach for assigning biomedical concepts drawn from more terminologies (MeSH, ICD10, SNOMED, CCAM, TUV, ATC, INN, Orphanet thesaurus, MeSH Supplementary concepts) to documents in the CISMeF portal, but the concept extraction between free text and terminologies is based on the simple bag-of-words representation.

Fig. 1. The multi-terminology based indexing and retrieval process.

In the context of the TREC Genomics evaluation, Stokes et al. [32] exploited several medical knowledge sources such as MeSH, Entrez Gene, SNOMED, UMLS, etc. for expanding the query with synonyms, abbreviations and hierarchically related terms identified using PubMed's automatic term mapping service. Furthermore, they also defined several rules for filtering the candidate terms according to each knowledge source. Zhou et al. [36] proposed knowledge-intensive conceptual retrieval combining the global context (i.e., concepts in several termino-ontological resources such as MeSH, Entrez Gene, ADAM) with the local context of the query. MeSH terms are identified using the PubMed ATM service, Entrez Gene is used for identifying gene names and symbols, while abbreviations are recognized using the ADAM abbreviation database. They combined the knowledge sources (global context) and the local context (top-ranked documents) of the query in a query expansion setting and reported an improvement of 23% over the baseline.

Biomedical concept extraction, whether based on the use of terminologies or not, is a key component of semantic IR approaches. Indeed, concepts extracted from documents allow representing the subject matters of each document via MeSH terms (main headings or subject headings) [25,8,23]. Concepts extracted from the user query allow enriching the semantics of the query via query expansion [37,36,15]. However, to the best of our knowledge, works dealing with biomedical terminologies have typically focused either on the evaluation of indexing performance in the context of an information extraction task or on the evaluation of query expansion performance in the context of an IR task. There is so far no work investigating the evaluation of multi-terminology indexing for biomedical IR, i.e., the utility of indexing biomedical documents using multiple terminologies for biomedical information retrieval.

Our contribution in this paper is essentially to evaluate the impact of using terminological resources for detecting contextual information in the document content and the user's query in order to improve biomedical IR effectiveness. Compared to previous work, our major contributions are three-fold:

• First, we use an approximate concept extraction method to identify concepts in each document using a mono-terminology. Candidate concepts are weighted to measure their relevance to the document.

• Second, we apply the concept extraction process to several terminologies and combine the resulting concept lists using voting techniques. We see each concept identified from each document using multiple terminologies as an implicit vote for the document. Therefore, the multi-terminology based concept extraction can be modeled as a voting problem. The final concept list is considered to reveal the document's subject matter(s) and can be used for DE/QE.
• Third, unlike previous works [36,15,32,21], which only focus on QE/DE using the global context (UMLS, MeSH, etc.), only QE/DE using the local context (corpus-based) [6,33], or only QE using both the local and global context [36], we aim to point out that the combination of the document's global context (knowledge sources) and the query's local context (top-ranked documents) could be a source of evidence to improve biomedical IR effectiveness.

3. Our context-sensitive IR approach

Our context sensitive IR approach relies on two main steps, detailed below: (1) Conceptual Document Indexing and (2) Context Sensitive Document Retrieval. We integrate them into a biomedical IR process as the combination of the global and local semantic contexts for improving biomedical IR effectiveness. The contextual semantic information is detected using domain knowledge sources and statistical information in a sub-collection. The former is referred to as the global context while the latter is referred to as the local context. Therefore, the contextual semantic information of a given query is revealed by concepts extracted from the global context during document expansion and by related terms from the local context during query expansion. Fig. 1 depicts the two main stages of our biomedical IR approach.

During the indexing stage, each document in the collection is analyzed to extract the most significant concepts using several terminologies. Our assumption behind multi-terminology based concept extraction is that the more concepts are found in several terminologies, the more important they are in the description of the document, since they are well recognized in several sub-domains of medicine. For concept extraction, we adopt MaxMatcher, an approximate dictionary-based lookup method [38]. Given a document, MaxMatcher extracts a set of terms or phrases denoting domain concepts as well as their corresponding concept unique identifiers (CUIs). However, MaxMatcher does not measure the importance of each concept for describing the semantics of the document. To achieve this, we use the BM25 term weighting model [26] to measure the degree to which each concept describes the semantics of the document. Formally:


Table 1
Description of the voting techniques used for a multi-terminology based concept extraction.

Category    | Technique | score(c_j, D)                          | Description
------------|-----------|----------------------------------------|-----------------------------------------------------
Rank-based  | CombRank  | Σ_{i=1}^{n} (‖R(D, T_i)‖ − r_ji^D)     | Sum of concept ranks
            | CombRCP   | Σ_{i=1}^{n} 1 / r_ji^D                 | Sum of inverse concept ranks
Score-based | CombSUM   | Σ_{i=1}^{n} w_ji^D                     | Sum of concept scores
            | CombMIN   | min{w_ji^D, i = 1..n}                  | Minimum concept score
            | CombMAX   | max{w_ji^D, i = 1..n}                  | Maximum concept score
            | CombMED   | median{w_ji^D, i = 1..n}               | Median of concept scores
            | CombANZ   | CombSUM ÷ ‖{T_i : c_j ∈ R(D, T_i)}‖    | CombSUM divided by the number of lists containing c_j
            | CombMNZ   | CombSUM × ‖{T_i : c_j ∈ R(D, T_i)}‖    | CombSUM multiplied by the number of lists containing c_j

D. Dinh, L. Tamine / Web Semantics: Science, Services and Agents on the World Wide Web 12–13 (2012) 41–52 45

w_{ji}^D = \frac{1}{\ell} \sum_{k=1}^{\ell} \frac{tf(t_k) \cdot \log\frac{N - n_k + 0.5}{n_k + 0.5}}{k_1 \cdot \left((1 - b) + b \cdot \frac{dl}{avg\_dl}\right) + tf(t_k)} \qquad (1)

where t_k is the k-th constituent¹ of concept c_j; tf(t_k) is the number of occurrences of term t_k in document D; N is the total number of documents in the collection; n_k is the number of documents containing word t_k; dl is the document length; avg_dl is the average document length; k_1 and b are parameters; ℓ is the number of words comprising concept c_j.
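Eq. (1) can be sketched in a few lines. The function name and all input values below are hypothetical; in a real system the tf, document frequency and length statistics would come from the index:

```python
import math

def concept_weight(constituent_tfs, doc_freqs, N, dl, avg_dl, k1=1.2, b=0.75):
    """BM25-style weight of a concept (Eq. (1)): average over its constituents.

    constituent_tfs: tf(t_k) of each constituent term in the document
    doc_freqs:       document frequency n_k of each constituent
    """
    ell = len(constituent_tfs)
    total = 0.0
    for tf, nk in zip(constituent_tfs, doc_freqs):
        idf = math.log((N - nk + 0.5) / (nk + 0.5))
        denom = k1 * ((1 - b) + b * dl / avg_dl) + tf
        total += tf * idf / denom
    return total / ell

# 'breast neoplasms' with constituent tf = [3, 2] and df = [1000, 400]
# in a 100,000-document collection (all values hypothetical):
w = concept_weight([3, 2], [1000, 400], N=100_000, dl=120, avg_dl=130)
```

A concept whose constituents are rare (low n_k) and frequent within the document thus receives a high weight, exactly as for ordinary BM25 terms.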

Let C(D) be the set of concepts extracted from document D and C(T_i) be the set of concepts defined in terminology T_i. For each document D, the list of candidate concepts, denoted R(D, T_i) = {c_j | c_j ∈ C(D) ∧ c_j ∈ C(T_i)}, is extracted using terminology T_i. We need to find the final set R(D, T) containing the most relevant concepts for document D among the ones identified from several terminologies: R(D, T) = \bigcup_{i=1}^{n} R(D, T_i), where T = {T_1, T_2, ..., T_n} and n is the number of terminologies used for indexing.

During the retrieval stage, the top k terms from the h top-ranked expanded documents retrieved in the first retrieval stage are used to expand the original user's query. We now detail our context-sensitive IR approach via its two main steps: (1) Conceptual Document Indexing and (2) Context Sensitive Information Retrieval.

3.1. Conceptual document indexing: how to extract key concepts from multiple terminologies?

Given a collection of documents and n terminologies used for indexing, we first extract concepts from each document D using a particular terminology T_i, i.e., we obtain n lists of concepts for document D. We need to fuse these n concept lists to obtain a final list of unique concepts representing the various subject matters of document D. Our fusion method for concept extraction is based on well-known data fusion techniques (e.g., CombMAX, CombMIN, CombSUM, CombMNZ, etc.) that have been used to combine data from different information sources [12]. Our purpose here is to select the best concepts issued from several terminologies by means of voting scores assigned to candidate concepts. For this purpose, we propose to combine rankings of the concepts extracted from each document using their matching scores and/or their ranks. Intuitively, concept fusion can be seen as the voting problem described as follows.

We compute the combined score of a candidate concept c_j voting for document D, given its score w_{ji}^D and rank r_{ji}^D when using terminology T_i, as the aggregation of the votes of all identified concepts. We consider two sources of evidence when aggregating the votes for each candidate concept: (E1) scores of the identified concept voting for each document; (E2) ranks of the identified concept voting for each document.

¹ In this paper, a constituent is a term forming a part of a concept. For example, 'breast' and 'neoplasms' are two constituents of the concept 'breast neoplasms'.

We evaluate 8 voting techniques based on known data fusion methods [12], which aggregate the votes from several rankings of concepts into a single ranking, using the ranks and/or scores of candidate concepts. The lists of concepts extracted from each document using several terminologies are merged to obtain a final single concept list representing the document's subject matter(s). The optimal number of extracted concepts is retained for expanding the document content, in an attempt to enhance its semantics. Such a technique is also known as DE or concept tagging for document smoothing in the context of a specific domain.

Table 1 depicts all the voting techniques that we use and evaluate in this work. They are grouped into two categories according to the source of evidence used. The ||·|| operator indicates the number of concepts having a non-zero score in the described set; r_{ji}^D is the rank of concept c_j defined in terminology T_i and extracted from document D; and w_{ji}^D is the score of concept c_j, defined in T_i and extracted from document D, computed using the probabilistic BM25 scheme [26].
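A minimal sketch of the score-based voting techniques of Table 1; the dictionaries of CUI-to-score mappings are toy stand-ins for the per-terminology concept rankings:

```python
def fuse(concept_lists, technique="CombMNZ"):
    """Fuse per-terminology concept rankings into one list (data fusion voting).

    concept_lists: one dict per terminology mapping CUI -> BM25 concept score.
    Returns CUIs sorted by fused score, descending.
    """
    votes = {}
    for ranking in concept_lists:
        for cui, score in ranking.items():
            votes.setdefault(cui, []).append(score)

    def combined(scores):
        if technique == "CombSUM":
            return sum(scores)
        if technique == "CombMNZ":   # CombSUM x number of non-zero votes
            return sum(scores) * len(scores)
        if technique == "CombANZ":   # CombSUM / number of non-zero votes
            return sum(scores) / len(scores)
        if technique == "CombMIN":
            return min(scores)
        if technique == "CombMAX":
            return max(scores)
        raise ValueError(technique)

    fused = {cui: combined(s) for cui, s in votes.items()}
    return sorted(fused, key=fused.get, reverse=True)

mesh   = {"C0013352": 8.2, "C0026597": 5.4}   # hypothetical MeSH extraction
snomed = {"C0013352": 7.9, "C0001480": 6.1}   # hypothetical SNOMED extraction
print(fuse([mesh, snomed], "CombMNZ"))
# -> ['C0013352', 'C0001480', 'C0026597']
```

Under CombMNZ, C0013352 is boosted because it is found by both terminologies, which is exactly the intuition behind the multi-terminology voting.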

3.2. Context sensitive document retrieval

Document retrieval aims at matching the user's query to documents in order to retrieve a list of results that may satisfy the user information need. In our work, we use the well-established probabilistic BM25 term weighting model [26] to rank documents, which are expanded with concepts extracted using multiple terminologies, w.r.t. a user query, where the relevance score of a document D for a query Q is:

score(D, Q) = \sum_{t \in Q} \frac{(k_1 + 1) \cdot tfn}{K + tfn} \cdot \frac{(k_3 + 1) \cdot qtf}{k_3 + qtf} \cdot w^{(1)} \qquad (2)

where

- tfn is the normalized within-document term frequency given by:

  tfn = \frac{tf}{(1 - b) + b \cdot \frac{dl}{avg\_dl}} \qquad (3)

  where tf is the within-document term frequency, and dl and avg_dl are respectively the document length and the average document length,
- k_1, k_3 and b are tuning parameters,
- K is k_1 \cdot ((1 - b) + b \cdot dl / avg\_dl),
- qtf is the within-query term frequency,
- w^{(1)} is the idf (inverse document frequency) factor computed as:

  w^{(1)} = \log_2 \frac{N - N_t + 0.5}{N_t + 0.5} \qquad (4)

  where N is the total number of documents in the collection and N_t is the number of documents containing term t (document frequency).
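Eqs. (2)-(4) can be sketched as follows. We read tfn as the length-normalized term frequency of Eq. (3), which is one plausible reading of the garbled original; the function name and toy statistics are ours:

```python
import math

def bm25_score(query_tf, doc_tf, doc_freq, N, dl, avg_dl,
               k1=1.2, k3=8.0, b=0.75):
    """Relevance score of one document for one query, per Eqs. (2)-(4)."""
    K = k1 * ((1 - b) + b * dl / avg_dl)
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        tfn = tf / ((1 - b) + b * dl / avg_dl)                         # Eq. (3)
        w1 = math.log2((N - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))  # Eq. (4)
        score += ((k1 + 1) * tfn / (K + tfn)) \
                 * ((k3 + 1) * qtf / (k3 + qtf)) * w1                  # Eq. (2)
    return score

# Hypothetical single-term query against one expanded document:
s = bm25_score({"dynein": 1}, {"dynein": 3}, {"dynein": 500},
               N=100_000, dl=120, avg_dl=130)
```

Because expanded documents contain the preferred terms of their extracted concepts, a query term matching an expansion term contributes to the score even when the author used a different surface form.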


Fig. 2. Variation of document length in TREC Genomics collections.


In order to improve the IR performance, we expand the user's query with related terms extracted from the top-ranked expanded documents returned by the BM25 function. Our QE approach is based on statistical properties of a sub-collection, referred to as local context QE. More specifically, the local context QE applies a blind-feedback technique to select the best terms from the top-ranked expanded documents of the first retrieval stage. In this expansion process, terms in the top-returned documents are weighted using a particular Divergence From Randomness (DFR) term weighting model [2]. In our work, the Bose-Einstein statistics [2] is used to weight terms in the expanded query q_e derived from the original query q. Formally:

weight(t \in q_e) = tfq_n + \beta \cdot \frac{Info_{Bo1}}{MaxInfo} \qquad (5)

where

- tfq_n = \frac{tfq}{\max_{t \in q} tfq} is the normalized term frequency of term t in the original query,
- MaxInfo = \max_{t \in q_e} Info_{Bo1},
- Info_{Bo1} is the normalized term frequency in the expanded query induced by using the Bose-Einstein statistics, that is:

  Info_{Bo1} = -\log_2 Prob(Freq(t|K) \mid Freq(t|C)) = -\log_2 \frac{1}{1 + \lambda} - Freq(t|K) \cdot \log_2 \frac{\lambda}{1 + \lambda} \qquad (6)

² A MEDLINE citation is a reference to an original journal article which has been selected for indexing in MEDLINE. Each citation contains several fields including an identifier, a title, an abstract, several authors, etc. It is provided with a dozen MeSH terms manually assigned by librarians.

where Prob is the probability of obtaining a given frequency of the observed term t within the topmost retrieved documents, namely K; C is the set of documents in the collection; \lambda = Freq(t|C) / N, with N the number of documents in the collection; and \beta = 0.4. The number of top-ranked documents and the number of terms added to the original query are tuning parameters.
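The Bo1 weighting of Eqs. (5)-(6) can be sketched as follows; the function name and toy frequency statistics are our own assumptions:

```python
import math

def bo1_expand(query_tf, feedback_tf, collection_freq, N, beta=0.4, n_terms=20):
    """Bo1 blind-feedback expansion weights, per Eqs. (5)-(6).

    query_tf:        tfq, term frequency in the original query
    feedback_tf:     Freq(t|K), frequency in the top-ranked (expanded) documents
    collection_freq: Freq(t|C), frequency in the whole collection of N documents
    """
    # Eq. (6): information content of each candidate under Bose-Einstein stats.
    info = {}
    for t, fk in feedback_tf.items():
        lam = collection_freq[t] / N
        info[t] = -math.log2(1 / (1 + lam)) - fk * math.log2(lam / (1 + lam))

    max_info = max(info.values())
    max_qtf = max(query_tf.values()) if query_tf else 1

    # Eq. (5): normalized query tf plus the normalized Bo1 information.
    candidates = sorted(info, key=info.get, reverse=True)[:n_terms]
    return {t: query_tf.get(t, 0) / max_qtf + beta * info[t] / max_info
            for t in candidates}

weights = bo1_expand({"dynein": 1},
                     {"dynein": 12, "cilia": 7},
                     {"dynein": 900, "cilia": 1500},
                     N=48_753)
```

Terms that are frequent in the feedback documents but rare in the collection receive high Info_Bo1 values, so they dominate the expanded query.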

4. Experimental evaluation

The objectives of the experimental evaluation were:

1. To determine the utility of combining the global context (knowledge sources) of the document and the local context (top-ranked documents) of the query via document expansion and query expansion, which we believe to be a relevant source of evidence for resolving the term mismatch problem in biomedical IR;
2. To demonstrate the advantages of integrating several biomedical terminologies into a biomedical IR process; and
3. To compare the IR performance of multi-terminology indexing to state-of-the-art IR approaches.

We describe in what follows the datasets, the experimental setup and the evaluation measures, and then present and discuss the results.

4.1. Datasets

4.1.1. Test collections
We validate our concept-based IR approach using two collections: TREC Genomics 2004 [16] and TREC Genomics 2005 [17], which are subsets of about 4.6 million MEDLINE citations² from 1994 to 2003, under the Terrier IR platform [24]. TREC Genomics test collections have been created since 2003. The 2004-2005 TREC Genomics collections aim at evaluating the performance of ad hoc retrieval at the document level. The latest TREC Genomics collection was released in 2006 and reused in 2007, but with a changed objective: unlike the 2004-2005 collections, it provides a benchmark for evaluating the retrieval of exact answer passages in response to natural language questions (a question answering-style task). Our prototype IR system deals with biomedical information retrieval at the document level; therefore, the 2004-2005 datasets are the most appropriate for evaluating our IR approach.

Human relevance judgments were only made on a relatively small pool, built from the top-precedence run of each participant. Our prototype IR system only indexes and searches the human relevance-judged documents, i.e., the union of 50 single pools containing 48,753 citations in TREC Genomics 2004 and 41,018 in TREC Genomics 2005. We only applied


Table 2
Descriptive statistics of terminologies.

          Number of concepts   Number of entries
MeSH      25,586               221,004
SNOMED    106,397              150,400
ICD-10    9,250                11,630
GO        27,050               90,261


our conceptual indexing and retrieval method to the titles and abstracts of MEDLINE citations. Fig. 2 depicts the variation of document length in both the TREC Genomics 2004 and TREC Genomics 2005 documents. The average document length is 122 terms in TREC Genomics 2004 and 134 terms in TREC Genomics 2005.

There are 50 queries in TREC Genomics 2004 with an average query length of 17 informative terms³ and 49 queries in TREC Genomics 2005 with an average query length of 9 informative terms.

4.1.2. Biomedical terminologies
In the experiments described later, we primarily used four biomedical knowledge sources, namely MeSH, SNOMED, ICD-10 and GO, released in 2010, as controlled terminologies for indexing biomedical documents. Table 2 depicts some characteristics of each terminology: the total number of unique concepts as well as the total number of their entries. Each entry, which can be a synonym, a lexical variant or an acronym, corresponds to a term that refers to a particular concept. Each concept can have more than one entry term.

4.2. Evaluation measures

In general, IR performance covers two aspects: effectiveness and efficiency. The first reveals the capacity of the IR system to retrieve the most relevant information w.r.t. the user's information need, while the second reveals its capacity to provide fast and ordered access to large amounts of information. In the context of TREC experiments, we are only interested in effectiveness. For measuring IR effectiveness, we used the MAP metric, the Mean Average Precision calculated over all queries. The average precision of a query is computed by averaging the precision values computed at each relevant retrieved document of rank x ∈ {1, ..., K}, where K = 1000 is the number of retrieved documents. Our MAP results are generated by the trec_eval standard tool⁴ used by the TREC community for evaluating ad hoc retrieval runs.
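The MAP metric described above can be sketched as follows; as in trec_eval, average precision here divides by the total number of relevant documents for the query:

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average precision of one query: mean of P@x over relevant ranks x."""
    hits, total = 0, 0.0
    for x, doc in enumerate(ranked_ids[:cutoff], start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / x  # precision at this relevant rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """MAP over all queries; runs is a list of (ranking, relevant-set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
# P@1 = 1/1 at d1 and P@3 = 2/3 at d3, so AP = (1 + 2/3) / 2 = 0.8333...
```

In practice the official trec_eval binary should be used for reporting, but this sketch shows what the number measures.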

4.3. Experimental setup

In order to evaluate the utility of our context sensitive IR approach based on the use of domain knowledge sources and statistical properties of the sub-collection, we carried out three series of experiments:

• The first one (our strong baseline) is based on classical indexing of the title and abstract of articles using the well-established probabilistic model BM25 [26].
• The second one concerns our mono-terminology IR approach and consists of three sub-scenarios:
  1. The first one concerns document expansion using MeSH concepts identified by MaxMatcher [38], denoted DE_automatic or simply DE;
  2. The second one concerns query expansion using a blind feedback technique on the original documents (title + abstract) without DE (see formula (5)), denoted QE;
  3. The last one concerns our method, which relies on the combination of both QE and the automatic DE strategy as described above, denoted QE + DE.
• The third one concerns our multi-terminology IR approach, where the four terminologies MeSH, SNOMED, ICD-10 and GO are built into four dictionaries employed by MaxMatcher, which generates four concept lists for each document. First, concepts are extracted using each terminology separately to evaluate the influence of using different single terminologies for biomedical IR. Afterwards, we applied several voting techniques for merging the final list of identified concepts, as described in Section 3.1.

³ An informative term, or keyword descriptor, is a term that is useful for describing the semantics of the document.
⁴ http://trec.nist.gov/trec_eval/

Table 3 illustrates a MEDLINE citation that is expanded with terms denoting MeSH concepts extracted using MaxMatcher and with terms denoting concepts extracted from the document content using several terminologies. The TITLE and ABSTRACT fields represent the document content, and KERNEL_MESH and KERNEL_CombMNZ represent the concepts extracted from the document.

4.4. Experimental results

Due to the lack of a training dataset in the TREC Genomics 2004 and 2005 collections, we use the TREC Genomics 2004 collection as the training dataset when testing our IR approach on the 2005 collection, and vice versa. Therefore, the training dataset is always different from the testing dataset in our experiments.

4.4.1. Tuning parameter results
There are three important tuning parameters in the BM25 model, namely k_1, k_3 and b (see formula (2)). In order to optimize the IR performance, these parameters must be tuned appropriately to obtain the best configuration. In our work, we assume that the hyper-parameter b, which reflects the impact of the document length and the average document length on the IR performance, is the most important; we therefore only tune b over a set of typical values, {0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00}, and set k_1 and k_3 to 1.2 and 8.0, respectively [2]. For automatic QE, according to our previous work [11], we extract the 20 most informative terms from the 20 top-ranked documents of the first retrieval stage. First of all, we aim to estimate the number of concepts, namely N_c, used to expand the document content. On the two collections, we tune parameter N_c along with parameter b to find the optimal value of each.

Tables 4 and 5 show the MAP results of the automatic DE method using MeSH concepts extracted by MaxMatcher. We tuned the number of extracted concepts N_c expanded to each document from 5 to 50 with a step of 5 for TREC Genomics 2004, and the term frequency hyper-parameter b from 0.25 to 1.50 with a step of 0.25. We followed the same procedure on the TREC Genomics 2005 collection, except that N_c is tuned from 2 to 10 with a step of 2 and, beyond 10, with a step of 5. This allows us to determine the optimal number N_c of candidate concepts expanded to each document as well as the optimal value of parameter b. When N_c exceeds 30, the IR performance saturates. This can be explained by the fact that a maximum of about 30 concepts on average are extracted from each MEDLINE document.
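The exhaustive sweep over b and N_c described above amounts to a simple grid search. In the sketch below, evaluate() is a hypothetical callback that runs retrieval on the training collection with the given parameters and returns its MAP:

```python
def tune(evaluate,
         b_grid=(0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00),
         nc_grid=(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)):
    """Exhaustive grid search over b and N_c, keeping the best MAP."""
    best = max((evaluate(b=b, nc=nc), b, nc)
               for b in b_grid for nc in nc_grid)
    map_score, b, nc = best
    return {"MAP": map_score, "b": b, "Nc": nc}
```

With a toy evaluate() peaking at b = 1.00 and N_c = 30, tune() returns exactly that configuration, mirroring how the optimal setting was selected on the 2004 training collection.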

It is noticeable that the retrieval performance of the BM25 baseline scheme and of the automatic DE method differs between the TREC Genomics 2004 and TREC Genomics 2005 tasks: the best


Table 4
MAP results for MeSH-based DE on TREC Genomics 2004.

N_c    b=0.25   0.50     0.75     1.00     1.25     1.50     1.75     2.00
5      0.4153   0.4197   0.4222   0.4199   0.4039   0.3544   0.1853   0.1481
10     0.4164   0.4203   0.4231   0.4213   0.4060   0.3535   0.1851   0.1588
15     0.4176   0.4214   0.4232   0.4232   0.4080   0.3533   0.1848   0.1574
20     0.4177   0.4215   0.4237   0.4242   0.4084   0.3542   0.1848   0.1570
25     0.4176   0.4215   0.4237   0.4243   0.4084   0.3542   0.1847   0.1567
30     0.4179   0.4218   0.4239   0.4244   0.4085   0.3543   0.1847   0.1566
35     0.4179   0.4218   0.4239   0.4244   0.4085   0.3543   0.1847   0.1566
40     0.4179   0.4218   0.4239   0.4244   0.4085   0.3543   0.1847   0.1567
45     0.4179   0.4218   0.4239   0.4244   0.4085   0.3543   0.1847   0.1567
50     0.4179   0.4218   0.4239   0.4244   0.4085   0.3543   0.1847   0.1567

Table 5
MAP results for MeSH-based DE on TREC Genomics 2005.

N_c    b=0.25   0.50     0.75     1.00     1.25     1.50     1.75     2.00
2      0.2412   0.2442   0.2470   0.2452   0.2453   0.2391   0.2337   0.2267
4      0.2415   0.2443   0.2463   0.2444   0.2440   0.2400   0.2347   0.2300
5      0.2413   0.2436   0.2455   0.2443   0.2438   0.2396   0.2348   0.2303
6      0.2414   0.2446   0.2504   0.2455   0.2454   0.2403   0.2355   0.2311
8      0.2381   0.2408   0.2421   0.2417   0.2414   0.2386   0.2332   0.2296
10     0.2381   0.2401   0.2414   0.2403   0.2398   0.2369   0.2324   0.2281
15     0.2391   0.2414   0.2419   0.2411   0.2413   0.2391   0.2334   0.2284
20     0.2243   0.2270   0.2280   0.2277   0.2290   0.2257   0.2220   0.2191
25     0.2241   0.2261   0.2281   0.2281   0.2283   0.2258   0.2212   0.2183
30     0.2239   0.2255   0.2275   0.2277   0.2287   0.2254   0.2213   0.2180
35     0.2242   0.2260   0.2276   0.2280   0.2292   0.2254   0.2210   0.2181
40     0.2242   0.2260   0.2276   0.2281   0.2291   0.2253   0.2208   0.2181
45     0.2242   0.2259   0.2275   0.2281   0.2289   0.2254   0.2206   0.2181
50     0.2240   0.2258   0.2275   0.2282   0.2288   0.2254   0.2205   0.2180

Table 3
Example of a MEDLINE citation that is expanded with concepts extracted by MaxMatcher using a single terminology (MeSH) and multiple terminologies (MeSH, SNOMED, ICD-10 and GO). Extracted concepts are ranked by weight in descending order.

<DOC>
<DOCNO>10605437</DOCNO>
<TITLE>Structural conformation of ciliary dynein arms and the generation of sliding forces in Tetrahymena cilia.</TITLE>
<ABSTRACT>"The sliding tubule model of ciliary motion requires that active sliding of microtubules occur by cyclic cross-bridging of the dynein arms. When isolated, demembranated Tetrahymena cilia are allowed to spontaneously disintegrate in the presence of ATP, the structural conformation of the dynein arms can be clearly resolved by negative contrast electron microscopy." ... "Because the base-directed polarity of the bridged arms is opposite to the direction required for force generation in these cilia and because the bridges occur in the presence of ATP, it is suggested that the bridged conformation may represent the initial attachment phase of the dynein cross-bridge cycle. The force-generating phase of the cycle would then require a tip-directed deflection of the arm subunit attached to the B subfiber."</ABSTRACT>
<KERNEL_MESH>tetrahymena (C0039679; 8.3433)*; dynein (C0013352; 8.2212); arm (C0446516; 6.4393); motion (C0026597; 5.4157); displacement (C0012725; 5.1163);</KERNEL_MESH>
<KERNEL_CombMNZ>dynein (C0013352; 49.3272); motion (C0026597; 32.4942); atp (C0001480; 25.4916); electron microscopy (C0026019; 21.2070); tetrahymena (C0039679; 16.6866);</KERNEL_CombMNZ>
</DOC>

* (A; B): A corresponds to a Concept Unique Identifier (CUI) and B to the score of the extracted concept.


MAP obtained by BM25 is 0.4146 on TREC Genomics 2004 and 0.2476 on TREC Genomics 2005, while our automatic DE method achieves a best MAP of 0.4243 on the former and 0.2504 on the latter. The difference in MAP could be due to the nature of each year's task. In 2004, the topics are more descriptive, with a title and a description as well as a search context giving the background information needed to place the information need in context, while in 2005 the topics are given with a limited set of terms, most of which are gene and protein names with abbreviations along with their long forms. For the next experiments, we retain N_c = 30 (resp. N_c = 6) for DE on TREC Genomics 2004 (resp. TREC Genomics 2005), and parameter b is set to 0.75 on the former and 1.00 on the latter for retrieval.

Fig. 3 depicts the IR performance in terms of MAP of our automatic DE on each year's TREC Genomics task compared to the baseline BM25 weighting scheme. The improvement rates in terms


Fig. 3. MAP results obtained by tuning parameter b of the BM25 model and the MeSH based DE.

Table 6
MAP results of the baseline BM25, DE and/or QE.

          TREC Genomics 2004                     TREC Genomics 2005
          MAP      Δ (%)    Recall   Δ (%)      MAP      Δ (%)    Recall   Δ (%)
BM25      0.4146   —        0.8367   —          0.2476   —        0.8178   —
DE        0.4243   (+2.34)  0.8480   +1.35      0.2504   (+1.13)  0.7923   −3.12
QE        0.4475   (+7.99)  0.8687   +3.82      0.2500   (+0.97)  0.8312   +1.64
QE + DE   0.4507*  (+8.77)  0.8650   +3.38      0.2639*  (+6.58)  0.8497   +3.90

Paired sample t-test: * significant (p < 0.05).


of MAP of our DE method range from +2.13% to +10.51% on the TREC Genomics 2004 collection and from +0.04% to +39.30% on the TREC Genomics 2005 collection. When b is between 0 and 1, the MAP obtained by the BM25 model tends to be stable, but when b is greater than 1, the MAP decreases dramatically. By smoothing the document content with extracted terms denoting concepts, our DE method outperforms the BM25 scheme whatever the value of b. This clearly demonstrates the interest of smoothing the document content with concepts extracted from a domain knowledge source to address the term mismatch problem, especially in the biomedical domain.

4.4.2. Effectiveness of MeSH based indexing
At this level, we aim to demonstrate the utility of the combination of DE and QE for improving biomedical IR performance. Table 6 shows the MAP results of the classical and the MeSH-based IR approaches (cf. Section 4.3). According to the results, DE gives slight improvements over the baseline BM25 on both test collections: +2.34% for TREC Genomics 2004 and +1.13% for TREC Genomics 2005. QE outperforms BM25 with an improvement rate of +7.99% on TREC Genomics 2004 but only a slight improvement of +0.97% on TREC Genomics 2005. The combination of DE and QE is effective on both test collections, giving a MAP improvement of +8.77% on the TREC Genomics 2004 collection and +6.58% on the TREC Genomics 2005 collection.

In terms of recall, we observe that DE alone yields only a small improvement of +1.35% on TREC Genomics 2004 and does not improve recall on TREC Genomics 2005. QE alone yields an improvement of +3.82% on the former, but only +1.64% on the latter. The combination of QE and DE is more stable and shows a consistent improvement on both the TREC Genomics 2004 and 2005 collections (+3.38% and +3.90%). This clearly demonstrates the effect of document expansion combined with query expansion on biomedical IR performance.

As shown in Table 6, the paired-sample t-tests computed between the MAP rankings of the combination of document and query semantic contexts, namely QE + DE, and the baseline in each TREC year (in TREC Genomics 2004: M = 0.0532, t = 2.0756, df = 49, p = 0.0432; in TREC Genomics 2005: M = 0.1363, t = 2.1940, df = 48, p = 0.0331) show that our context sensitive IR approach significantly outperforms the baseline.

4.4.3. Effectiveness of multi-terminology indexing
Table 7 shows the IR performance of both mono- and multi-terminology IR on the TREC Genomics 2004 and 2005 collections. We compared the results obtained by mono- and multi-terminological indexing to the median run of all participants of each TREC year. According to the results, most of the IR scenarios based on terminological indexing, whether mono- or multi-terminological, outperform the median run of each TREC year.

Within a mono-terminological setting, MeSH-based indexing yields better results than any of the other terminologies. This is unsurprising because MEDLINE documents are currently indexed using MeSH terms, each of which represents a subject matter of the document. In particular, we observe that for TREC Genomics 2004 the MAP results of MeSH-based indexing and GO-based indexing are very competitive (0.4412 vs. 0.4408). For


Table 7
Retrieval effectiveness of MaxMatcher and the 8 voting techniques on the TREC Genomics 2004 and TREC Genomics 2005 collections. Submitted runs in TREC are ranked by MAP.

Run        TREC Genomics 2004        TREC Genomics 2005
           MAP       Δ (%)           MAP       Δ (%)
Median     0.2074    —               0.2173    —

Mono-terminology indexing and retrieval
MeSH       0.4412ᵃ   (+112.73)       0.2639    (+21.45)
SNOMED     0.4222ᵃ   (+103.57)       0.2630    (+21.03)
ICD-10     0.4138ᵃ   (+99.52)        0.2592    (+19.28)
GO         0.4408ᵃ   (+112.54)       0.2536    (+16.71)

Multi-terminology indexing and retrieval
CombANZ    0.4435ᵃ   (+113.84)       0.2647    (+20.89)
CombMAX    0.4387ᵃ   (+111.52)       0.2684ᵃ   (+23.52)
CombMED    0.4459ᵃ   (+115.00)       0.2683ᵃ   (+23.47)
CombMIN    0.4440ᵃ   (+114.08)       0.2685ᵃ   (+23.56)
CombMNZ    0.4529ᵃ   (+118.37)       0.2593    (+19.33)
CombRank   0.4407ᵃ   (+112.49)       0.2594    (+19.37)
CombRCP    0.4371ᵃ   (+110.75)       0.2601    (+19.70)
CombSUM    0.4470ᵃ   (+115.53)       0.2601    (+19.70)

ᵃ Significant changes at p ≤ 0.05, 0.01 and 0.001.

Table 8
Comparison of our best run with the official runs participating in the TREC 2004 Genomics track.

Run                         MAP
pllsgen4a2 (the best)       0.4075
uwmtDg04tn (the second)     0.3867
pllsgen4a1 (the third)      0.3689
THUIRgen01 (the fourth)     0.3435
PDTNsmp4 (median)           0.2074
edinauto5 (the worst)       0.0012
CombMNZ (our best run)      0.4529


TREC Genomics 2005, MeSH-based indexing gives similar results to SNOMED-based indexing (0.2639 vs. 0.2630). On both the TREC Genomics 2004 and 2005 collections, MeSH-based indexing yields the best MAP results. For this reason, we choose MeSH-based indexing as the reference against which to compare the results obtained by the voting techniques.

Within a multi-terminological setting, we see that most of the voting techniques lead to a consistent improvement over the median runs. For instance, applying the CombMNZ fusion technique on the TREC Genomics 2004 collection results in an increase of up to +118.37% in MAP over the median run, a better improvement rate than with a mono-terminology (+118.37% vs. +112.73%). The CombMNZ technique takes into account the score of the extracted concept as well as the number of terminologies in which the concept is defined. Therefore, we believe that highly weighted concepts defined in several terminologies tend to cast the most important votes for the document and thus better represent its semantics. For the TREC Genomics 2005 collection, the improvement rates of the voting techniques (ranging from +19.33% to +23.56%) are smaller but still yield a statistically significant increase in MAP. The best MAP values are obtained by different voting techniques such as CombMIN, CombMAX and CombMED, whose MAP results are very competitive.

As shown in Table 7, the paired-sample t-tests computed between the MAP rankings of the median run in TREC Genomics 2004 and each of our runs (e.g., CombMNZ: M = 0.2455, t = 6.8517, df = 49, p = 0.001) show that our multi-terminology based IR approach significantly outperforms the baseline. For TREC Genomics 2005, our indexing approach yields smaller MAP improvements that are nonetheless always statistically significant compared to the TREC Genomics 2005 median run (e.g., CombMIN: M = 0.0513, t = 2.1407, df = 48, p = 0.0374).

We summarize the utility of our multi-terminology indexing and retrieval based on different voting techniques as follows. In general, indexing based on combining several terminologies using the different voting techniques allows a better IR performance in terms of MAP than the median run. The improvement rate over the baseline varies according to the voting technique used. In a mono-terminology setting, the use of concepts extracted from documents can be a relevant source of evidence for improving IR performance. Indeed, extracted concepts are used to expand the document content in order to highlight the most important subject matters in each document (MEDLINE citations in our study). By expanding the document with preferred terms denoting concepts, we can normalize the document content so as to enhance its semantics. In a multi-terminological setting, we aim to highlight the most important concepts, which are defined in several terminologies, using the mappings between each pair of terminologies (defined in the UMLS). The improvement rate of each voting technique depends on the characteristics of each collection and requires extensive training experiments. According to the results obtained on the two collections, the three voting techniques CombANZ, CombMED and CombMIN are stable, since their MAP performance yields a consistent and statistically significant improvement over both the median run and the mono-terminology based IR.

4.5. Comparative evaluation

We further compare the results obtained by our context sensitive IR approach to the results obtained by the participants in the TREC Genomics 2004 and 2005 tracks. Table 8 depicts the comparative results of our best run against the official runs of participants in the TREC 2004 Genomics track. The results show that our multi-terminology based indexing and retrieval method (CombMNZ) outperforms the best run submitted to TREC Genomics 2004 (MAP = 0.4075) with a gain of +11.14%. Our best results on the TREC Genomics 2005 collection rank among the top four automatic best runs in TREC Genomics 2005 (the MAP of the fourth-best run is 0.2580), given that the MAP results of the top three runs are very competitive (see Table 9). Most runs in the TREC Genomics 2004 and 2005 tracks extensively apply various query expansion and pseudo-relevance feedback techniques to their IR models, while our IR approach tries to maximize the likelihood of observing query terms in documents by expanding the document content with key concepts extracted from either a single terminology or multiple terminologies.


Table 9
Comparison of our best run with the official runs participating in the TREC 2005 Genomics track.

Run                         MAP
york05ga1 (the best)        0.2888
ibmadz05us (the second)     0.2883
ibmadz05bs (the third)      0.2859
uwmtEg05 (the fourth)       0.2580
NTUgah1 (median)            0.2173
edinauto5 (the worst)       0.0544
CombMIN (our best run)      0.2685


5. Discussion

To cope with the term mismatch problem in the biomedical domain, IR systems must bring the users' vocabulary closer to the authors' vocabulary. One appropriate solution is to use terminologies as a means of normalizing the query/document vocabulary. For instance, smoothing or expanding document contents with preferred terms denoting concepts can help improve IR effectiveness. Indeed, many research works have repeatedly demonstrated the added value of using the MeSH terms that are manually or semi-automatically assigned to MEDLINE citations [31,3,1]. In a completely automatic setting, we demonstrate that automatic concept extraction, whether based on a mono-terminology or on multiple terminologies, can be an effective way to improve IR performance. As shown in Table 6, within a mono-terminology indexing schema, the combination of the global context DE and the local context QE consistently and significantly improves on the BM25 baseline system. Furthermore, several terminologies can be combined using data fusion techniques to produce a coherent concept list representing the document's subject matter(s). The final concept list, containing concepts extracted from several terminologies, can be regarded as the document's semantic kernel, which is finally used for document expansion. Query expansion then aims at retrieving more relevant documents by expanding the original query with relevant terms extracted either from the original documents or from those expanded with extracted concepts.

Several research works in the general domain have demonstrated that local context QE techniques are quite effective. However, within the biomedical IR domain, such techniques may yield no improvement in terms of MAP, probably because they do not deal with term variation in natural language. For example, as shown in Table 6, although the local context QE gives a substantial improvement in terms of MAP on the TREC Genomics 2004 collection (+7.99%), it gives only a very small improvement on the TREC Genomics 2005 collection (+0.97%) compared to the baseline. In this case, the global context DE combined with the local context QE allows picking up more relevant terms from the expanded documents and thus better improves IR effectiveness. Furthermore, our context-sensitive IR method based on the combination of DE and QE shows stable IR performance, with a significant improvement rate of +8.77% on the former collection and +6.58% on the latter.
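The local context (blind feedback) QE step can be sketched as follows. Simple term-frequency scoring over the top-ranked expanded documents is used here as a stand-in for the paper's actual term-weighting model; the function and variable names are illustrative assumptions.

```python
from collections import Counter

def blind_feedback_terms(ranked_docs, query_terms, k=10, n_terms=5):
    """Select expansion terms from the top-k (expanded) documents of an
    initial retrieval run, excluding terms already in the query.
    Scoring by raw frequency is a simplification of the paper's model."""
    pool = Counter()
    for doc in ranked_docs[:k]:
        pool.update(t for t in doc if t not in query_terms)
    return [term for term, _ in pool.most_common(n_terms)]

# Toy example: terms co-occurring in the top-ranked documents for an
# "iron" query are promoted as expansion candidates.
docs = [["gene", "expression", "ferroportin", "iron", "transport"],
        ["ferroportin", "iron", "homeostasis", "mutation"],
        ["unrelated", "document"]]
expansion = blind_feedback_terms(docs, {"iron"}, k=2, n_terms=3)
```

Because the feedback documents have already been expanded with preferred terms, candidate expansion terms drawn from them include terminology vocabulary, which is what lets DE and QE reinforce each other in the combined approach.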

In a multi-terminology setting, our MAP results consistently and statistically outperform the results obtained by the median of all participants in each TREC year. Since the added value of the multi-terminology IR approach is modest compared to the mono-terminology (MeSH-based) IR approach (the improvement of the best results obtained by the CombMNZ (resp. CombMIN) method compared to QE + DE is +0.49% (resp. +1.74%) on TREC Genomics 2004 (resp. TREC Genomics 2005)), in future work we aim to study the importance of each terminology from which concepts are extracted. For example, GO concepts may be scored higher than those coming from SNOMED.
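The CombMNZ and CombMIN fusion rules of Fox and Shaw [12], used here to merge per-terminology concept lists, can be sketched as follows. The per-terminology scores are invented for illustration, and taking the minimum only over the terminologies that actually extracted a concept is our simplifying assumption; the paper's exact score normalization is not reproduced.

```python
def comb_mnz(score_lists):
    """CombMNZ: sum of a concept's scores multiplied by the number of
    lists (terminologies) in which it appears, rewarding agreement."""
    concepts = set().union(*score_lists)
    fused = {}
    for c in concepts:
        scores = [s[c] for s in score_lists if c in s]
        fused[c] = len(scores) * sum(scores)
    return fused

def comb_min(score_lists):
    """CombMIN: minimum score over the lists that extracted the concept,
    a deliberately conservative fusion rule."""
    concepts = set().union(*score_lists)
    return {c: min(s[c] for s in score_lists if c in s) for c in concepts}

# Toy per-terminology concept scores (illustrative values, not real output).
mesh = {"neoplasms": 0.9, "iron": 0.4}
snomed = {"neoplasms": 0.7, "mutation": 0.5}
fused = comb_mnz([mesh, snomed])  # "neoplasms" is boosted: 2 * (0.9 + 0.7)
```

A concept found by both MeSH and SNOMED thus outranks one found by a single terminology, which is the voting behavior the multi-terminology approach relies on.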

6. Conclusion

In this paper, we have proposed a novel IR method combining the global context DE and the local context QE. The results demonstrate that our IR approach achieves a significant improvement over classical IR. The best results of our conceptual IR approach are significantly superior to the median of official runs in the TREC 2004 & 2005 Genomics Tracks and are comparable to the best runs. In addition, we have also proposed a novel multi-terminology approach to biomedical IR. We argued that concept extraction using multiple terminologies can be regarded as a voting problem taking into account the rank and score of identified concepts. The extracted concepts are used for DE and QE in an attempt to close the semantic gap between the user's query and the documents in the collection. The results demonstrate that our multi-terminology IR approach shows a significant improvement over the median runs submitted in the TREC Genomics 2004 and 2005 tracks.

Our future work aims at incorporating our multi-terminology IR into a semantic model taking into account concept centrality and specificity, which we believe can overcome the limits of bag-of-words based models. We also plan to combine several dictionary-based and statistical concept extraction methods by leveraging the advantages of each. We believe that concepts extracted by several methods would enhance concept extraction accuracy. Finally, it would also be interesting to integrate other biomedical ontologies, such as EFO (Experimental Factor Ontology) and OBI (Ontology for Biomedical Investigations), into our context-sensitive IR approach to annotate various kinds of semantic information in the biomedical literature.

References

[1] S. Abdou, J. Savoy, Searching in Medline: query expansion and manual indexing evaluation, Information Processing & Management 44 (2008) 781–789.

[2] G. Amati, Probabilistic Models for Information Retrieval based on Divergence from Randomness, Ph.D. thesis, Department of Computing Science, University of Glasgow, 2003.

[3] A.R. Aronson, J.G. Mork, C.W. Gay, S.M. Humphrey, W.J. Rogers, The NLM Indexing Initiative's Medical Text Indexer, in: Proceedings of MedInfo, 2004, pp. 268–272.

[4] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 2005.

[5] M. Baziz, M. Boughanem, N. Aussenac-Gilles, C. Chrisment, Semantic cores for representing documents in IR, in: Symposium on Applied Computing, 2005, pp. 1011–1017.

[6] B. Billerbeck, J. Zobel, Document expansion versus query expansion for ad-hoc retrieval, in: A. Turpin, R. Wilkinson (Eds.), Proceedings of the 10th Australasian Document Computing Symposium, Sydney, Australia, 2005, pp. 34–41.

[7] R.A. Cote, Architecture of SNOMED: its contribution to medical language processing, in: Proceedings of the Annual Symposium on Computer Application in Medical Care, 1986, pp. 74–80.

[8] S.J. Darmoni, S. Pereira, S. Sakji, T. Merabti, É. Prieur, M. Joubert, B. Thirion, Multiple terminologies in a health portal: automatic indexing and information retrieval, in: Proceedings of Artificial Intelligence in Medicine, 2009, pp. 255–259.

[9] D. Dinh, L. Tamine, Sense-based biomedical indexing and retrieval, in: International Conference on Applications of Natural Language to Information Systems, Springer-Verlag, Cardiff, UK, 2010, pp. 24–35.

[10] D. Dinh, L. Tamine, Biomedical concept extraction based on combining the content-based and word order similarities, in: Proceedings of the 2011 ACM Symposium on Applied Computing, ACM, New York, NY, USA, 2011, pp. 1159–1163.

[11] D. Dinh, L. Tamine, Combining global and local semantic contexts for improving biomedical information retrieval, in: Proceedings of the 33rd European Conference on Information Retrieval, 2011, pp. 375–386.

[12] E.A. Fox, J.A. Shaw, Combination of multiple searches, in: Proceedings of the Text REtrieval Conference, 1994, pp. 243–252.

[13] K. Frantzi, S. Ananiadou, H. Mima, Automatic recognition of multi-word terms: the C-value/NC-value method, International Journal on Digital Libraries 3 (2000) 115–130.

[14] R. Gaizauskas, K. Humphreys, Using a semantic network for information extraction, Natural Language Engineering 3 (1997) 147–169.

[15] J. Gobeill, P. Ruch, X. Zhou, Query and document expansion with Medical Subject Headings terms at medical ImageCLEF 2008, in: Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 736–743.

[16] W.R. Hersh, R.T. Bhuptiraju, L. Ross, P. Johnson, A.M. Cohen, D.F. Kraemer, TREC 2004 Genomics Track overview, in: Proceedings of the Text REtrieval Conference, 2004.

[17] W.R. Hersh, A.M. Cohen, J. Yang, R.T. Bhupatiraju, P.M. Roberts, M.A. Hearst, TREC 2005 Genomics Track overview, in: Proceedings of the Text REtrieval Conference, 2005.

[18] A. Hliaoutakis, K. Zervanou, E.G.M. Petrakis, The AMTEx approach in the medical document indexing and retrieval application, Data & Knowledge Engineering 68 (2009) 380–392.

[19] M. Krauthammer, A. Rzhetsky, et al., Using BLAST for identifying gene and protein names in journal articles, Gene (2000) 245–252.

[20] D.T.H. Le, J.P. Chevallet, T.B.T. Dong, Thesaurus-based query and document expansion in conceptual indexing with UMLS, in: Proceedings of the Conference on Research, Innovation and Vision for the Future, 2007, pp. 242–246.

[21] Z. Lu, W. Kim, W.J. Wilbur, Evaluation of query expansion using MeSH in PubMed, Information Retrieval 12 (2009) 69–80.

[22] D.I. Moldovan, R. Mihalcea, Using WordNet and lexical operators to improve internet searches, IEEE Internet Computing 4 (2000) 34–43.

[23] A. Névéol, S.E. Shooshan, S.M. Humphrey, J.G. Mork, A.R. Aronson, A recent advance in the automatic indexing of the biomedical literature, Journal of Biomedical Informatics 42 (2009) 814–823.

[24] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, C. Lioma, Terrier: a high performance and scalable information retrieval platform, in: Proceedings of the SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), 2006.

[25] S. Pereira, A. Neveol, G. Kerdelhué, E. Serrot, M. Joubert, S.J. Darmoni, Using multi-terminology indexing for the assignment of MeSH descriptors to health resources in a French online catalogue, in: AMIA Symposium, 2008, pp. 586–590.

[26] S.E. Robertson, S. Walker, M. Hancock-Beaulieu, Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive, in: Proceedings of the Text REtrieval Conference, 1998, pp. 199–210.

[27] J. Rocchio, Relevance feedback in information retrieval, 1971, pp. 313–323.

[28] P. Ruch, Automatic assignment of biomedical categories: toward a generic approach, Bioinformatics 22 (2006) 658–664.

[29] G. Salton, C. Buckley, C.T. Yu, An evaluation of term dependence models in information retrieval, in: Proceedings of the 5th Annual ACM Conference on Research and Development in Information Retrieval, Springer-Verlag New York, Inc., New York, NY, USA, 1982, pp. 151–173.

[30] K. Spärck Jones, Automatic Keyword Classification for Information Retrieval, Butterworth, London, 1971.

[31] P. Srinivasan, Query expansion and MEDLINE, Information Processing & Management 32 (1996) 431–443.

[32] N. Stokes, Y. Li, L. Cavedon, J. Zobel, Exploring criteria for successful query expansion in the genomic domain, Information Retrieval 12 (2009) 17–50.

[33] T. Tao, X. Wang, Q. Mei, C. Zhai, Language model information retrieval with document expansion, in: Proceedings of the Conference of the Association for Computational Linguistics, 2006, pp. 407–414.

[34] D. Trieschnigg, Proof of Concept: Concept-based Biomedical Information Retrieval, Ph.D. thesis, University of Twente, 2010.

[35] E.M. Voorhees, Query expansion using lexical semantic relations, in: Proceedings of the Conference on Research and Development in Information Retrieval, Springer-Verlag New York, Inc., New York, NY, USA, 1994, pp. 61–69.

[36] W. Zhou, C. Yu, N. Smalheiser, V. Torvik, J. Hong, Knowledge-intensive conceptual retrieval and passage extraction of biomedical literature, in: Proceedings of the Conference on Research and Development in Information Retrieval, 2007, pp. 655–662.

[37] X. Zhou, X. Hu, X. Zhang, X. Lin, I.-Y. Song, Context-sensitive semantic smoothing for the language modeling approach to genomic IR, in: Proceedings of the Conference on Research and Development in Information Retrieval, ACM, 2006, pp. 170–177.

[38] X. Zhou, X. Zhang, X. Hu, MaxMatcher: biological concept extraction using approximate dictionary lookup, in: Proceedings of the Pacific Rim International Conference on Artificial Intelligence, 2006, pp. 1145–1149.